Instruction Driven Cross-Layer CNN Accelerator with Winograd Transformation on FPGA


Abstract

In recent years, the Convolutional Neural Network (CNN) has been widely applied in computer vision tasks. FPGAs have been widely explored to accelerate CNNs due to their high performance, high energy efficiency, and flexibility. By fusing multiple layers of a CNN, the intermediate data transfer can be reduced. With a faster algorithm based on the Winograd transformation, the computation of convolution can be accelerated further. However, previous accelerators with cross-layer scheduling or the Winograd algorithm are designed for a particular CNN model: the FPGA has to be reprogrammed to run another CNN model on the hardware. In this work, we design an instruction driven CNN accelerator supporting the Winograd algorithm and cross-layer scheduling. We first modify the cross-layer loop unrolling order to extract basic operations as instructions, and then improve the on-chip memory architecture for a higher utilization rate of the computation units in Winograd computation. We evaluate the hardware architecture and scheduling policy on a Xilinx Virtex-7 690T FPGA platform. As a case study, the intermediate data transfer of the VGG-D CNN model can be reduced by 90% with the cross-layer policy. The performance of our hardware accelerator reaches 1467 GOP/s with 190k logic cells. Experimental results show that, compared with state-of-the-art accelerators on the same platform, our design achieves a 1.5× speedup with the same number of logic cells. The performance can be further improved by 78% if larger Winograd transformation sizes are used.

Index Terms: FPGA, CNN, Winograd, Instruction, Cross-layer

I. INTRODUCTION

The convolutional neural network (CNN) has achieved significant improvement in image recognition and detection and is thus widely applied in computer vision tasks such as object tracking and person re-identification [1]. CNN has even proved its potential in areas other than image processing, such as language processing [2]. However, CNN requires a large computation complexity and a huge storage capacity compared with conventional algorithms. For example, the convolution layers of VGG-E [3] have 20MB of parameters, and a single forward pass of VGG-E needs more than 39G operations. Thus, specific hardware architectures are designed to accelerate CNN based on ASICs [4]-[7] or FPGAs [8]-[10].

The speed of an accelerator is the key measurement of a CNN accelerator design. How fast a CNN accelerator can be is decided by its parallelism, which is proportional to the number of computation units, and by its utilization ratio, which is affected by the memory system. The energy cost of each operation, i.e. the speed divided by the power, is used to measure the efficiency of a CNN accelerator design. The energy cost consists of two parts: memory access energy and computation energy. By optimizing the computation and the data access behavior, both the speed and the energy efficiency of a CNN accelerator can be improved. Besides that, the flexibility of a CNN accelerator is important to support various network structures.

Fast convolution algorithms, such as the Winograd transformation [11], have been used to speed up the operation of CNN. The most commonly used convolution kernels are 3×3, and the Winograd transformation for a 3×3 convolution with a 4×4 transformation tile mathematically saves 56% of the computation (16 instead of 36 multiplications), which means a 2.25× speedup over a straightforward implementation. [12] designs an architecture using Winograd to accelerate CNN.
However, that design is no longer efficient if cross-layer scheduling is applied, because of the long pre-loading time of its memory system (a line buffer).

The large feature maps and the great number of weights of a CNN require a large memory such as off-chip DRAM. Off-chip data transfer contributes most of the memory system energy and sometimes stalls the computation units, as its bandwidth is quite limited. Traditional implementations schedule the network layer by layer. In this way, the result of each layer has to be written back to external memory when the on-chip memory is limited. For example, in the VGG-D model this intermediate data amounts to 18.4MB, and transferring it takes 20% of the computation time [10]. [10] proposes a scheduling method that computes several layers at a time so that the transfer of intermediate data can be largely reduced. However, this design targets a specific network, and how to merge the layers is not fully discussed.

To address the above problems, we propose an instruction driven CNN accelerator with the following contributions:

- We optimize the cross-layer strategy for instruction support and propose a network dividing method to minimize intermediate data transfer.
- We improve the on-chip memory architecture for high-efficiency Winograd computation with cross-layer scheduling.
- We design an accelerator system for CNN supporting instructions, cross-layer scheduling, and Winograd.

Experimental results show that our design is 7× faster than the previous cross-layer FPGA accelerator [10], and our logic cell efficiency is 1.5× higher than the state-of-the-art FPGA CNN accelerator.

The rest of this paper is organized as follows. Section II introduces the background of CNN and the Winograd transformation. Section III introduces previous work. Section IV illustrates the overall mapping flow. Section V proposes a network dividing method to minimize intermediate data transfer. Section VI introduces our loop unrolling strategy to implement cross-layer scheduling based on instructions. Section VII improves the buffer structure and implements our CNN hardware architecture supporting instructions, cross-layer scheduling, and Winograd. Section VIII compares the results with state-of-the-art FPGA accelerators. Section IX concludes the paper.

II. BACKGROUND

A. Basic Operations of CNN

A typical CNN consists of a number of layers running in sequence. The data between different layers is called featuremaps. There are several kinds of layers in a CNN: convolution layers, nonlinear layers, pooling layers, and others.

Convolution layers process 2-D convolutions with trained weights. Convolution layers are usually cascaded to extract high-level features of the input, and they consume most of the computation of a CNN. Nonlinear layers apply a nonlinear activation function to each element of the output featuremaps of convolution layers and increase the fitting ability of the CNN. The Rectified Linear Unit (ReLU) is a widely used nonlinear activation function in CNN. Pooling layers help increase the receptive field of each output element and reduce the computational complexity of the CNN. A pooling layer outputs the max or average value of all elements in an area of the input featuremaps.

B. Introduction of Winograd

Convolution layers consume most of the computation in CNN. There are several methods to reduce the computation complexity of 2-D convolution. Winograd's minimal filtering algorithm is a fast method to compute 2-D convolution, especially when the convolution filters are small, like the 3×3 filters in CNN; it uses a transformation to compute the convolution [11]. For example, to compute a convolution whose image tile is 4×4 and whose kernel is 3×3, the Winograd algorithm uses 16 multiplications, while the standard algorithm of the same size uses 36 multiplications. The Winograd algorithm therefore achieves a 2.25× reduction in the number of multiplications [13]. The Winograd algorithm can be written in matrix form as:

    Y = A^T [ (G g G^T) ⊙ (B^T d B) ] A    (1)

where ⊙ indicates element-wise multiplication. The transformation matrices are:

    B^T = [[ 1,  0, -1,  0],
           [ 0,  1,  1,  0],
           [ 0, -1,  1,  0],
           [ 0,  1,  0, -1]]

    G   = [[  1,    0,    0],
           [ 1/2,  1/2,  1/2],
           [ 1/2, -1/2,  1/2],
           [  0,    0,    1]]

    A^T = [[ 1,  1,  1,  0],
           [ 0,  1, -1, -1]]    (2)

Here d is the 4×4 input tile and g is the 3×3 convolution kernel. The multiplications with these transformation matrix elements can all be converted to shift operations (like 1/2), which is very friendly to hardware.

The larger the Winograd tile is, the higher the acceleration rate of Winograd. However, the transformation matrices of a tile greater than 4×4 contain elements that cannot be simply converted to shift operations (like 1/5 in a 5×5 Winograd tile), so some approximation has to be done. The approximation results in precision degradation of the CNN, especially in detection tasks. We set our Winograd tile size to 4×4 to keep the precision of CNN, with a 2.25× acceleration rate over the original standard convolution.

III. RELATED WORK

A. Architecture Design

GPU platforms are especially suitable for training CNN models, as they are highly optimized for matrix-matrix multiplication. However, in inference tasks the data usually needs to be processed in a non-batch manner, which degrades the speed-up ratio of GPUs. Also, the high cost, the power consumption, and the lack of portability of GPU platforms motivate the research and development of many dedicated CNN accelerators.
These accelerators use different strategies to map the computation to hardware, exploiting different parallelism and data reuse patterns. In general, there are four types of loop optimization techniques in these hardware accelerators [14], exploiting different types of parallelism: kernel spatial, data spatial, input channel, and output channel.

Convolver: Qiu [8] proposes an architecture consisting of hardware convolvers: each PE is a convolver that includes 9 multipliers and an adder tree. Chen [6] divides the convolution into 1-D convolution primitives and uses 1-D convolver PEs. Lu [12] implements an FPGA accelerator with convolvers based on the Winograd transformation and reaches 3 TOP/s performance.

MAC: Multiply-accumulate (MAC) units are widely used in CNN accelerators. This kind of hardware architecture does not impose a restriction on the kernel size. Du [15] proposes an architecture using MACs, which involves inter-PE data flow for data reuse. ENVISION [7] is an architecture that uses parallel MACs and a parallel buffer array and achieves up to 10 TOPS/W.

M×V: Matrix-vector multiplication (M×V) is widely used in software implementations of convolution layers. Han [16], [17] implements an M×V architecture that utilizes sparsity and can be used in a large range of applications like CNN, RNN, and LSTM. Chen [4], [5] and Liu [18] divide the computation of CNN into M×V operations and propose a series of architectures and corresponding instruction sets.

Our implementation utilizes the Winograd convolution algorithm, in which more results are computed with the same number of operations.

B. Scheduling Strategy

Some accelerators assume the on-chip memory is large enough for all the weights and intermediate data of a CNN model, as in the DianNao series [4], [15]. Other designs [6], [8] that utilize off-chip memory discuss scheduling or data arrangement strategies to overcome the bandwidth limitation between the external and the on-chip memory. These works focus on intra-layer scheduling. Recent research has started to focus on multi-layer scheduling to reduce data transfer.

Alwani [10] uses HLS to implement and evaluate a pipelined multi-layer CNN accelerator for the first five convolution layers of VGG-E and shows that the multi-layer scheduling strategy can reduce data transfer by 95%. Li [19] implements a pipelined multi-layer CNN accelerator for AlexNet [20]. In these multi-layer accelerator implementations, the computing modules of all layers are put on the hardware, which consumes computing resources and makes this kind of solution hard to adapt to larger networks or to deploy on smaller platforms. Also, these solutions are not flexible enough to be adapted to new network structures easily, as the modules for each layer and the connections between them are designed explicitly.

The Winograd transformation accelerates CNN at the calculation units, and the cross-layer strategy reduces intermediate data transfer at the scheduling level. We combine these two methods into an instruction driven hardware design, making the accelerator more efficient and flexible.

IV. OVERALL WORK FLOW

Figure 1 illustrates the work flow of our system. We first analyze the target CNN model and decide which layers to fuse. We propose a network dividing method named the TMD method to minimize the intermediate data transfer. After fusing layers into several Layer Blobs, we generate instructions for each Layer Blob under the cross-layer strategy. With the help of cross-layer scheduling, there is no data transfer inside a Layer Blob. We finally execute all of the instructions on the FPGA platform.

Fig. 1. The work flow of our system.

V. NETWORK DIVISION

In this section, we introduce the network division method that minimizes the off-chip data transfer. We name this method the Transfer Minimum Division (TMD) method.

The data volume distribution across layers is extremely nonuniform. Figure 2 shows the input, output, and kernel data size of each layer of the VGG-D network. For example, convolution layer 1 requires 150KB of input and produces 3.4MB of output featuremaps. The output data are used to compute the following layer. Since the on-chip memory of an FPGA is limited, an architecture usually cannot provide such a large on-chip memory for all the output data. Therefore, data transfer occurs: the output data is stored to off-chip memory and then fetched back from off-chip memory for the following computation.

Fig. 2. The input, output and kernel data size of each layer of VGG-D, and the data size of the first Layer Blob across layer 1 and layer 2.

However, we notice that the output of layer 2 in the VGG-D net is only 750KB, which can be stored in the on-chip memory of our target FPGA platform (Xilinx Virtex-7 690T), so we do not keep the whole output of layer 1. We directly compute the output of layer 2 using the partial output of layer 1. In this way, we consider layer 1 and layer 2 as a whole Layer Blob and eliminate the data transfer.

A. Hardware Constraints

As described above, the weights of a Layer Blob are the weights of all layers in the blob. Layers can be merged into a Layer Blob (whose begin layer id is i and end layer id is j) when all the weights of these layers can be stored in the on-chip weight buffer, as in Equation (3). The volume of weights of the k-th layer is weight_k and the size of the on-chip weight buffer is Buffer_weight.

    Σ_{i ≤ k ≤ j} weight_k ≤ Buffer_weight    (3)

There is no intermediate featuremap transfer inside a Layer Blob.
The data volume to be transferred at the border of each Layer Blob is decided by the size of its output featuremap. The Blob transfers no data when the output featuremap (Data_out) can be stored on-chip, and twice the output featuremap when it cannot, because data transferred to off-chip memory has to be fetched back on-chip for the following computation.

The data transfer of the Layer Blob ending with layer i is denoted BT_i:

    BT_i = 0                  if Data_out,i ≤ Buffer_data
    BT_i = 2 × Data_out,i     if Data_out,i > Buffer_data    (4)

B. Dividing Method

We define the total data transfer DT(M) under a division method M as the sum of the data transfer of each Layer Blob:

    DT(M) = Σ_{i ∈ blobs} BT_i    (5)

Our goal is to find a solution M* under the hardware constraint governed by Equation (3), such that:

    DT(M*) = min_M DT(M)    (6)

We naturally convert the CNN model to a directed acyclic graph (DAG). In the DAG, a node represents the output of a certain layer. Each node has a weight, which here means the amount of data transfer (BT_i) given in Equation (4). We add a property lastnode to each node. lastnode indicates whether the node is the end layer of a Layer Blob, and if it is, lastnode also indicates the beginning layer of the same Layer Blob. The lastnode values of all nodes form the dividing method M of a CNN network.

We can find a collection of subproblems. For the i-th node, define the subproblem P(i) as: find the optimal division solution for the graph before the i-th node. It can easily be proved that the problem has optimal substructure, so it can be solved with dynamic programming. To analyze the problem, we define:

    H(i) = {x | x ≤ i}    (7)

H(i) is the set of nodes before the i-th node, and P(i) is the minimum total transfer over all divisions M of H(i):

    P(i) = min_{M over H(i)} DT(M)    (8)

It is obvious that for the last node n we have:

    P(n) = DT(M*)    (9)

So our goal is to find P(n) and the corresponding M*. The Bellman equation (the basic recurrence of dynamic programming [21]) for the subproblem P(i) can be written as:

    P(i) = min_j ( P(j) + BT_j )    (10)

where j is a node before i, and i, j satisfy the hardware constraint in Equation (3). The extra data transfer BT_j is added to the total data transfer at node j. With the given optimal substructure and Bellman equation, we use dynamic programming to find the target P(n): for each node in the network, the optimal division solution is one of the optimal division solutions for an earlier node j plus the transfer cost at node j. The process is described in Algorithm 1.

Algorithm 1 TMD: Transfer Minimum Division method
1: bt[i] is BT_i in Equation (4); lastnode gives a division method M in Equation (5).
2: P[0] = 0
3: lastnode[0] = 0
4: for i = 1 : n-1 do
5:     P[i] = +∞
6:     for j = 0 : i-1 do
7:         if Check(i, j) then
8:             if P[j] + bt[j] < P[i] then
9:                 P[i] = P[j] + bt[j]
10:                lastnode[i] = j
11: Layer_Blob_List = divide(lastnode)

The Check() function checks the satisfiability of Equation (3). At each iteration of the outer loop, a new layer is added to the search range. Repeating these steps until the last node of the network yields the optimal division solution. The network division method gives the minimum data transfer under the constraint of the on-chip memory and produces the division of layers into Layer Blobs.
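For reference, the following is a minimal Python sketch of the TMD method for a chain-structured network; it only illustrates Algorithm 1 and is not the production compiler code. It assumes weight[k] and data_out[k] are the weight and output-featuremap volumes of layer k (with data_out[0] the network input), and buffer_weight / buffer_data are the on-chip buffer sizes.

def tmd_divide(n, weight, data_out, buffer_weight, buffer_data):
    # weight[k]  : weight volume of layer k            (k = 1..n)
    # data_out[k]: output featuremap volume of layer k (data_out[0] is the network input)

    def bt(i):
        # Equation (4): border transfer when a Layer Blob boundary sits at node i.
        # The input and final output of the whole CNN are not counted as intermediate data.
        if i == 0 or i == n:
            return 0
        return 0 if data_out[i] <= buffer_data else 2 * data_out[i]

    def check(j, i):
        # Equation (3): the weights of layers j+1..i fit in the on-chip weight buffer.
        return sum(weight[k] for k in range(j + 1, i + 1)) <= buffer_weight

    INF = float("inf")
    P = [INF] * (n + 1)        # P[i]: minimum transfer for the sub-network up to layer i
    lastnode = [0] * (n + 1)   # lastnode[i]: node where the blob ending at layer i begins
    P[0] = 0
    for i in range(1, n + 1):
        for j in range(i):     # candidate Layer Blob: layers j+1 .. i
            if check(j, i) and P[j] + bt(j) < P[i]:
                P[i] = P[j] + bt(j)
                lastnode[i] = j

    blobs, end = [], n
    while end > 0:             # walk lastnode[] backwards to recover the Layer Blob list
        blobs.append((lastnode[end] + 1, end))
        end = lastnode[end]
    return P[n], list(reversed(blobs))

Under the 2MB data and weight buffers used in our FPGA evaluation, this division eliminates the intermediate featuremap transfer of VGG-D, consistent with the results reported in Section VIII.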
VI. CROSS-LAYER WITH INSTRUCTIONS

A. Cross-layer Strategy

Previous work [10] explores the dimension of the CNN accelerator design space that focuses on the data flow across convolution layers, and reduces the intermediate data transfer by 95%. The scheduling flow of the fused layers in [10] is illustrated in Algorithm 2.

Algorithm 2 Conventional convolution layer computation in [10]
1: N_i and N_o : the number of input and output channels
2: for (row = 0; row < H; row += S) do
3:     for (col = 0; col < W; col += S) do
4:         for (layer = 0; layer < L; layer++) do
5:             for (fi = 0; fi < N_i; fi++) do
6:                 for (fo = 0; fo < N_o; fo++) do
7:                     for (i = 0; i < k_x; i++) do
8:                         for (j = 0; j < k_y; j++) do
9:                             out[row][col][fo] +=
10:                                weight[fi][fo][i][j] *
11:                                in[row+i][col+j][fi]

For a net with L convolution layers, a conventional computation algorithm is shown in Algorithm 2: the N_i input featuremaps of size W × H are convolved with k_x × k_y weights to get the N_o output featuremaps. [10] modifies the traditional loop unrolling order by putting the loop over layers inside the loops over row and column (line 4). Because the computation structure of each layer differs from the others, it designs different computation units and data reuse units for different layers.

The FPGA accelerator in [10] can therefore only run a specific network and needs to be reprogrammed to run another network.

We modify the loop unrolling strategy in [10] to make the cross-layer scheduling policy more flexible. We propose a line-based cross-layer scheduling strategy, shown in Algorithm 3.

Algorithm 3 Convolution layer computation with Winograd
1: for (row = 0; row < H; row += 2) do
2:     for (layer = 0; layer < L; layer++) do
3:         for (fo = 0; fo < N_o; fo++) do
4:             for (fi = 0; fi < N_i; fi++) do
5:                 for (col = 0; col < W; col += S) do
6:                     out[row][col][fo] +=
7:                         Winograd(in, weight)

The row is regarded as the outermost loop variable (line 1). Each time, the computation window moves down by two lines and performs the computation through the whole Layer Blob. Furthermore, the convolution operations are done by the fast Winograd computation kernel.

Figure 3 shows our cross-layer scheduling strategy with an example that treats two convolution layers as a whole. The size of the input featuremap of the first layer is 7×7, and the size of the convolution kernels of both layers is 3×3. Layer 1 takes lines 1-3 of featuremap 1 as its input and convolves all of its convolution kernels across these three lines, producing line 1 of featuremap 2. After line 1 of featuremap 2 has been produced, line 1 of featuremap 1 can be discarded. Then layer 1 operates on lines 2-4 of featuremap 1 to produce line 2 of featuremap 2. In the same way, lines 3 and 4 of featuremap 2 can be calculated. As soon as lines 1-3 of featuremap 2 have been produced, layer 2 operates on these three lines to produce line 1 of featuremap 3. After line 1 of featuremap 3 has been produced, line 1 of featuremap 2 can be discarded. In general, a line of a featuremap can be discarded as soon as a new line of the next featuremap has been computed.

Fig. 3. Cross-layer scheduling example of two layers.

After finishing the computation of line 1 of featuremap 3, there is no need to compute three new lines of featuremap 2 to prepare the data for line 2 of featuremap 3: the computation of lines 1 and 2 of featuremap 3 shares lines 2 and 3 of featuremap 2 as input, so we only compute one new line (line 4) of featuremap 2. For the same reason, we only need one new line in featuremap 1. In conclusion, to compute a new line of the output featuremap, only one new line is needed in the input featuremap. The new line of each input and intermediate featuremap can be stored in the place of the line that was just discarded. With this strategy, there is no need to store an entire featuremap on chip; we only need to store K lines of each featuremap on chip, where K is the size of the convolution kernel (K = 3 in this example).

The convolution layers transfer no intermediate featuremaps under the cross-layer strategy. We name such a group of layers a Layer Blob. It should be noted that we assume all the kernel weights of the convolution layers can be stored on chip. If the kernel weights are too large, the cross-layer strategy loses its efficiency due to weight transfer, and in this case we compute the CNN layer by layer in sequence.

B. Instructions

An instruction executes a basic operation of the hardware. As described in Section VI-A, the basic operation of our cross-layer scheduling policy is the calculation of an output line of a layer, i.e. a calculation instruction finishes lines 5-7 in Algorithm 3, and the number of columns W is given in the instruction.
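For illustration, the sketch below shows how such calculation instructions could be laid out for one Layer Blob, following the loop order of Algorithm 3. The fields shown (layer id, output row, channel indices, row width) are a simplified, hypothetical encoding; the real instructions additionally carry the on-chip buffer addresses described in Section VII, and channels are processed in groups of P_i × P_o rather than one at a time.

from collections import namedtuple

# Hypothetical, simplified calculation-instruction fields.
CalcInstr = namedtuple("CalcInstr", "layer out_row fo fi width")

def emit_blob_instructions(layers, out_height):
    # layers: one dict per layer of the blob, with keys 'n_in', 'n_out' and 'width'.
    instrs = []
    for row in range(0, out_height, 2):       # the 4x4 Winograd tile yields 2 output rows
        for lid, layer in enumerate(layers):  # walk through every layer of the Layer Blob
            for fo in range(layer["n_out"]):
                for fi in range(layer["n_in"]):
                    # one instruction covers the column loop (lines 5-7 of Algorithm 3):
                    # accumulate one output line of channel fo from input channel fi
                    instrs.append(CalcInstr(lid, row, fo, fi, layer["width"]))
    return instrs

In the real schedule the rows of the different layers in a blob advance in the staggered fashion of Figure 3; the sketch only conveys the loop order that the instruction stream follows.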
We implement a compiler that generates the instructions of the cross-layer schedule for each Layer Blob. Besides calculation instructions, we also design instructions for data transfer. Data transfer instructions and calculation instructions can be executed in parallel. The addresses carried in calculation and data transfer instructions allocate the on-chip memory to the featuremaps and weights of the different CNN layers.

VII. HARDWARE DESIGN

In this section, we propose a hardware structure that combines Winograd and cross-layer scheduling.

A. Architecture Overview

Figure 4 presents the overview of our architecture supporting Winograd and cross-layer scheduling on FPGA.

Fig. 4. Architecture overview.

To store input data and output data in the same buffer, the hardware parallelism of input channels and output channels should be the same. An input register file is designed to shuffle data in case of zero-padding. Weights are fetched from the Weight Buffer in 3×3 tiles. The PEs compute eq. (1) with the input tile and the weights. We also design an output register to buffer data in case of pooling. In our design, there are several Buffer Pools and Weight Buffers for featuremaps and weights in different channels. For easy illustration, we assume there is only one Buffer Pool and one Weight Buffer, and that the hardware parallelism is 1×1.

B. Data Buffer Design

Previous work usually adopts a line buffer structure for data reuse. Figure 5 shows the basic structure of a line buffer. Initially, the line buffer reads the first M (= 2 in this figure) lines of input data. The PE stays idle until these M lines of input data are fully read; the first line of the result is then computed while the (M+1)-th line of input data is being read. Under the cross-layer strategy of Section VI, the basic operation is to calculate one line of an output featuremap. After finishing a line of the output featuremap, the next operation may start on a new featuremap, and the PE would return to idle waiting for the M lines of this featuremap. In cross-layer scheduling, the redundancy of reading the first M lines would keep the PE idle most of the time, resulting in low PE utilization. However, a line buffer is suitable for the weight buffer, because the lines of weights are much shorter, and the redundancy of the two lines of weights can be hidden by pre-loading and registering the weights.

Fig. 5. Line buffer structure. The PE stays idle while loading the first M lines, which is not suitable for cross-layer scheduling.

In order to support cross-layer scheduling, the input data buffer and the output data buffer should be merged into the same buffer, called the Buffer Pool in our implementation. Each Buffer Pool consists of 4 RAMs, and each RAM provides 4 input data per clock cycle. The read addresses of the RAMs are given in the instruction, and the instruction also provides the write-back addresses of the RAMs to store the results of the convolution layer.

C. Winograd PE Design

Figure 6 shows the structure of the PEs. The calculation of a convolution layer can be divided into four stages. The first stage is the transformation of the input tile and the weights. Note that the transformation of the input data and of the weights can each be divided into two constant matrix multiplications; the second multiplication of the input tile is mapped onto DSPs. The second stage is the element-wise multiplication of the transformed input tile and weights, which we also perform with DSPs. The third stage is the transformation of the result. The fourth stage is to accumulate the output tiles from the different input channels.

Fig. 6. PE design.

We notice that DSPs can do more than element-wise multiplication. In our design, we set the DSPs in the (A + B) × D mode, which can compute the additions of the B^T d B transformation together with the element-wise multiplication, making full use of the DSPs. The hardware also parallelizes the loops over input channels and output channels. We define the unroll factors of the input channel and output channel as P_i and P_o, so there are P_i × P_o PEs in total in the hardware design. Moreover, in order to store input data and output data in the same buffer, P_i is the same as P_o. Each PE consists of 16 DSPs. Therefore, P_o and P_i are constrained by the number of DSPs on the FPGA platform as in eq. (11).
The total number of DSPs on the platform is denoted DSP:

    P_i × P_o × 16 ≤ DSP    (11)
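To make the per-tile computation of the PE concrete, the following numpy sketch runs one 4x4 input tile and one 3x3 kernel through eq. (1), mirroring the first three PE stages described above. The hardware uses 8-bit fixed-point data and maps these steps onto DSPs; the sketch uses floating point purely for illustration.

import numpy as np

# F(2x2, 3x3) Winograd transformation matrices from eq. (2).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G   = np.array([[1.0,  0.0, 0.0],
                [0.5,  0.5, 0.5],
                [0.5, -0.5, 0.5],
                [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_tile(d, g):
    # d: 4x4 input tile, g: 3x3 kernel; returns the 2x2 output tile of eq. (1).
    # Stage 4 of the PE (accumulation over input channels) is not shown here.
    U = G @ g @ G.T          # stage 1a: weight transform (G g G^T)
    V = B_T @ d @ B_T.T      # stage 1b: input-tile transform (B^T d B)
    M = U * V                # stage 2 : element-wise multiplication (16 multiplies)
    return A_T @ M @ A_T.T   # stage 3 : output transform (A^T M A)

# Sanity check against direct 3x3 convolution on the same tile.
d = np.arange(16, dtype=float).reshape(4, 4)
g = np.arange(9, dtype=float).reshape(3, 3)
direct = np.array([[np.sum(d[r:r + 3, c:c + 3] * g) for c in range(2)] for r in range(2)])
assert np.allclose(winograd_tile(d, g), direct)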

VIII. EVALUATION

A. Network Division Evaluation

We evaluate our network division strategy under different on-chip memory conditions for intermediate data and weights. The VGG-D network (illustrated in Figure 2) is chosen as the benchmark. The amount of data that has to be transferred with and without network division (TMD) is shown in Table I.

TABLE I
INTERMEDIATE DATA TRANSFER UNDER DIFFERENT ON-CHIP MEMORY CONDITIONS WITH AND WITHOUT TMD

On-chip memory condition (1) | Strategy | Transfer data (MB) | Transfer reduced (MB) | Transfer reduced (%)
data=2MB,   weight=2MB       | Plain    | 6.60               |                       |
                             | TMD (2)  |                    |                       |
data=1.2MB, weight=1.2MB     | Plain    | 9.81               |                       |
                             | TMD      |                    |                       |
data=1.2MB, weight=0.6MB     | Plain    | 9.81               |                       |
                             | TMD      |                    |                       |
data=0.6MB, weight=1.2MB     | Plain    |                    |                       |
                             | TMD      |                    |                       |
data=0.6MB, weight=0.6MB     | Plain    |                    |                       |
                             | TMD      |                    |                       |
(1) The on-chip memory condition gives the data and weight buffer sizes on chip.
(2) Intermediate data transfer is eliminated except for the input and output of the whole CNN.

We can conclude that the size of the data buffer decides the data transfer when network division is not used. With the help of a larger data buffer, the data transfer can be reduced by computing a Layer Blob across more layers. In the FPGA evaluation, the size of the data buffer and of the weight buffer is 2MB, so intermediate data transfer is eliminated in the following FPGA evaluation. We remind the reader that our cross-layer scheduling and network dividing method eliminate the transfer of intermediate featuremaps; the transfer of the convolution layer weights cannot be reduced.

B. FPGA Experiment Setup

We evaluate our hardware architecture on a Xilinx Virtex-7 XC7V690T FPGA platform. The platform provides 3600 DSPs, and its external memory consists of two 1GB DDR3 chips. According to eq. (11), P_i = P_o ≤ 15. To make full use of the two DDR channels, we set P_i and P_o to 8 for one CNN accelerator and implement two CNN accelerators on the FPGA chip; in this way, the batch size of our hardware is 2. We use a PCIe interface to initialize the DDRs on the board and to control the CNN accelerators. The system operates at a frequency of 200 MHz. The whole system costs 250k logic cells, of which the interface control units (PCIe and MIG) consume 60k. [8], [12] use Xilinx Zynq SoCs and do not need to implement PCIe and MIG control units on the FPGA fabric, and [10] simulates its core system without peripherals, so we compare our work to previous work excluding these peripheral control units.

It is believed that 8-bit fixed-point numbers are enough to quantize the featuremaps and weights of CNN [8], so we use 8 bits for data storage and computation. It should be noted that this work concentrates on accelerating the convolution layers of CNN: the results of our evaluation are based on the convolution layers of different CNN models, and the other operations, like the fully-connected (FC) layers, are computed on the CPU.

C. Cross-layer vs. Plain Scheduling

We evaluate our CNN accelerator on VGG-A, VGG-D and a detection model, YOLO [22]. We generate instructions for these models to run the different CNN networks on the same hardware platform. Table II shows the performance comparison between the scheduling policy with and without cross-layer optimization.

TABLE II
PERFORMANCE COMPARISON BETWEEN STRATEGIES WITH AND WITHOUT TMD

                                 | YOLO (1)  | VGG-A (Plain / TMD) | VGG-D (Plain / TMD)
Problem Complexity (GOP)         |           |                     |
Weights Transfer (MB)            |           |                     |
Intermediate Data Transfer (MB)  |           |                     |
Total Transfer (MB)              |           |                     |
Total Time (ms)                  |           |                     |
Speed up                         | -         | 12%                 | 7%
(1) All intermediate data of any YOLO layer can be stored on board, so the schedules with and without cross-layer are equal.
As shown in the results, around 12% acceleration is achieved by adopting cross-layer scheduling on the small model VGG-A, and 7% on the larger model VGG-D. It should be noticed that when the network grows large, the weights of the convolution layers overwhelm the intermediate data, so the cross-layer speed-up ratio of a larger model (VGG-D) is lower than that of a smaller model (VGG-A). Moreover, when the intermediate data of every single layer is small enough to stay on chip, our cross-layer strategy is equal to the plain strategy.

D. Comparison With Previous Work

The performance comparison with other FPGA work is shown in Table III. Our work achieves state-of-the-art performance on FPGA. Moreover, our design is an instruction driven accelerator, providing flexibility for different CNN models. We improve the performance of the convolution layers from 566 GOP/s to 1467 GOP/s. It should be noticed that the Winograd kernel of our design is 4×4 and accelerates convolution by 2.25×, while the Winograd kernel of [12] is 6×6 and accelerates convolution by 4×. As described in Section II, we choose the 4×4 Winograd kernel to guarantee the precision of CNN models.
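As a rough cross-check of the measured throughput, a back-of-envelope peak estimate can be derived from the configuration above; the numbers below are our own estimate under those stated assumptions, not figures reported by the cited works.

# Effective peak throughput estimate under the stated configuration (a sketch).
accelerators  = 2           # two accelerator instances, one per DDR3 channel
pes           = 8 * 8       # P_i x P_o processing elements per accelerator
mults_per_pe  = 16          # DSP multipliers per PE, one 4x4 Winograd tile per pass
ops_per_mult  = 2           # each multiply pairs with an accumulation
winograd_gain = 36 / 16     # 2.25x: 16 multiplies replace the 36 MACs of a 2x2 output tile
freq_ghz      = 0.2         # 200 MHz

peak_gops = accelerators * pes * mults_per_pe * ops_per_mult * winograd_gain * freq_ghz
print(round(peak_gops))     # about 1843 GOP/s; the measured 1467 GOP/s is roughly 80% of it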

TABLE III
PERFORMANCE COMPARISON WITH STATE-OF-THE-ART FPGA ACCELERATORS

                                   | [8]           | [9]             | [19]            | [10]             | [12]         | Ours
Platform                           | Zynq XC7Z045  | Virtex-7 VX485T | Virtex-7 VX690T | Virtex-7 VX690T  | MPSoC ZCU102 | Virtex-7 VX690T
Clock (MHz)                        |               |                 |                 |                  |              | 200
Data Format                        | 16-bit fixed  | 32-bit float    | 16-bit fixed    | 32-bit float     | 16-bit fixed | 8-bit fixed
Network                            | VGG-D         | AlexNet         | AlexNet         | VGG-E layers 1-5 | VGG-D        | VGG-D
Problem Complexity (GOP)           |               |                 |                 |                  |              |
Batch                              |               |                 |                 |                  |              | 2
Used DSPs                          |               |                 |                 |                  |              |
Logic Cells (K)                    |               |                 |                 |                  |              | 190
Performance (GOP/s)                |               |                 |                 |                  |              | 1467
DSP Efficiency (GOP/s/DSP)         |               |                 |                 |                  |              |
Logic Cell Efficiency (GOP/s/cell) |               |                 |                 |                  |              |
(1) The frequency is estimated to calculate the performance.

We can also use a larger Winograd kernel size, like 6×6, for tasks requiring less precision. If we used a 6×6 Winograd kernel in our design, we could achieve 78% more performance with the same number of DSPs and reach the same DSP efficiency as the state-of-the-art work [12].

IX. CONCLUSION

In this paper, we propose an instruction driven CNN accelerator on FPGA with the optimization of cross-layer scheduling and a Winograd computing unit. A Transfer Minimum Division (TMD) method is presented to minimize the data transfer of general networks. We also design a compiler with cross-layer scheduling that maps CNNs to the accelerator with high flexibility. We then propose a hardware architecture that supports the Winograd computation and the instruction set on a Xilinx Virtex-7 FPGA, achieving 1.5× the logic cell efficiency of state-of-the-art designs.

REFERENCES

[1] N. McLaughlin, J. Martinez del Rincon, and P. Miller, "Recurrent convolutional network for video-based person re-identification."
[2] S. Zhang, C. Liu, H. Jiang, S. Wei, L. Dai, and Y. Hu, "Feedforward sequential memory networks: A new structure to learn long-term dependency," arXiv preprint.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint.
[4] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," vol. 49, ACM.
[5] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., "DaDianNao: A machine-learning supercomputer," IEEE Computer Society.
[6] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1.
[7] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI," in ISSCC.
[8] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, "Going deeper with embedded FPGA platform for convolutional neural network," in FPGA '16, ACM.
[9] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," ACM.
[10] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in MICRO.
[11] S. Winograd, Arithmetic Complexity of Computations, vol. 33, SIAM.
[12] L. Lu, Y. Liang, Q. Xiao, and S. Yan, "Evaluating fast algorithms for convolutional neural networks on FPGAs," in FCCM, IEEE.
[13] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks."
[14] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in FPGA '17, ACM.
[15] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," vol. 43, ACM.
[16] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," IEEE Press.
[17] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, "ESE: Efficient speech recognition engine with sparse LSTM on FPGA," in FPGA '17, ACM.
[18] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, "Cambricon: An instruction set architecture for neural networks," IEEE Press.
[19] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," IEEE.
[20] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s mobile coprocessor for deep neural networks."
[21] R. Bellman, "Dynamic programming and Lagrange multipliers," Proceedings of the National Academy of Sciences, vol. 42.
[22] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," arXiv preprint, 2016.


More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Content-Based Image Recovery

Content-Based Image Recovery Content-Based Image Recovery Hong-Yu Zhou and Jianxin Wu National Key Laboratory for Novel Software Technology Nanjing University, China zhouhy@lamda.nju.edu.cn wujx2001@nju.edu.cn Abstract. We propose

More information

Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang

Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang Profiling the Performance of Binarized Neural Networks Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang 1 Outline Project Significance Prior Work Research Objectives Hypotheses Testing Framework

More information

Elastic Processor for Deep Learning

Elastic Processor for Deep Learning INSTITUTE OF COMPUTING TECHNOLOGY 中科寒武纪 Elastic Processor for Deep Learning Zhiwei Xu Institute of Computing Technology (ICT) Chinese Academy of Sciences (CAS) http://novel.ict.ac.cn/zxu/ zxu@ict.ac.cn

More information

Shortcut Mining: Exploiting Cross-layer Shortcut Reuse in DCNN Accelerators

Shortcut Mining: Exploiting Cross-layer Shortcut Reuse in DCNN Accelerators Shortcut Mining: Exploiting Cross-layer Shortcut Reuse in DCNN Accelerators Arash Azizimazreah, and Lizhong Chen School of Electrical Engineering and Computer Science Oregon State University, USA {azizimaa,

More information

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong TABLE I CLASSIFICATION ACCURACY OF DIFFERENT PRE-TRAINED MODELS ON THE TEST DATA

More information

arxiv: v2 [cs.cv] 3 May 2016

arxiv: v2 [cs.cv] 3 May 2016 EIE: Efficient Inference Engine on Compressed Deep Neural Network Song Han Xingyu Liu Huizi Mao Jing Pu Ardavan Pedram Mark A. Horowitz William J. Dally Stanford University, NVIDIA {songhan,xyl,huizi,jingpu,perdavan,horowitz,dally}@stanford.edu

More information

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung

More information

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Hengyu Zhao, Colin Weinshenker*, Mohamed Ibrahim*, Adwait Jog*, Jishen Zhao University of California, Santa Cruz, *The College of William

More information

arxiv: v2 [cs.lg] 16 Nov 2018

arxiv: v2 [cs.lg] 16 Nov 2018 MINI-BATCH SERIALIZATION: CNN TRAINING WITH INTER-LAYER DATA REUSE Sangkug Lym 1 Armand Behroozi 2 Wei Wen 3 Ge Li 1 Yongkee Kwon 1 Mattan Erez 1 arxiv:181.37v2 [cs.lg] 16 Nov 218 ABSTRACT Training convolutional

More information

Model Compression. Girish Varma IIIT Hyderabad

Model Compression. Girish Varma IIIT Hyderabad Model Compression Girish Varma IIIT Hyderabad http://bit.ly/2tpy1wu Big Huge Neural Network! AlexNet - 60 Million Parameters = 240 MB & the Humble Mobile Phone 1 GB RAM 1/2 Billion FLOPs NOT SO BAD! But

More information

Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer

Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer Escher: A CNN Accelerator with Flexible Buffering to Minimize ff-chip Transfer Yongming Shen Stony Brook University yoshen@cs.stonybrook.edu Michael Ferdman Stony Brook University mferdman@cs.stonybrook.edu

More information

A Configurable Architecture for Sparse LU Decomposition on Matrices with Arbitrary Patterns

A Configurable Architecture for Sparse LU Decomposition on Matrices with Arbitrary Patterns A Configurable Architecture for Sparse LU Decomposition on Matrices with Arbitrary Patterns Xinying Wang, Phillip H. Jones and Joseph Zambreno Department of Electrical and Computer Engineering Iowa State

More information

Arm s First-Generation Machine Learning Processor

Arm s First-Generation Machine Learning Processor Arm s First-Generation Machine Learning Processor Ian Bratt 2018 Arm Limited Introducing the Arm Machine Learning (ML) Processor Optimized ground-up architecture for machine learning processing Massive

More information

DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses

DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough 1,2 S. K. Lee 2, N. Mulholland 2, P. Hansen 2, S. Kodali 3, D. Brooks 2, G.-Y. Wei 2 1 ARM Research, Boston,

More information

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN Xiaoying Li 1 Fuming Sun 2 Enhua Wu 1, 3 1 University of Macau, Macao, China 2 University of Science and Technology Beijing, Beijing, China

More information

Brainchip OCTOBER

Brainchip OCTOBER Brainchip OCTOBER 2017 1 Agenda Neuromorphic computing background Akida Neuromorphic System-on-Chip (NSoC) Brainchip OCTOBER 2017 2 Neuromorphic Computing Background Brainchip OCTOBER 2017 3 A Brief History

More information

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and

More information

FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review

FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review Date of publication 2018 00, 0000, date of current version 2018 00, 0000. Digital Object Identifier 10.1109/ACCESS.2018.2890150.DOI arxiv:1901.00121v1 [cs.ne] 1 Jan 2019 FPGA-based Accelerators of Deep

More information

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup Yan Sun and Min Sik Kim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington

More information

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),

More information

A Lightweight YOLOv2:

A Lightweight YOLOv2: FPGA2018 @Monterey A Lightweight YOLOv2: A Binarized CNN with a Parallel Support Vector Regression for an FPGA Hiroki Nakahara, Haruyoshi Yonekawa, Tomoya Fujii, Shimpei Sato Tokyo Institute of Technology,

More information

High Performance Computing

High Performance Computing High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason

More information

Speculations about Computer Architecture in Next Three Years. Jan. 20, 2018

Speculations about Computer Architecture in Next Three Years. Jan. 20, 2018 Speculations about Computer Architecture in Next Three Years shuchang.zhou@gmail.com Jan. 20, 2018 About me https://zsc.github.io/ Source-to-source transformation Cache simulation Compiler Optimization

More information

A flexible memory shuffling unit for image processing accelerators

A flexible memory shuffling unit for image processing accelerators Eindhoven University of Technology MASTER A flexible memory shuffling unit for image processing accelerators Xie, R.Z. Award date: 2013 Disclaimer This document contains a student thesis (bachelor's or

More information

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018 Adaptable Computing The Future of FPGA Acceleration Dan Gibbons, VP Software Development June 6, 2018 Adaptable Accelerated Computing Page 2 Three Big Trends The Evolution of Computing Trend to Heterogeneous

More information

Versal: AI Engine & Programming Environment

Versal: AI Engine & Programming Environment Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY

More information

An Adaptable Deep Learning Accelerator Unit (DLAU) for FPGA

An Adaptable Deep Learning Accelerator Unit (DLAU) for FPGA An Adaptable Deep Learning Accelerator Unit (DLAU) for FPGA N. Sireesha 1 & P.Malleswari 2 1PG Scholar, Dept of ECE, Narsaraopeta Institute of Technology, Yellamanda, Narsaraopeta, Guntur district, Andhra

More information

Hardware Acceleration for Machine Learning

Hardware Acceleration for Machine Learning 2017 IEEE Computer Society Annual Symposium on LSI Hardware Acceleration for Machine Learning Ruizhe Zhao, Wayne Luk, Xinyu iu Department of Computing, Imperial College London, United Kingdom Huifeng Shi,

More information

Hello Edge: Keyword Spotting on Microcontrollers

Hello Edge: Keyword Spotting on Microcontrollers Hello Edge: Keyword Spotting on Microcontrollers Yundong Zhang, Naveen Suda, Liangzhen Lai and Vikas Chandra ARM Research, Stanford University arxiv.org, 2017 Presented by Mohammad Mofrad University of

More information

Understanding Sources of Inefficiency in General-Purpose Chips

Understanding Sources of Inefficiency in General-Purpose Chips Understanding Sources of Inefficiency in General-Purpose Chips Rehan Hameed Wajahat Qadeer Megan Wachs Omid Azizi Alex Solomatnikov Benjamin Lee Stephen Richardson Christos Kozyrakis Mark Horowitz GP Processors

More information

High-End Computing Trends Speed Scalability Efficiency

High-End Computing Trends Speed Scalability Efficiency INSTITUTE OF COMPUTING TECHNOLOGY High-End Computing Trends Speed Scalability Efficiency Zhiwei Xu Institute of Computing Technology (ICT) Chinese Academy of Sciences http://novel.ict.ac.cn/zxu/ zxu@ict.ac.cn

More information

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and

More information

A Communication-Centric Approach for Designing Flexible DNN Accelerators

A Communication-Centric Approach for Designing Flexible DNN Accelerators THEME ARTICLE: Hardware Acceleration A Communication-Centric Approach for Designing Flexible DNN Accelerators Hyoukjun Kwon, High computational demands of deep neural networks Ananda Samajdar, and (DNNs)

More information