Instruction Driven Cross-Layer CNN Accelerator with Winograd Transformation on FPGA


Abstract

In recent years, the Convolutional Neural Network (CNN) has been widely applied in computer vision tasks. FPGAs have been widely explored to accelerate CNNs due to their high performance, high energy efficiency, and flexibility. By fusing multiple layers of a CNN, the intermediate data transfer can be reduced. With a faster algorithm based on the Winograd transformation, the computation of convolution can be accelerated further. However, previous accelerators with cross-layer scheduling or the Winograd algorithm are designed for a particular CNN model: the FPGA has to be reprogrammed to run another CNN model on the hardware. In this work, we design an instruction driven CNN accelerator supporting the Winograd algorithm and cross-layer scheduling. We first modify the cross-layer loop unrolling order to extract basic operations as instructions, and then improve the on-chip memory architecture for a higher utilization rate of the computation units in Winograd computation. We evaluate the hardware architecture and scheduling policy on a Xilinx Virtex-7 690T FPGA platform. As a case study, the intermediate data transfer of the VGG-D CNN model can be reduced by 90% with the cross-layer policy. The performance of our hardware accelerator reaches 1467 GOP/s with 190k logic cells. Experimental results show that, compared with state-of-the-art accelerators on the same platform, our design achieves a 1.5× speedup with the same number of logic cells. The performance can be further improved by 78% if larger Winograd transformation sizes are used.

Index Terms: FPGA, CNN, Winograd, Instruction, Cross-layer

I. INTRODUCTION

The convolutional neural network (CNN) has achieved significant improvement in image recognition and detection and is thus widely applied in computer vision tasks such as object tracking and person re-identification [1]. CNN has even proved its potential in areas other than image processing, such as language processing [2]. However, CNN requires a large computation complexity and a huge storage capacity compared with conventional algorithms. For example, the convolution layers of VGG-E [3] have 20MB of parameters, and a single forward pass of VGG-E needs more than 39G operations. Thus, specific hardware architectures are designed to accelerate CNN based on ASICs [4]-[7] or FPGAs [8]-[10].

The speed of an accelerator is the key measurement of a CNN accelerator design. How fast a CNN accelerator can be is decided by its parallelism, which is proportional to the number of computation units, and by its utilization ratio, which is affected by the memory system. The energy cost of each operation, i.e. the speed divided by the power, is used to measure the efficiency of a CNN accelerator design. The energy cost consists of two parts: memory access energy and computation energy. By optimizing the computation and the data access behavior, both the speed and the energy efficiency of a CNN accelerator can be improved. Besides that, the flexibility of a CNN accelerator is important to support various network structures.

Fast convolution algorithms, such as the Winograd transformation [11], have been used to speed up the operation of CNN. The most commonly used convolution kernels are 3×3, and the Winograd transformation for a 3×3 convolution with a 4×4 transformation tile mathematically saves 56% of the computation (16 instead of 36 multiplications), which means a 2.25× speedup over a straightforward implementation. [12] designs an architecture using Winograd to accelerate CNN.
However, that design is no longer efficient if cross-layer scheduling is applied, because of the long pre-loading time of its memory system (a line buffer).

The large feature maps and the great number of weights of a CNN require a large memory such as off-chip DRAM. Off-chip data transfer contributes most of the memory system energy and sometimes stalls the computation units, as its bandwidth is quite limited. Traditional implementations schedule the network layer by layer. In this way, the result of each layer has to be written back to external memory when the on-chip memory is limited. For example, in the VGG-D model this intermediate data amounts to 18.4MB, and transferring it takes 20% of the computation time [10]. [10] proposes a scheduling method that computes several layers at a time so that the transfer of intermediate data can be largely reduced. However, this design targets a specific network, and how to merge the layers is not fully discussed.

To address the above problems, we propose an instruction driven CNN accelerator with the following contributions:

- We optimize the cross-layer strategy for instruction support and propose a network dividing method to minimize intermediate data transfer.
- We improve the on-chip memory architecture for high-efficiency Winograd computation with cross-layer scheduling.
- We design an accelerator system for CNN supporting instructions, cross-layer scheduling, and Winograd.

Experimental results show that our design is 7× faster than the previous cross-layer FPGA accelerator [10], and our logic cell efficiency is 1.5× higher than the state-of-the-art FPGA CNN accelerator.

The rest of this paper is organized as follows. Section II introduces the background of CNN and the Winograd transformation. Section III introduces previous work. Section IV illustrates the overall mapping flow. Section V proposes a network dividing method to minimize intermediate data transfer. Section VI introduces our loop unrolling strategy to implement cross-layer scheduling based on instructions. Section VII improves the buffer structure and implements our CNN hardware architecture supporting instructions, cross-layer scheduling, and Winograd. Section VIII compares the results with state-of-the-art FPGA accelerators. Section IX concludes the paper.

II. BACKGROUND

A. Basic Operations of CNN

A typical CNN consists of a number of layers running in sequence. The data between different layers is called featuremaps. There are several kinds of layers in a CNN: convolution layers, nonlinear layers, pooling layers, and others.

Convolution layers process 2-D convolutions with trained weights. Convolution layers are usually cascaded to extract high-level features of the input, and they consume most of the computation of a CNN. Nonlinear layers apply a nonlinear activation function to each element of the output featuremaps of convolution layers and increase the fitting ability of the CNN. The Rectified Linear Unit (ReLU) is a widely used nonlinear activation function in CNN. Pooling layers help increase the receptive field of each output element and reduce the computational complexity of the CNN. A pooling layer outputs the max or average value of all elements in an area of the input featuremaps.

B. Introduction of Winograd

Convolution layers consume most of the computation in CNN. There are several methods to reduce the computation complexity of 2-D convolution. Winograd's minimal filtering algorithm is a fast method to compute 2-D convolution, especially when the convolution filters are small, like the 3×3 filters in CNN; it uses a transformation to compute the convolution [11]. For example, to compute a convolution whose image tile is 4×4 and whose kernel is 3×3, the Winograd algorithm uses 16 multiplications, while the standard algorithm of the same size uses 36 multiplications. The Winograd algorithm therefore achieves a 2.25× reduction in the number of multiplications [13]. The Winograd algorithm can be written in matrix form as:

    Y = A^T [ (G g G^T) ⊙ (B^T d B) ] A    (1)

where ⊙ indicates element-wise multiplication. The transformation matrices are:

    B^T = [[ 1,  0, -1,  0],
           [ 0,  1,  1,  0],
           [ 0, -1,  1,  0],
           [ 0,  1,  0, -1]]

    G   = [[  1,    0,    0],
           [ 1/2,  1/2,  1/2],
           [ 1/2, -1/2,  1/2],
           [  0,    0,    1]]

    A^T = [[ 1,  1,  1,  0],
           [ 0,  1, -1, -1]]    (2)

Here d is the 4×4 input tile and g is the 3×3 convolution kernel. The multiplications with these transformation matrix elements can all be converted to shift operations (like 1/2), which is very friendly to hardware.

The larger the Winograd tile is, the higher the acceleration rate of Winograd. However, the transformation matrices of a tile greater than 4×4 contain elements that cannot be simply converted to shift operations (like 1/5 in a 5×5 Winograd tile), so some approximation has to be done. The approximation results in precision degradation of the CNN, especially in detection tasks. We set our Winograd tile size to 4×4 to keep the precision of CNN, with a 2.25× acceleration rate over the original standard convolution.

III. RELATED WORK

A. Architecture Design

GPU platforms are especially suitable for training CNN models, as they are highly optimized for matrix-matrix multiplication. However, in inference tasks the data usually needs to be processed in a non-batch manner, which degrades the speed-up ratio of GPUs. Also, the high cost, the power consumption, and the lack of portability of GPU platforms motivate the research and development of many dedicated CNN accelerators.
These accelerators use different strategies to map the computation to hardware, exploiting different parallelism and data reuse patterns. In general, there are four types of loop optimization techniques in these hardware accelerators [14], exploiting different types of parallelism: kernel spatial, data spatial, input channel, and output channel.

Convolver: Qiu [8] proposes an architecture consisting of hardware convolvers: each PE is a convolver that includes 9 multipliers and an adder tree. Chen [6] divides the convolution into 1-D convolution primitives and uses 1-D convolver PEs. Lu [12] implements an FPGA accelerator with convolvers based on the Winograd transformation and reaches 3 TOP/s performance.

MAC: Multiply-accumulate (MAC) units are widely used in CNN accelerators. This kind of hardware architecture does not impose a restriction on the kernel size. Du [15] proposes an architecture using MACs, which involves inter-PE data flow for data reuse. ENVISION [7] is an architecture that uses parallel MACs and a parallel buffer array and achieves up to 10 TOPS/W.

M×V: Matrix-vector multiplication (M×V) is widely used in software implementations of convolution layers. Han [16], [17] implements an M×V architecture that utilizes sparsity and can be used in a large range of applications like CNN, RNN, and LSTM. Chen [4], [5] and Liu [18] divide the computation of CNN into M×V operations and propose a series of architectures and corresponding instruction sets.

Our implementation utilizes the Winograd convolution algorithm, in which more results are computed with the same number of operations.

B. Scheduling Strategy

Some accelerators assume the on-chip memory is large enough for all the weights and intermediate data of a CNN model, as in the DianNao series [4], [15]. Other designs [6], [8] that utilize off-chip memory discuss scheduling or data arrangement strategies to overcome the bandwidth limitation between the external and the on-chip memory. These works focus on intra-layer scheduling. Recent research has started to focus on multi-layer scheduling to reduce data transfer.

Alwani [10] uses HLS to implement and evaluate a pipelined multi-layer CNN accelerator for the first five convolution layers of VGG-E and shows that the multi-layer scheduling strategy can reduce data transfer by 95%. Li [19] implements a pipelined multi-layer CNN accelerator for AlexNet [20]. In these multi-layer accelerator implementations, the computing modules of all layers are put on the hardware, which consumes computing resources and makes this kind of solution hard to adapt to larger networks or to deploy on smaller platforms. Also, these solutions are not flexible enough to be adapted to new network structures easily, as the modules for each layer and the connections between them are designed explicitly.

The Winograd transformation accelerates CNN at the calculation units, and the cross-layer strategy reduces intermediate data transfer at the scheduling level. We combine these two methods into an instruction driven hardware design, making the accelerator more efficient and flexible.

IV. OVERALL WORK FLOW

Figure 1 illustrates the work flow of our system. We first analyze the target CNN model and decide which layers to fuse. We propose a network dividing method named the TMD method to minimize the intermediate data transfer. After fusing layers into several Layer Blobs, we generate instructions for each Layer Blob under the cross-layer strategy. With the help of cross-layer scheduling, there is no data transfer inside a Layer Blob. We finally execute all of the instructions on the FPGA platform.

Fig. 1. The work flow of our system.

V. NETWORK DIVISION

In this section, we introduce the network division method that minimizes the off-chip data transfer. We name this method the Transfer Minimum Division (TMD) method.

The data volume distribution across layers is extremely nonuniform. Figure 2 shows the input, output, and kernel data size of each layer of the VGG-D network. For example, convolution layer 1 requires 150KB of input and produces 3.4MB of output featuremaps. The output data are used to compute the following layer. Since the on-chip memory of an FPGA is limited, an architecture usually cannot provide such a large on-chip memory for all the output data. Therefore, data transfer occurs: the output data is stored to off-chip memory and then fetched back from off-chip memory for the following computation.

Fig. 2. The input, output and kernel data size of each layer of VGG-D, and the data size of the first Layer Blob across layer 1 and layer 2.

However, we notice that the output of layer 2 in the VGG-D net is only 750KB, which can be stored in the on-chip memory of our target FPGA platform (Xilinx Virtex-7 690T), so we do not keep the whole output of layer 1. We directly compute the output of layer 2 using the partial output of layer 1. In this way, we consider layer 1 and layer 2 as a whole Layer Blob and eliminate the data transfer.

A. Hardware Constraints

As described above, the weights of a Layer Blob are the weights of all layers in the blob. Layers can be merged into a Layer Blob (whose begin layer id is i and end layer id is j) when all the weights of these layers can be stored in the on-chip weight buffer, as in Equation (3). The volume of weights of the k-th layer is weight_k and the size of the on-chip weight buffer is Buffer_weight.

    Σ_{i ≤ k ≤ j} weight_k ≤ Buffer_weight    (3)

There is no intermediate featuremap transfer inside a Layer Blob.
The data volume to be transferred at the border of each Layer Blob is decided by the size of its output featuremap. The Blob transfers no data when the output featuremap (Data_out) can be stored on-chip, and twice the output featuremap when it cannot, because data transferred to off-chip memory has to be fetched back on-chip for the following computation.

The data transfer of the Layer Blob ending with layer i is denoted BT_i:

    BT_i = 0                  if Data_out,i ≤ Buffer_data
    BT_i = 2 × Data_out,i     if Data_out,i > Buffer_data    (4)

B. Dividing Method

We define the total data transfer DT(M) under a division method M as the sum of the data transfer of each Layer Blob:

    DT(M) = Σ_{i ∈ blobs} BT_i    (5)

Our goal is to find a solution M* under the hardware constraint governed by Equation (3), such that:

    DT(M*) = min_M DT(M)    (6)

We naturally convert the CNN model to a directed acyclic graph (DAG). In the DAG, a node represents the output of a certain layer. Each node has a weight, which here means the amount of data transfer (BT_i) given in Equation (4). We add a property lastnode to each node. lastnode indicates whether the node is the end layer of a Layer Blob, and if it is, lastnode also indicates the beginning layer of the same Layer Blob. The lastnode values of all nodes form the dividing method M of a CNN network.

We can find a collection of subproblems. For the i-th node, define the subproblem P(i) as: find the optimal division solution for the graph before the i-th node. It can easily be proved that the problem has optimal substructure, so it can be solved with dynamic programming. To analyze the problem, we define:

    H(i) = {x | x ≤ i}    (7)

H(i) is the set of nodes before the i-th node, and P(i) is the minimum total transfer over all divisions M of H(i):

    P(i) = min_{M over H(i)} DT(M)    (8)

It is obvious that for the last node n we have:

    P(n) = DT(M*)    (9)

So our goal is to find P(n) and the corresponding M*. The Bellman equation (the basic recurrence of dynamic programming [21]) for the subproblem P(i) can be written as:

    P(i) = min_j ( P(j) + BT_j )    (10)

where j is a node before i, and i, j satisfy the hardware constraint in Equation (3). The extra data transfer BT_j is added to the total data transfer at node j. With the given optimal substructure and Bellman equation, we use dynamic programming to find the target P(n): for each node in the network, the optimal division solution is one of the optimal division solutions for an earlier node j plus the transfer cost at node j. The process is described in Algorithm 1.

Algorithm 1 TMD: Transfer Minimum Division method
1: bt[i] is BT_i in Equation (4); lastnode gives a division method M in Equation (5).
2: P[0] = 0
3: lastnode[0] = 0
4: for i = 1 : n-1 do
5:     P[i] = +∞
6:     for j = 0 : i-1 do
7:         if Check(i, j) then
8:             if P[j] + bt[j] < P[i] then
9:                 P[i] = P[j] + bt[j]
10:                lastnode[i] = j
11: Layer_Blob_List = divide(lastnode)

The Check() function checks the satisfiability of Equation (3). At each iteration of the outer loop, a new layer is added to the search range. Repeating these steps until the last node of the network yields the optimal division solution. The network division method gives the minimum data transfer under the constraint of the on-chip memory and produces the division of layers into Layer Blobs.
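For reference, the following is a minimal Python sketch of the TMD method for a chain-structured network; it only illustrates Algorithm 1 and is not the production compiler code. It assumes weight[k] and data_out[k] are the weight and output-featuremap volumes of layer k (with data_out[0] the network input), and buffer_weight / buffer_data are the on-chip buffer sizes.

def tmd_divide(n, weight, data_out, buffer_weight, buffer_data):
    # weight[k]  : weight volume of layer k            (k = 1..n)
    # data_out[k]: output featuremap volume of layer k (data_out[0] is the network input)

    def bt(i):
        # Equation (4): border transfer when a Layer Blob boundary sits at node i.
        # The input and final output of the whole CNN are not counted as intermediate data.
        if i == 0 or i == n:
            return 0
        return 0 if data_out[i] <= buffer_data else 2 * data_out[i]

    def check(j, i):
        # Equation (3): the weights of layers j+1..i fit in the on-chip weight buffer.
        return sum(weight[k] for k in range(j + 1, i + 1)) <= buffer_weight

    INF = float("inf")
    P = [INF] * (n + 1)        # P[i]: minimum transfer for the sub-network up to layer i
    lastnode = [0] * (n + 1)   # lastnode[i]: node where the blob ending at layer i begins
    P[0] = 0
    for i in range(1, n + 1):
        for j in range(i):     # candidate Layer Blob: layers j+1 .. i
            if check(j, i) and P[j] + bt(j) < P[i]:
                P[i] = P[j] + bt(j)
                lastnode[i] = j

    blobs, end = [], n
    while end > 0:             # walk lastnode[] backwards to recover the Layer Blob list
        blobs.append((lastnode[end] + 1, end))
        end = lastnode[end]
    return P[n], list(reversed(blobs))

Under the 2MB data and weight buffers used in our FPGA evaluation, this division eliminates the intermediate featuremap transfer of VGG-D, consistent with the results reported in Section VIII.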
VI. CROSS-LAYER WITH INSTRUCTIONS

A. Cross-layer Strategy

Previous work [10] explores the dimension of the CNN accelerator design space that focuses on the data flow across convolution layers, and reduces the intermediate data transfer by 95%. The scheduling flow of the fused layers in [10] is illustrated in Algorithm 2.

Algorithm 2 Conventional convolution layer computation in [10]
1: N_i and N_o : the number of input and output channels
2: for (row = 0; row < H; row += S) do
3:     for (col = 0; col < W; col += S) do
4:         for (layer = 0; layer < L; layer++) do
5:             for (fi = 0; fi < N_i; fi++) do
6:                 for (fo = 0; fo < N_o; fo++) do
7:                     for (i = 0; i < k_x; i++) do
8:                         for (j = 0; j < k_y; j++) do
9:                             out[row][col][fo] +=
10:                                weight[fi][fo][i][j] *
11:                                in[row+i][col+j][fi]

For a net with L convolution layers, a conventional computation algorithm is shown in Algorithm 2: the N_i input featuremaps of size W × H are convolved with k_x × k_y weights to get the N_o output featuremaps. [10] modifies the traditional loop unrolling order by putting the loop over layers inside the loops over row and column (line 4). Because the computation structure of each layer differs from the others, it designs different computation units and data reuse units for different layers.

The FPGA accelerator in [10] can therefore only run a specific network and needs to be reprogrammed to run another network.

We modify the loop unrolling strategy in [10] to make the cross-layer scheduling policy more flexible. We propose a line-based cross-layer scheduling strategy, shown in Algorithm 3.

Algorithm 3 Convolution layer computation with Winograd
1: for (row = 0; row < H; row += 2) do
2:     for (layer = 0; layer < L; layer++) do
3:         for (fo = 0; fo < N_o; fo++) do
4:             for (fi = 0; fi < N_i; fi++) do
5:                 for (col = 0; col < W; col += S) do
6:                     out[row][col][fo] +=
7:                         Winograd(in, weight)

The row is regarded as the outermost loop variable (line 1). Each time, the computation window moves down by two lines and performs the computation through the whole Layer Blob. Furthermore, the convolution operations are done by the fast Winograd computation kernel.

Figure 3 shows our cross-layer scheduling strategy with an example that treats two convolution layers as a whole. The size of the input featuremap of the first layer is 7×7, and the size of the convolution kernels of both layers is 3×3. Layer 1 takes lines 1-3 of featuremap 1 as its input and convolves all of its convolution kernels across these three lines, producing line 1 of featuremap 2. After line 1 of featuremap 2 has been produced, line 1 of featuremap 1 can be discarded. Then layer 1 operates on lines 2-4 of featuremap 1 to produce line 2 of featuremap 2. In the same way, lines 3 and 4 of featuremap 2 can be calculated. As soon as lines 1-3 of featuremap 2 have been produced, layer 2 operates on these three lines to produce line 1 of featuremap 3. After line 1 of featuremap 3 has been produced, line 1 of featuremap 2 can be discarded. In general, a line of a featuremap can be discarded as soon as a new line of the next featuremap has been computed.

Fig. 3. Cross-layer scheduling example of two layers.

After finishing the computation of line 1 of featuremap 3, there is no need to compute three new lines of featuremap 2 to prepare the data for line 2 of featuremap 3: the computation of lines 1 and 2 of featuremap 3 shares lines 2 and 3 of featuremap 2 as input, so we only compute one new line (line 4) of featuremap 2. For the same reason, we only need one new line in featuremap 1. In conclusion, to compute a new line of the output featuremap, only one new line is needed in the input featuremap. The new line of each input and intermediate featuremap can be stored in the place of the line that was just discarded. With this strategy, there is no need to store an entire featuremap on chip; we only need to store K lines of each featuremap on chip, where K is the size of the convolution kernel (K = 3 in this example).

The convolution layers transfer no intermediate featuremaps under the cross-layer strategy. We name such a group of layers a Layer Blob. It should be noted that we assume all the kernel weights of the convolution layers can be stored on chip. If the kernel weights are too large, the cross-layer strategy loses its efficiency due to weight transfer, and in this case we compute the CNN layer by layer in sequence.

B. Instructions

An instruction executes a basic operation of the hardware. As described in Section VI-A, the basic operation of our cross-layer scheduling policy is the calculation of an output line of a layer, i.e. a calculation instruction finishes lines 5-7 in Algorithm 3, and the number of columns W is given in the instruction.
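For illustration, the sketch below shows how such calculation instructions could be laid out for one Layer Blob, following the loop order of Algorithm 3. The fields shown (layer id, output row, channel indices, row width) are a simplified, hypothetical encoding; the real instructions additionally carry the on-chip buffer addresses described in Section VII, and channels are processed in groups of P_i × P_o rather than one at a time.

from collections import namedtuple

# Hypothetical, simplified calculation-instruction fields.
CalcInstr = namedtuple("CalcInstr", "layer out_row fo fi width")

def emit_blob_instructions(layers, out_height):
    # layers: one dict per layer of the blob, with keys 'n_in', 'n_out' and 'width'.
    instrs = []
    for row in range(0, out_height, 2):       # the 4x4 Winograd tile yields 2 output rows
        for lid, layer in enumerate(layers):  # walk through every layer of the Layer Blob
            for fo in range(layer["n_out"]):
                for fi in range(layer["n_in"]):
                    # one instruction covers the column loop (lines 5-7 of Algorithm 3):
                    # accumulate one output line of channel fo from input channel fi
                    instrs.append(CalcInstr(lid, row, fo, fi, layer["width"]))
    return instrs

In the real schedule the rows of the different layers in a blob advance in the staggered fashion of Figure 3; the sketch only conveys the loop order that the instruction stream follows.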
We implement a compiler that generates the instructions of the cross-layer schedule for each Layer Blob. Besides calculation instructions, we also design instructions for data transfer. Data transfer instructions and calculation instructions can be executed in parallel. The addresses carried in calculation and data transfer instructions allocate the on-chip memory to the featuremaps and weights of the different CNN layers.

VII. HARDWARE DESIGN

In this section, we propose a hardware structure that combines Winograd and cross-layer scheduling.

A. Architecture Overview

Figure 4 presents the overview of our architecture supporting Winograd and cross-layer scheduling on FPGA.

Fig. 4. Architecture overview.

To store input data and output data in the same buffer, the hardware parallelism of input channels and output channels should be the same. An input register file is designed to shuffle data in case of zero-padding. Weights are fetched from the Weight Buffer in 3×3 tiles. The PEs compute eq. (1) with the input tile and the weights. We also design an output register to buffer data in case of pooling. In our design, there are several Buffer Pools and Weight Buffers for featuremaps and weights in different channels. For easy illustration, we assume there is only one Buffer Pool and one Weight Buffer, and that the hardware parallelism is 1×1.

B. Data Buffer Design

Previous work usually adopts a line buffer structure for data reuse. Figure 5 shows the basic structure of a line buffer. Initially, the line buffer reads the first M (= 2 in this figure) lines of input data. The PE stays idle until these M lines of input data are fully read; the first line of the result is then computed while the (M+1)-th line of input data is being read. Under the cross-layer strategy of Section VI, the basic operation is to calculate one line of an output featuremap. After finishing a line of the output featuremap, the next operation may start on a new featuremap, and the PE would return to idle waiting for the M lines of this featuremap. In cross-layer scheduling, the redundancy of reading the first M lines would keep the PE idle most of the time, resulting in low PE utilization. However, a line buffer is suitable for the weight buffer, because the lines of weights are much shorter, and the redundancy of the two lines of weights can be hidden by pre-loading and registering the weights.

Fig. 5. Line buffer structure. The PE stays idle while loading the first M lines, which is not suitable for cross-layer scheduling.

In order to support cross-layer scheduling, the input data buffer and the output data buffer should be merged into the same buffer, called the Buffer Pool in our implementation. Each Buffer Pool consists of 4 RAMs, and each RAM provides 4 input data per clock cycle. The read addresses of the RAMs are given in the instruction, and the instruction also provides the write-back addresses of the RAMs to store the results of the convolution layer.

C. Winograd PE Design

Figure 6 shows the structure of the PEs. The calculation of a convolution layer can be divided into four stages. The first stage is the transformation of the input tile and the weights. Note that the transformation of the input data and of the weights can each be divided into two constant matrix multiplications; the second multiplication of the input tile is mapped onto DSPs. The second stage is the element-wise multiplication of the transformed input tile and weights, which we also perform with DSPs. The third stage is the transformation of the result. The fourth stage is to accumulate the output tiles from the different input channels.

Fig. 6. PE design.

We notice that DSPs can do more than element-wise multiplication. In our design, we set the DSPs in the (A + B) × D mode, which can compute the additions of the B^T d B transformation together with the element-wise multiplication, making full use of the DSPs. The hardware also parallelizes the loops over input channels and output channels. We define the unroll factors of the input channel and output channel as P_i and P_o, so there are P_i × P_o PEs in total in the hardware design. Moreover, in order to store input data and output data in the same buffer, P_i is the same as P_o. Each PE consists of 16 DSPs. Therefore, P_o and P_i are constrained by the number of DSPs on the FPGA platform as in eq. (11).
The total number of DSPs on the platform is denoted DSP:

    P_i × P_o × 16 ≤ DSP    (11)
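To make the per-tile computation of the PE concrete, the following numpy sketch runs one 4x4 input tile and one 3x3 kernel through eq. (1), mirroring the first three PE stages described above. The hardware uses 8-bit fixed-point data and maps these steps onto DSPs; the sketch uses floating point purely for illustration.

import numpy as np

# F(2x2, 3x3) Winograd transformation matrices from eq. (2).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G   = np.array([[1.0,  0.0, 0.0],
                [0.5,  0.5, 0.5],
                [0.5, -0.5, 0.5],
                [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_tile(d, g):
    # d: 4x4 input tile, g: 3x3 kernel; returns the 2x2 output tile of eq. (1).
    # Stage 4 of the PE (accumulation over input channels) is not shown here.
    U = G @ g @ G.T          # stage 1a: weight transform (G g G^T)
    V = B_T @ d @ B_T.T      # stage 1b: input-tile transform (B^T d B)
    M = U * V                # stage 2 : element-wise multiplication (16 multiplies)
    return A_T @ M @ A_T.T   # stage 3 : output transform (A^T M A)

# Sanity check against direct 3x3 convolution on the same tile.
d = np.arange(16, dtype=float).reshape(4, 4)
g = np.arange(9, dtype=float).reshape(3, 3)
direct = np.array([[np.sum(d[r:r + 3, c:c + 3] * g) for c in range(2)] for r in range(2)])
assert np.allclose(winograd_tile(d, g), direct)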

VIII. EVALUATION

A. Network Division Evaluation

We evaluate our network division strategy under different on-chip memory conditions for intermediate data and weights. The VGG-D network (illustrated in Figure 2) is chosen as the benchmark. The amount of data that has to be transferred with and without network division (TMD) is shown in Table I.

TABLE I
INTERMEDIATE DATA TRANSFER UNDER DIFFERENT ON-CHIP MEMORY CONDITIONS WITH AND WITHOUT TMD

On-chip memory condition (1) | Strategy | Transfer data (MB) | Transfer reduced (MB) | Transfer reduced (%)
data=2MB,   weight=2MB       | Plain    | 6.60               |                       |
                             | TMD (2)  |                    |                       |
data=1.2MB, weight=1.2MB     | Plain    | 9.81               |                       |
                             | TMD      |                    |                       |
data=1.2MB, weight=0.6MB     | Plain    | 9.81               |                       |
                             | TMD      |                    |                       |
data=0.6MB, weight=1.2MB     | Plain    |                    |                       |
                             | TMD      |                    |                       |
data=0.6MB, weight=0.6MB     | Plain    |                    |                       |
                             | TMD      |                    |                       |
(1) The on-chip memory condition gives the data and weight buffer sizes on chip.
(2) Intermediate data transfer is eliminated except for the input and output of the whole CNN.

We can conclude that the size of the data buffer decides the data transfer when network division is not used. With the help of a larger data buffer, the data transfer can be reduced by computing a Layer Blob across more layers. In the FPGA evaluation, the size of the data buffer and of the weight buffer is 2MB, so intermediate data transfer is eliminated in the following FPGA evaluation. We remind the reader that our cross-layer scheduling and network dividing method eliminate the transfer of intermediate featuremaps; the transfer of the convolution layer weights cannot be reduced.

B. FPGA Experiment Setup

We evaluate our hardware architecture on a Xilinx Virtex-7 XC7V690T FPGA platform. The platform provides 3600 DSPs, and its external memory consists of two 1GB DDR3 chips. According to eq. (11), P_i = P_o ≤ 15. To make full use of the two DDR channels, we set P_i and P_o to 8 for one CNN accelerator and implement two CNN accelerators on the FPGA chip; in this way, the batch size of our hardware is 2. We use a PCIe interface to initialize the DDRs on the board and to control the CNN accelerators. The system operates at a frequency of 200 MHz. The whole system costs 250k logic cells, of which the interface control units (PCIe and MIG) consume 60k. [8], [12] use Xilinx Zynq SoCs and do not need to implement PCIe and MIG control units on the FPGA fabric, and [10] simulates its core system without peripherals, so we compare our work to previous work excluding these peripheral control units.

It is believed that 8-bit fixed-point numbers are enough to quantize the featuremaps and weights of CNN [8], so we use 8 bits for data storage and computation. It should be noted that this work concentrates on accelerating the convolution layers of CNN: the results of our evaluation are based on the convolution layers of different CNN models, and the other operations, like the fully-connected (FC) layers, are computed on the CPU.

C. Cross-layer vs. Plain Scheduling

We evaluate our CNN accelerator on VGG-A, VGG-D and a detection model, YOLO [22]. We generate instructions for these models to run the different CNN networks on the same hardware platform. Table II shows the performance comparison between the scheduling policy with and without cross-layer optimization.

TABLE II
PERFORMANCE COMPARISON BETWEEN STRATEGIES WITH AND WITHOUT TMD

                                 | YOLO (1)  | VGG-A (Plain / TMD) | VGG-D (Plain / TMD)
Problem Complexity (GOP)         |           |                     |
Weights Transfer (MB)            |           |                     |
Intermediate Data Transfer (MB)  |           |                     |
Total Transfer (MB)              |           |                     |
Total Time (ms)                  |           |                     |
Speed up                         | -         | 12%                 | 7%
(1) All intermediate data of any YOLO layer can be stored on board, so the schedules with and without cross-layer are equal.
As shown in the results, around 12% acceleration is achieved by adopting cross-layer scheduling on the small model VGG-A, and 7% on the larger model VGG-D. It should be noticed that when the network grows large, the weights of the convolution layers overwhelm the intermediate data, so the cross-layer speed-up ratio of a larger model (VGG-D) is lower than that of a smaller model (VGG-A). Moreover, when the intermediate data of every single layer is small enough to stay on chip, our cross-layer strategy is equal to the plain strategy.

D. Comparison With Previous Work

The performance comparison with other FPGA work is shown in Table III. Our work achieves state-of-the-art performance on FPGA. Moreover, our design is an instruction driven accelerator, providing flexibility for different CNN models. We improve the performance of the convolution layers from 566 GOP/s to 1467 GOP/s. It should be noticed that the Winograd kernel of our design is 4×4 and accelerates convolution by 2.25×, while the Winograd kernel of [12] is 6×6 and accelerates convolution by 4×. As described in Section II, we choose the 4×4 Winograd kernel to guarantee the precision of CNN models.
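As a rough cross-check of the measured throughput, a back-of-envelope peak estimate can be derived from the configuration above; the numbers below are our own estimate under those stated assumptions, not figures reported by the cited works.

# Effective peak throughput estimate under the stated configuration (a sketch).
accelerators  = 2           # two accelerator instances, one per DDR3 channel
pes           = 8 * 8       # P_i x P_o processing elements per accelerator
mults_per_pe  = 16          # DSP multipliers per PE, one 4x4 Winograd tile per pass
ops_per_mult  = 2           # each multiply pairs with an accumulation
winograd_gain = 36 / 16     # 2.25x: 16 multiplies replace the 36 MACs of a 2x2 output tile
freq_ghz      = 0.2         # 200 MHz

peak_gops = accelerators * pes * mults_per_pe * ops_per_mult * winograd_gain * freq_ghz
print(round(peak_gops))     # about 1843 GOP/s; the measured 1467 GOP/s is roughly 80% of it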

TABLE III
PERFORMANCE COMPARISON WITH STATE-OF-THE-ART FPGA ACCELERATORS

                                   | [8]           | [9]             | [19]            | [10]             | [12]         | Ours
Platform                           | Zynq XC7Z045  | Virtex-7 VX485T | Virtex-7 VX690T | Virtex-7 VX690T  | MPSoC ZCU102 | Virtex-7 VX690T
Clock (MHz)                        |               |                 |                 |                  |              | 200
Data Format                        | 16-bit fixed  | 32-bit float    | 16-bit fixed    | 32-bit float     | 16-bit fixed | 8-bit fixed
Network                            | VGG-D         | AlexNet         | AlexNet         | VGG-E layers 1-5 | VGG-D        | VGG-D
Problem Complexity (GOP)           |               |                 |                 |                  |              |
Batch                              |               |                 |                 |                  |              | 2
Used DSPs                          |               |                 |                 |                  |              |
Logic Cells (K)                    |               |                 |                 |                  |              | 190
Performance (GOP/s)                |               |                 |                 |                  |              | 1467
DSP Efficiency (GOP/s/DSP)         |               |                 |                 |                  |              |
Logic Cell Efficiency (GOP/s/cell) |               |                 |                 |                  |              |
(1) The frequency is estimated to calculate the performance.

We can also use a larger Winograd kernel size, like 6×6, for tasks requiring less precision. If we used a 6×6 Winograd kernel in our design, we could achieve 78% more performance with the same number of DSPs and reach the same DSP efficiency as the state-of-the-art work [12].

IX. CONCLUSION

In this paper, we propose an instruction driven CNN accelerator on FPGA with the optimization of cross-layer scheduling and a Winograd computing unit. A Transfer Minimum Division (TMD) method is presented to minimize the data transfer of general networks. We also design a compiler with cross-layer scheduling that maps CNNs to the accelerator with high flexibility. We then propose a hardware architecture that supports the Winograd computation and the instruction set on a Xilinx Virtex-7 FPGA, achieving 1.5× the logic cell efficiency of state-of-the-art designs.

REFERENCES

[1] N. McLaughlin, J. Martinez del Rincon, and P. Miller, "Recurrent convolutional network for video-based person re-identification."
[2] S. Zhang, C. Liu, H. Jiang, S. Wei, L. Dai, and Y. Hu, "Feedforward sequential memory networks: A new structure to learn long-term dependency," arXiv preprint.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint.
[4] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," vol. 49, ACM.
[5] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., "DaDianNao: A machine-learning supercomputer," IEEE Computer Society.
[6] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1.
[7] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI," in ISSCC.
[8] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, "Going deeper with embedded FPGA platform for convolutional neural network," in FPGA '16, ACM.
[9] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," ACM.
[10] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in MICRO.
[11] S. Winograd, Arithmetic Complexity of Computations, vol. 33, SIAM.
[12] L. Lu, Y. Liang, Q. Xiao, and S. Yan, "Evaluating fast algorithms for convolutional neural networks on FPGAs," in FCCM, IEEE.
[13] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks."
[14] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in FPGA '17, ACM.
[15] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," vol. 43, ACM.
[16] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," IEEE Press.
[17] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, "ESE: Efficient speech recognition engine with sparse LSTM on FPGA," in FPGA '17, ACM.
[18] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, "Cambricon: An instruction set architecture for neural networks," IEEE Press.
[19] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," IEEE.
[20] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s mobile coprocessor for deep neural networks."
[21] R. Bellman, "Dynamic programming and Lagrange multipliers," Proceedings of the National Academy of Sciences, vol. 42.
[22] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," arXiv preprint, 2016.


More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Content-Based Image Recovery

Content-Based Image Recovery Content-Based Image Recovery Hong-Yu Zhou and Jianxin Wu National Key Laboratory for Novel Software Technology Nanjing University, China zhouhy@lamda.nju.edu.cn wujx2001@nju.edu.cn Abstract. We propose

More information

Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang

Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang Profiling the Performance of Binarized Neural Networks Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang 1 Outline Project Significance Prior Work Research Objectives Hypotheses Testing Framework

More information

Elastic Processor for Deep Learning

Elastic Processor for Deep Learning INSTITUTE OF COMPUTING TECHNOLOGY 中科寒武纪 Elastic Processor for Deep Learning Zhiwei Xu Institute of Computing Technology (ICT) Chinese Academy of Sciences (CAS) http://novel.ict.ac.cn/zxu/ zxu@ict.ac.cn

More information

Shortcut Mining: Exploiting Cross-layer Shortcut Reuse in DCNN Accelerators

Shortcut Mining: Exploiting Cross-layer Shortcut Reuse in DCNN Accelerators Shortcut Mining: Exploiting Cross-layer Shortcut Reuse in DCNN Accelerators Arash Azizimazreah, and Lizhong Chen School of Electrical Engineering and Computer Science Oregon State University, USA {azizimaa,

More information

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong TABLE I CLASSIFICATION ACCURACY OF DIFFERENT PRE-TRAINED MODELS ON THE TEST DATA

More information

arxiv: v2 [cs.cv] 3 May 2016

arxiv: v2 [cs.cv] 3 May 2016 EIE: Efficient Inference Engine on Compressed Deep Neural Network Song Han Xingyu Liu Huizi Mao Jing Pu Ardavan Pedram Mark A. Horowitz William J. Dally Stanford University, NVIDIA {songhan,xyl,huizi,jingpu,perdavan,horowitz,dally}@stanford.edu

More information

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung

More information

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Hengyu Zhao, Colin Weinshenker*, Mohamed Ibrahim*, Adwait Jog*, Jishen Zhao University of California, Santa Cruz, *The College of William

More information

arxiv: v2 [cs.lg] 16 Nov 2018

arxiv: v2 [cs.lg] 16 Nov 2018 MINI-BATCH SERIALIZATION: CNN TRAINING WITH INTER-LAYER DATA REUSE Sangkug Lym 1 Armand Behroozi 2 Wei Wen 3 Ge Li 1 Yongkee Kwon 1 Mattan Erez 1 arxiv:181.37v2 [cs.lg] 16 Nov 218 ABSTRACT Training convolutional

More information

Model Compression. Girish Varma IIIT Hyderabad

Model Compression. Girish Varma IIIT Hyderabad Model Compression Girish Varma IIIT Hyderabad http://bit.ly/2tpy1wu Big Huge Neural Network! AlexNet - 60 Million Parameters = 240 MB & the Humble Mobile Phone 1 GB RAM 1/2 Billion FLOPs NOT SO BAD! But

More information

Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer

Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer Escher: A CNN Accelerator with Flexible Buffering to Minimize ff-chip Transfer Yongming Shen Stony Brook University yoshen@cs.stonybrook.edu Michael Ferdman Stony Brook University mferdman@cs.stonybrook.edu

More information

A Configurable Architecture for Sparse LU Decomposition on Matrices with Arbitrary Patterns

A Configurable Architecture for Sparse LU Decomposition on Matrices with Arbitrary Patterns A Configurable Architecture for Sparse LU Decomposition on Matrices with Arbitrary Patterns Xinying Wang, Phillip H. Jones and Joseph Zambreno Department of Electrical and Computer Engineering Iowa State

More information

Arm s First-Generation Machine Learning Processor

Arm s First-Generation Machine Learning Processor Arm s First-Generation Machine Learning Processor Ian Bratt 2018 Arm Limited Introducing the Arm Machine Learning (ML) Processor Optimized ground-up architecture for machine learning processing Massive

More information

DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses

DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough 1,2 S. K. Lee 2, N. Mulholland 2, P. Hansen 2, S. Kodali 3, D. Brooks 2, G.-Y. Wei 2 1 ARM Research, Boston,

More information

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN Xiaoying Li 1 Fuming Sun 2 Enhua Wu 1, 3 1 University of Macau, Macao, China 2 University of Science and Technology Beijing, Beijing, China

More information

Brainchip OCTOBER

Brainchip OCTOBER Brainchip OCTOBER 2017 1 Agenda Neuromorphic computing background Akida Neuromorphic System-on-Chip (NSoC) Brainchip OCTOBER 2017 2 Neuromorphic Computing Background Brainchip OCTOBER 2017 3 A Brief History

More information

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and

More information

FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review

FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review Date of publication 2018 00, 0000, date of current version 2018 00, 0000. Digital Object Identifier 10.1109/ACCESS.2018.2890150.DOI arxiv:1901.00121v1 [cs.ne] 1 Jan 2019 FPGA-based Accelerators of Deep

More information

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup Yan Sun and Min Sik Kim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington

More information

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),

More information

A Lightweight YOLOv2:

A Lightweight YOLOv2: FPGA2018 @Monterey A Lightweight YOLOv2: A Binarized CNN with a Parallel Support Vector Regression for an FPGA Hiroki Nakahara, Haruyoshi Yonekawa, Tomoya Fujii, Shimpei Sato Tokyo Institute of Technology,

More information

High Performance Computing

High Performance Computing High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason

More information

Speculations about Computer Architecture in Next Three Years. Jan. 20, 2018

Speculations about Computer Architecture in Next Three Years. Jan. 20, 2018 Speculations about Computer Architecture in Next Three Years shuchang.zhou@gmail.com Jan. 20, 2018 About me https://zsc.github.io/ Source-to-source transformation Cache simulation Compiler Optimization

More information

A flexible memory shuffling unit for image processing accelerators

A flexible memory shuffling unit for image processing accelerators Eindhoven University of Technology MASTER A flexible memory shuffling unit for image processing accelerators Xie, R.Z. Award date: 2013 Disclaimer This document contains a student thesis (bachelor's or

More information

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018 Adaptable Computing The Future of FPGA Acceleration Dan Gibbons, VP Software Development June 6, 2018 Adaptable Accelerated Computing Page 2 Three Big Trends The Evolution of Computing Trend to Heterogeneous

More information

Versal: AI Engine & Programming Environment

Versal: AI Engine & Programming Environment Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY

More information

An Adaptable Deep Learning Accelerator Unit (DLAU) for FPGA

An Adaptable Deep Learning Accelerator Unit (DLAU) for FPGA An Adaptable Deep Learning Accelerator Unit (DLAU) for FPGA N. Sireesha 1 & P.Malleswari 2 1PG Scholar, Dept of ECE, Narsaraopeta Institute of Technology, Yellamanda, Narsaraopeta, Guntur district, Andhra

More information

Hardware Acceleration for Machine Learning

Hardware Acceleration for Machine Learning 2017 IEEE Computer Society Annual Symposium on LSI Hardware Acceleration for Machine Learning Ruizhe Zhao, Wayne Luk, Xinyu iu Department of Computing, Imperial College London, United Kingdom Huifeng Shi,

More information

Hello Edge: Keyword Spotting on Microcontrollers

Hello Edge: Keyword Spotting on Microcontrollers Hello Edge: Keyword Spotting on Microcontrollers Yundong Zhang, Naveen Suda, Liangzhen Lai and Vikas Chandra ARM Research, Stanford University arxiv.org, 2017 Presented by Mohammad Mofrad University of

More information

Understanding Sources of Inefficiency in General-Purpose Chips

Understanding Sources of Inefficiency in General-Purpose Chips Understanding Sources of Inefficiency in General-Purpose Chips Rehan Hameed Wajahat Qadeer Megan Wachs Omid Azizi Alex Solomatnikov Benjamin Lee Stephen Richardson Christos Kozyrakis Mark Horowitz GP Processors

More information

High-End Computing Trends Speed Scalability Efficiency

High-End Computing Trends Speed Scalability Efficiency INSTITUTE OF COMPUTING TECHNOLOGY High-End Computing Trends Speed Scalability Efficiency Zhiwei Xu Institute of Computing Technology (ICT) Chinese Academy of Sciences http://novel.ict.ac.cn/zxu/ zxu@ict.ac.cn

More information

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and

More information

A Communication-Centric Approach for Designing Flexible DNN Accelerators

A Communication-Centric Approach for Designing Flexible DNN Accelerators THEME ARTICLE: Hardware Acceleration A Communication-Centric Approach for Designing Flexible DNN Accelerators Hyoukjun Kwon, High computational demands of deep neural networks Ananda Samajdar, and (DNNs)

More information