High-Level Synthesis of a Fast Super-Resolution Convolutional Neural Net

Justin Selig (jss459), Patrick Wang (pw349), Shaurya Luthra (sl2462)
ECE 5775, Prof. Zhiru Zhang
December 11th, 2017

Introduction

With Dennard scaling and Moore's law coming to an end, hardware specialization and system heterogeneity pave the way for the continual advancement of processing speed and power. Specialization becomes especially useful for specific, compute-heavy applications and enables new problems to be solved with the extra horsepower. One such application is machine learning, specifically that involving Convolutional Neural Networks (CNNs). While CNNs have many uses, one of the most prominent is image processing and recognition: an image's two-dimensional data representation lends itself particularly well to convolution. One such use case is Image Super-Resolution, the process of taking one (or several) low-resolution (LR) images and creating a high-resolution (HR) image. This can be useful for clarifying blurry or LR images from surveillance cameras, crops of images or videos, and for providing more information to other applications that take images as input, such as facial recognition.

The goal of our project was to implement a hardware accelerator for Image Super-Resolution using one LR input image. We chose the Fast Super-Resolution CNN (FSRCNN) as described in [2]. As the name suggests, the primary difference between FSRCNN and the original SRCNN is the speed at which the algorithm runs; FSRCNN has also been shown to produce better results than standard SRCNN methods. This improvement in quality and speed, however, comes at the cost of complexity, which is described further below.

Our project involved implementing the FSRCNN algorithm in C++ and synthesizing hardware modules to run on an FPGA using the Xilinx SDSoC Development Environment. To do this, we had to consider principles in software-hardware co-design, hardware performance optimization, area/resource constraints, and evaluation methodology, all described in later sections. As a result of implementing this model, we achieved results comparable to the state-of-the-art software-only approach, but with higher throughput and lower latency in hardware.

Techniques

As described in the FSRCNN paper [2], the FSRCNN algorithm, while faster and more accurate than SRCNN, is far more complex. While the SRCNN algorithm consists of only 3 stages (pre-processing, feature extraction, and mapping), the FSRCNN algorithm consists of 8 stages: 7 convolution stages and 1 deconvolution stage. The weights, biases, and PReLU values are taken from the pre-trained models provided in [2]. The stages are illustrated and described in further detail below:

Figure 1: FSRCNN Layers Visualized

Stage Description:
- Input Image: LR, single channel
- Conv. Layer 1: Feature extraction
- Conv. Layer 2: Shrinking
- Conv. Layers 3-6: Mapping
- Conv. Layer 7: Expanding
- DeConv Layer 8: Deconvolution

The above 8 layers account for the network's total weight count, plus a small number of parameters for the PReLU activations (layers 1-7).

Our goal was to take the above 8 layers and develop a software model for them using the pre-trained model weights. We would then use these weights in our FSRCNN implementation and validate the model outputs using Peak Signal-to-Noise Ratio (PSNR) to quantify the quality of the results. In order to achieve an observable acceleration, we would try to offload as many layers as possible to the FPGA while remaining within resource limits. The equation for PSNR over N pixels is shown below, where error_n is the per-pixel difference between the ground truth and the output (with a peak value of 1 for normalized image data):

PSNR = 20 * log10( 1 / sqrt( (1/N) * sum_{n=1}^{N} error_n^2 ) )

We chose to offload layers 3-6 because these layers have the same parameters and come right after one another. This is important since data may be passed directly between hardware layers rather than communicated back to the CPU, which takes a non-trivial amount of time. Additionally, since these layers are identical in structure, accelerating one layer effectively accelerates all layers performed in hardware (there is no need to test different HLS optimization techniques for each layer). We chose not to move the deconvolution function to the FPGA because of the sheer size of the computation. Layer 8, which uses deconvolution, uses 4536 weights (over ⅓ of our total number of weights across all 8 layers) and incurs a 9x increase in image data due to the upscaling inherent to deconvolution, meaning that the bottleneck in accelerating layer 8 would most likely be data transfer: moving weights and input/output feature maps in and out of the FPGA as well as the CPU's memory system. A visualization of this can be seen in Figure 2.

Figure 2. System block diagram

Software Implementation

To start, we reviewed an existing MATLAB implementation of FSRCNN provided in [1]. We used the MATLAB output for numerical (PSNR) and visual performance comparisons; the release also included test images and the pre-trained model. From that model, we extracted the weights, biases, PReLU values, and input image values by converting them into 1D arrays and writing them to .dat files. Because the MATLAB model used multidimensional arrays and transforms 2D matrices into 1D arrays column by column, some simple matrix transformations and transposes were used to list the data in the correct order.

In order to match the convolution structure used in lab 4, output and input feature maps are listed by column (x) then row (y), and weights are listed by output fmap, channel (input fmap), column, then row. This simplifies the indexing into our 1D arrays. From there, the same setup for convolution from lab 4 was used. All parameters are passed as 1D arrays to easily accommodate different map and filter sizes, with important arguments such as the number of input fmaps, number of output fmaps, input edge size, output edge size, weights, biases, and the PReLU value. Within the function, nested loops perform the convolution. The loop ordering is as follows:

for each output fmap
  for each input fmap
    for each output fmap location
      for each filter location
        perform convolution

Additionally, the MATLAB model used PReLU as the activation function instead of ReLU. ReLU adds a bias to the output and caps the lower end at 0. PReLU takes that biased value and adds a PReLU weight multiplied by the minimum of the biased value and 0. Mathematically, this plays out as follows:

biased = output + bias
relu_output  = max(biased, 0)
prelu_output = max(biased, 0) + prelu_weight * min(biased, 0)

This convolution function on its own, however, will not replicate the MATLAB calculations, due to the issue of padding. The convolution function in lab 4 (and consequently the one we used) requires the input fmap to be at least as big as the output fmap, based on filter size. The MATLAB model, on the other hand, uses the same sized input and output fmaps for each layer, because the MATLAB convolution method pads the input to be big enough for the convolution based on the filter size. The thickness of the outer border b is related to the filter edge size K as follows:

b = (K - 1) / 2

The resulting padded input fmap then has edge size I_pad = I_o + 2 * b. The borders are made up of the nearest original edge element: the edges are replicated outward, with the corners consisting of the original corners of the fmap. A pad function was written to do just that, matching the MATLAB model, with padding occurring before each convolution layer (a sketch appears at the end of this section).

With the pad and convolution functions, layers 1 through 7 could now be executed. Layer 8, however, was a deconvolution layer. Deconvolution, or transposed (strided) convolution, is a method of reversing the effect of convolution by padding the input between pixels and performing a direct convolution. This operation is used as the last stage in the FSRCNN network and applies the largest number of weights of all the layers. The effect of this operation is that the output image ultimately returns to the size of the input. Since the input to this last layer is a dense set of pixels, the conversion to a coarser representation via deconvolution is effectively a form of upsampling: it reverses the shrinking effect of convolution by interpolating between input pixels via a strided convolution whose weights have been previously learned. The operation itself involves the temporary storage of a very large matrix, almost twice the size of the input. Since this layer has such a large number of entries in its intermediate operation, it was infeasible to move it to the FPGA: it would require a massive amount of memory for storage.
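As a reference for the replicate-padding just described, the following is a minimal C++ sketch (our reconstruction, not the exact source; it assumes row-major 1D storage as described above):

// Pad an o x o fmap into a (o + 2b) x (o + 2b) buffer, where b = (K - 1) / 2.
// Out-of-range source coordinates are clamped, so the borders replicate the
// nearest original edge pixel and the corners replicate the original corners.
void pad(const float *in, float *out, int o, int b) {
    int p = o + 2 * b;                         // padded edge size I_pad
    for (int y = 0; y < p; y++) {
        for (int x = 0; x < p; x++) {
            int sy = (y - b < 0) ? 0 : ((y - b >= o) ? o - 1 : y - b);
            int sx = (x - b < 0) ? 0 : ((x - b >= o) ? o - 1 : x - b);
            out[y * p + x] = in[sy * o + sx];  // row-major 1D indexing
        }
    }
}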

The above functions (pad, convolution, and deconvolution), called with their respective layer parameters shown in Table 1, form the FSRCNN function, along with the copying of input data into a temporary convolution memory and the copying of output data back out of it. With this, the fsrcnn.cpp file was complete.

Table 1. Expanded layer parameters (columns: F_Size, I_Size, I_Padded, I_Count, O_Count, I_Total, W_Total, Biases, PReLU)

Each function was incrementally verified by creating small test matrices and inputting them into both our C++ functions and the functions used in the MATLAB model. After incremental testing, the function as a whole was tested by passing one of the MATLAB test images through it and comparing our output data against MATLAB's. This was done by copying the data into a spreadsheet and calculating the magnitude of the raw error between the two.

A separate file, main.cpp, was used to call our FSRCNN method and supply different test images. In addition to being where input data is managed, this file is also where the output data is compared to the original ground-truth image. The error function used in the MATLAB model is peak signal-to-noise ratio (PSNR), a metric typically used to measure the reconstruction fidelity of compression/decompression algorithms; it is essentially the magnitude in decibels of the mean of the squared difference between the ground truth and our output. Using the MATLAB code as a reference, along with the math.h library, we recreated our own PSNR function. A shave function was also created to trim the border created by the deconvolution so that the dimensions of the final output and the ground-truth image match (sketches of both appear at the end of this section).

Software Results

After running the code on the ecelinux servers, we compiled and ran it on the Xilinx ZC706 boards provided by the instructor. This run produced the average PSNR over 3 images (bird, butterfly, and head from the Set5 folder of [1]) reported in Table 2. Note that the original MATLAB code from [1] reports mean PSNRs over the same 3 images for both the FSRCNN method and bicubic upscaling; numerically, our C++ implementation gets a far worse PSNR than even bicubic upscaling does in MATLAB. However, through experimentation we discovered that if we take the raw floating-point output of our FSRCNN function and do the rest of the processing in MATLAB (float to 8-bit integer conversion and shaving), the calculated PSNR is within 0.1 dB of the MATLAB PSNR. The way MATLAB does the float-to-int conversion is believed to be the culprit of the mismatch.
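For reference, minimal C++ sketches of the PSNR and shave routines described above (illustrative reconstructions using math.h, not the exact source; they assume images normalized to [0, 1] and stored row-major):

#include <math.h>

// PSNR in dB between ground truth and output over n pixels, following the
// equation given earlier (a peak value of 1.0 is an assumption for
// normalized image data).
float psnr(const float *truth, const float *test, int n) {
    float mse = 0.0f;
    for (int i = 0; i < n; i++) {
        float e = truth[i] - test[i];
        mse += e * e;
    }
    mse /= n;
    return 20.0f * log10f(1.0f / sqrtf(mse));
}

// Trim a border of width s from an e x e image so its dimensions match
// those of the ground truth.
void shave(const float *in, float *out, int e, int s) {
    int o = e - 2 * s;                          // trimmed edge size
    for (int y = 0; y < o; y++)
        for (int x = 0; x < o; x++)
            out[y * o + x] = in[(y + s) * e + (x + s)];
}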

Even when the byte version of our output is inspected visually (see Figure 4), it appears much closer to MATLAB's FSRCNN picture than to the bicubic upscaled picture.

Another note is that the individual layer times do not sum to the total runtime. The software timers are placed only around the padding and the actual layer itself. While the convolution is the operation of interest, the padding takes on the order of 1 msec to run and was therefore deemed negligible to the overall timing when grouped together with the layers. Within the FSRCNN function there is also data copying in and out of the function, which is not included; the main function additionally performs file writes, shaving, and the PSNR calculation.

Table 2. Software results from the board (columns: PSNR (dB), L1-2 Runtime, L3-6 Runtime, L7 Runtime, L8 Runtime, Total Runtime)

Hardware Implementation

To accelerate the system, we chose to move layers 3 through 6 into hardware. Because they have the same overall parameters (12 input channels, 12 output fmaps, 3 x 3 filter), they can share the same hardware module, saving area and development time. Additionally, in terms of layer count, they make up one half of all the layers. As single layers, they also have the 3rd-most weights (see Table 1), meaning they comprise a significant portion of the overall execution. In fact, looking at the runtime breakdown in Table 2, as a group layers 3 through 6 take up 44.8% of the total fsrcnn time.

In order to move these layers into hardware, we had to create two hardware functions: perform_conv_hw and perform_bias_hw. The single perform_conv function from before was broken into two hardware modules in order to support the in-order data access requirements of SDSoC. We first wrote perform_conv_hw, followed by perform_bias_hw, and initially used these functions only for layer 3. To minimize data transfer, the weights necessary for layer 3 were stored locally in hardware.

After running our first synthesis, we found that we could generate a bitstream, but it would immediately hang when run on the ZC706. The reason for this, discovered with the help of Sean (TA) and Professor Zhang, was that we were accessing data out of order and had not properly defined our data interfaces. After realizing this, we added the following pragma:

#pragma SDS data access_pattern(input:sequential,output:sequential)

We also used two for loops to copy our input and output into/out of temporary arrays in order to preserve the sequential access pattern. After adding this pragma and resynthesizing, we found that our code still hung. This turned out to be a result of partial data access: with SDSoC we cannot declare an array of size M and access only N of its elements when N != M. We needed to keep the flexible array size in order to make our function modular across different images, so we then used the following pragma:

#pragma SDS data copy(input[0:12*(o+k-1)*(o+k-1)], output[0:12*o*o])

This pragma allowed us to pass in a variable (pointer) to our array and access a variable amount of data dependent on function parameters (which can be resolved at compile time). With the above modifications we were able to run our code on the ZC706 boards, but did not achieve much of a runtime boost. Looking at the Vivado HLS reports, we saw a large amount of per-function latency spent in the data copies from the input into a temporary array and from the temporary array to the output; a sketch of this interface is shown below.
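As a hedged reconstruction (the real perform_conv_hw signature may differ, and MAX_PAD is a hypothetical size bound), the interface looked roughly like this:

// Hardware convolution entry point (sketch). The pragmas declare sequential
// streaming access and exact transfer sizes, both resolvable at compile time.
#define MAX_PAD 36                                 // hypothetical bound on padded edge size
#pragma SDS data access_pattern(input:sequential,output:sequential)
#pragma SDS data copy(input[0:12*(o+k-1)*(o+k-1)], output[0:12*o*o])
void perform_conv_hw(float input[], float output[], int o, int k) {
    static float in_buf[12 * MAX_PAD * MAX_PAD];   // local copy of the input fmaps
    for (int i = 0; i < 12 * (o + k - 1) * (o + k - 1); i++)
        in_buf[i] = input[i];                      // sequential read from input
    // ... convolution over in_buf; results were initially accumulated in a
    // second temporary and copied out to output in a final loop ...
}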
These copies were done to maintain in-order data access, as SDSoC does not support INOUT arrays as parameters to hardware functions. In order to reduce latency, we decided to restructure our convolution loop as shown below:
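The reordered nest, as we reconstruct it here (a sketch of the loop order of Figure 3; the 1D index math assumes the row-major layouts described earlier):

// For each output fmap, each output pixel is fully accumulated into a local
// temporary before being written, so output is written strictly in order and
// the separate output copy loop disappears.
void conv_reordered(const float *wgt, const float *in_buf, float *output, int o) {
    const int M = 12, K = 3;              // parameters shared by layers 3-6
    const int p = o + K - 1;              // padded input edge size
    for (int f = 0; f < M; f++)           // output fmaps
        for (int y = 0; y < o; y++)       // output rows
            for (int x = 0; x < o; x++) { // output columns
                float acc = 0.0f;
                for (int m = 0; m < M; m++)                // input fmaps
                    for (int ky = 0; ky < K; ky++)         // filter rows
                        for (int kx = 0; kx < K; kx++)     // filter columns
                            acc += wgt[((f * M + m) * K + ky) * K + kx]
                                 * in_buf[(m * p + y + ky) * p + x + kx];
                output[(f * o + y) * o + x] = acc;         // in-order write
            }
}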

Figure 3. Loop order

By reordering our loop and using a temporary variable to hold the accumulating sum for each output pixel, we were able to change our output from out-of-order to in-order access within the convolution and eliminate the output copy loop. This greatly reduced latency.

The next and final optimizations were unrolling parts of the inner loops. More specifically, we manually unrolled the two loops which iterate over filter pixels, decreasing runtime as well as increasing accuracy by reducing truncation between individual convolution sums. Finally, we unrolled the loop over each input map (m input maps) so that all of the values needed for a single output pixel could be computed in parallel.

Following these optimizations, we attempted to pipeline the loop which iterates over the input columns. Doing so required 12 line buffers (for parallel convolution over the input maps); unfortunately, partitioning these line buffers pushed our resource utilization above 100%, and not partitioning them actually worsened our overall minimum and maximum latencies. With this realization, we instead moved layers 4-6 into hardware as well. The results are discussed below.

Hardware Results (Evaluation)

After verifying that the design still compiled and ran in software, we ran HLS synthesis of our baseline hardware design and generated bitstreams to program the Xilinx ZC706 boards; the results can be seen in Table 3.

Table 3. Hardware baseline results (columns: PSNR (dB), Runtime, Period (nsec), Max Latency, BRAM, DSP, FF, LUT)

There is only a minor speedup over the software baseline, approximately 1.03x, but this is expected: floating-point operations are slow in hardware, and only 1 layer had been moved into hardware. However, this gave us a good baseline for the hardware optimizations to come.

Hardware Optimizations (Evaluation)

After making sure we had a working hardware baseline, we began optimizing our hardware module for better performance. We set certain constraints to meet, namely to maintain a mean PSNR of at least 26 dB and to keep resource utilization under 50%. The total resources of the ZC706 can be seen in Table 4.

Table 4. Total resources available on the ZC706 board (columns: BRAM, DSP, FF, LUT)

With the accuracy and area constraints in mind, the first optimization made after the hardware baseline design was the change from floating-point to fixed-point arithmetic for the hardware convolution and biasing functions. Experimenting with different values, we discovered that we needed at least 4 integer bits to accurately represent the image data and at least 16 total bits to keep the PSNR above 26 dB. At only 16 bits wide, this halves the area required to store the weights, input, and output data; due to its simplicity, it also provided a speed boost of about 3x in maximum latency.

At this point, our total BRAM utilization was still well below the 50% limit, so we also moved the weights of layers 4 through 6 into hardware and rewrote the fsrcnn function to run all 4 layers in hardware. This only increased the BRAM count (with negligible changes in FF and LUT usage due to select logic) and provides an even better overall speed increase, though this isn't apparent from the Vivado synthesis reports alone.

After deciding which fixed-point parameters to use, we next exploited the inherent parallelism of the calculations. First, the innermost filter loops were manually unrolled so that the summing of the weights multiplied with the input values can be parallelized. We found that this also increases our PSNR, as the accumulation happens in one step, reducing the accuracy lost to repeated accumulations; however, the PSNR increase was not enough to further drop the bit-width of our fixed-point type. We were also able to completely unroll the input fmap loop shown in Figure 3, as well as completely partition the input buffer in the first dimension to allow simultaneous access, facilitating parallel calculation. A sketch of the fixed-point type and the unrolled filter sum appears at the end of this section.

Hardware Optimization Results (Evaluation)

The results of the above-mentioned optimizations are shown below in Table 5. In terms of max latency, we achieved a 205x reduction compared to the baseline hardware design. For area, the optimized design uses fewer BRAMs despite holding more data, though DSP, FF, and LUT usage did increase; utilization remains well below the 50% limit at 20.5%, 13.7%, 2.7%, and 10.1% respectively. While the program has more parallelism than what we exploited, the sizes of our buffers made it infeasible to pipeline or partition further, due to the overhead of the muxes resulting from variable array accesses.

Table 5. Final optimized hardware results (columns: PSNR (dB), Runtime, Period (nsec), Max Latency, BRAM, DSP, FF, LUT)

For timing, we achieved an overall speedup over the software running on the ZC706 board. Looking at the timing breakdown of the optimized design in Table 6 and comparing it to that of the software in Table 2, our target layers (3-6) achieved a 46.7x speedup. While this is far less than the 205x reduction in maximum latency for the module itself, the bottleneck is likely the data transfer, as well as the padding function, which remains in software but is included in the runtime breakdown of the target layers. The results are still fairly significant: considering that eliminating the runtime of layers 3-6 entirely would cap the overall speedup at 1.673x, we came very close.

Table 6. Final optimized runtime breakdown (columns: L1-2 Runtime, L3-6 Runtime, L7 Runtime, L8 Runtime)
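As an illustration, the fixed-point type and the manually unrolled 3 x 3 filter sum can be sketched as follows (the bit-widths come from the text above; the function name and array layout are illustrative):

#include <ap_fixed.h>

typedef ap_fixed<16, 4> data_t;   // 16 total bits, 4 integer bits

// All nine products of a 3 x 3 filter summed in one expression, so the
// accumulation happens in a single step rather than nine dependent adds.
data_t conv3x3(const data_t w[9], const data_t p[9]) {
    return (w[0] * p[0] + w[1] * p[1] + w[2] * p[2])
         + (w[3] * p[3] + w[4] * p[4] + w[5] * p[5])
         + (w[6] * p[6] + w[7] * p[7] + w[8] * p[8]);
}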

For a visual representation of our final function output, Figure 4 shows a test image that we used to evaluate the output of our network. In the upper right is the raw ground-truth image. The left side shows the LR input to the network, which we obtained by taking the original ground-truth image, downsampling it by averaging neighboring pixels, then upsampling by inserting average-intensity pixels between existing pixels to return the image to its original size. The lower right shows the HR output of our network after passing in the LR image. It is observed that the FSRCNN network restored some image quality to the LR image.

The program writes three text files with the byte values tab-delimited. These can be converted into image files by our Python script once they are copied to a local machine, as the ecelinux server does not have the required Python libraries. The script is parse_image.py and expects the text files in the same directory.

Figure 4: Actual output from running an image through the FSRCNN network

Project Management

Division of Labor

Patrick Wang:
- MATLAB scripts to generate C++ convolution model
- MATLAB code to evaluate correctness of C++ model
- System framework
- perform_conv and matrix printing functions, fsrcnn.cpp, main.cpp
- General system debugging and bug squashing
- Data accumulation and statistics
- Hardware optimizations: fixed-point; line buffer (attempt, not successful)
- C++ debugging for convolution function

Shaurya Luthra:
- First pass at PSNR and matrix printing functions (corrected by Patrick)
- Software -> hardware: perform_bias_hw, perform_conv_hw
- SDSoC integration / Makefile creation
- Data transfer pragmas
- Hardware optimizations: loop reorder, loop unrolling
- Aided with data accumulation: ran synthesis / flashed FPGA

Justin Selig:
- Python .txt -> .PNG script
- Deconvolution function and testing
- Padding and convolution debug
- General system debugging
- Hardware optimizations: weight buffer (attempt, not successful)
- Attempt at hardware timing
- Project presentation setup, formatting, and editing
- Project writeup setup, formatting, and editing
- Writeup sections

Week 1 (11/6/2017):
- Reference model research (All)
- C++ software port (Patrick and Justin)

Week 2 (11/13/2017):
- Finished C++ software port (Patrick and Justin)
- PSNR (Shaurya)
- Debugging/verification (All)

Week 3 (11/20/2017):
- Began SDSoC exploration (Shaurya)
- perform_conv_hw, perform_bias_hw, SDSoC Makefile (Shaurya)

Week 4 (11/27/2017):
- Continued SDSoC exploration (Shaurya)
- Thanksgiving break (no work done)

Week 5 (12/4/2017):
- Working SW baseline (with SDSoC Makefile) (Shaurya and Patrick)
- Working HW baseline (Shaurya)
- HW optimizations (All)
- Results and data collection (All)

Project development:

Overall, the project developed at a fairly steady pace and took an incremental approach. We began by writing software (C++) based on a MATLAB baseline found on the website of the authors of the paper that served as our inspiration. In doing so we faced many challenges. The first was translation: MATLAB has many "magic" functions (like PSNR) that we had to write ourselves. Furthermore, as MATLAB is 1-indexed and can randomly access array elements quite efficiently (something we cannot do in hardware), we spent quite a bit of time restructuring the convolution and deconvolution functions for proper array access. Following this, we began moving software over to hardware (layer 3). Here we faced many challenges in setting up SDSoC and in getting the data transfer configured so that our software would not hang on the actual FPGAs. We were eventually able to write our own Makefile and set up the hardware functions so that we could run layers in hardware. After being able to run a basic layer in hardware, we moved on to optimizations, where the challenge was restructuring our program to minimize latency without overshooting resource utilization. In the end we were able to achieve all of our goals and offload 4 layers into hardware.

Conclusion

The FSRCNN is a compute-intensive deep neural net, which makes it an optimal candidate for hardware acceleration. By implementing a high-level synthesis of the FSRCNN, we achieved a significant speedup over the CPU for the inference phase of the network. In general, an implementation of inference in hardware is both power- and time-efficient; this type of model therefore has many use cases for image processing in real-time applications or in power-constrained devices such as embedded processors. Moving forward, there are several ways we might address issues we encountered while developing this project. The most efficient implementation would be one residing entirely in hardware: with a large enough FPGA, we could eliminate the CPU-FPGA communication bottleneck by storing all pre-trained weights in on-chip BRAM. Alternatively, large layers such as the deconvolution could be broken into multiple parts through a tiling process wherein convolution operations are performed on separate segments of an image and combined to form a single output. Similarly, there are algorithmic optimizations we could implement to speed up computation, such as using a line buffer for the convolution operations.

References

[1] C. Dong, C. C. Loy, and X. Tang, "Accelerating the Super-Resolution Convolutional Neural Network."
[2] V. Alberto, "An Example of a Convolutional Neural Network for Image Super-Resolution."
[3] V. Alberto, "An Example of a Convolutional Neural Network for Image Super-Resolution - Tutorial."
