ECE 5775 High-Level Digital Design Automation, Fall 2016 School of Electrical and Computer Engineering, Cornell University


Optical Flow on FPGA
Ian Thompson (ijt5), Joseph Featherston (jgf82), Judy Stephen (jls633)

1 Introduction

Optical flow algorithms aim to estimate the motion of objects from a stream of images. The technique is used in applications ranging from video encoding to visual odometry, and it is widely used in robotics for obstacle avoidance and navigation. The task is computationally demanding because of the derivatives and convolutions involved, and real-time applications also require a high frame rate, because accurate object tracking needs a high update rate for the estimated velocity data.

One of the more successful approaches to estimating optical flow is the Lucas-Kanade method, and there have been many FPGA implementations of optical flow algorithms deriving from it [1][2][3]. Our implementation focuses on one specific derivative of the Lucas-Kanade method which is well optimized for FPGAs and reduces overall error [1]. In our project, we implemented such a design using Vivado HLS and SDSoC and achieved an average error in flow direction of about 20 degrees. Our design is fully pipelined, processing one pixel every cycle, so that it can be used in real-time applications.

2 Techniques

The technique used in this paper was optimized for use on an FPGA; in particular, it is divided into five distinct stages to allow pipelining. The input to the algorithm is a sequence of images of the same scene, taken consecutively. To calculate the optical flow for one image, we need an input sequence of four other images along with the image for which we want to calculate the optical flow. The input sequence is described as $g(X)$, where $X = (x, y, t)^T$.
The x and y are the spatial coordinates in the image, while t is the temporal coordinate of each image in the sequence. The input sequence is passed to the first module, which calculates the image's gradient. Figure 1 shows the result of this gradient computation for our test image.

Figure 1: Output of the gradient step for one frame. In this image, blue represents the X gradient, green the Y gradient, and red the time gradient.

The second step, gradient weighting, takes the gradients computed for each pixel by the first module and smooths them to reduce noise in the data. A 2D Gaussian kernel is convolved with all three components of the gradient, applying a small Gaussian blur to the data. Figure 2 shows the smoothed image resulting from this step.
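As a rough illustration of the gradient stage, the sketch below computes the three gradient components for one pixel of a five-frame sequence. The report does not list the filter coefficients, so the five-point central-difference stencil (1, -8, 0, 8, -1)/12 is an assumption, as are the names `Frame`, `Gradient`, and `gradient_at`. Border pixels are simply clamped here, which may differ from what the hardware does.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Hypothetical types for illustration; not taken from the project code.
using Frame = std::vector<std::vector<float>>;  // frame[y][x], grayscale
struct Gradient { float x, y, t; };

// Assumed 5-tap derivative stencil (standard fourth-order central difference).
const std::array<float, 5> kStencil = {1.0f / 12.0f, -8.0f / 12.0f, 0.0f,
                                       8.0f / 12.0f, -1.0f / 12.0f};

// Gradient of the middle frame (frames[2]) at pixel (x, y).
// The x and y components read a single frame; the t component reads all five.
Gradient gradient_at(const std::array<Frame, 5>& frames, int x, int y) {
    const Frame& mid = frames[2];
    assert(!mid.empty() && !mid[0].empty());
    const int h = static_cast<int>(mid.size());
    const int w = static_cast<int>(mid[0].size());
    Gradient g{0.0f, 0.0f, 0.0f};
    for (int k = 0; k < 5; ++k) {
        // Clamp neighbor coordinates at the image border (an assumption).
        const int xx = std::min(std::max(x + k - 2, 0), w - 1);
        const int yy = std::min(std::max(y + k - 2, 0), h - 1);
        g.x += kStencil[k] * mid[y][xx];
        g.y += kStencil[k] * mid[yy][x];
        g.t += kStencil[k] * frames[k][y][x];
    }
    return g;
}
```

On an intensity ramp such as 2x + 3y + 5t, this stencil recovers the slopes (2, 3, 5) at interior pixels, which is a convenient sanity check for any replacement kernel.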


Figure 2: The smoothed output of the gradient weighting step

The third step, outer product, computes the outer product of the smoothed gradient data. This yields a 3x3 matrix O for each pixel, computed from the following equation, where $g_x$, $g_y$, and $g_t$ are the three components of the smoothed gradient. Note that while the matrix is 3x3, it is symmetric and therefore contains only 6 distinct values:

$$O = g(X)\, g(X)^T = \begin{pmatrix} g_x^2 & g_x g_y & g_x g_t \\ g_x g_y & g_y^2 & g_y g_t \\ g_x g_t & g_y g_t & g_t^2 \end{pmatrix}$$

After the outer product is computed, it is smoothed again with a Gaussian blur, similar to the gradient weighting step, to produce the final gradient tensor T. The blur is applied to all 6 values produced in the outer product step. Figure 3 shows cropped images of the six results from this step.

$$T = \sum_i c_i O_i = \begin{pmatrix} t_1 & t_4 & t_5 \\ t_4 & t_2 & t_6 \\ t_5 & t_6 & t_3 \end{pmatrix}$$
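The outer product step can be sketched in a few lines. Because the matrix is symmetric, only six products need to be computed and stored per pixel. The names `Gradient`, `Tensor6`, `outer_product`, and `to_matrix` are hypothetical, and the packing order (xx, yy, tt, xy, xt, yt) is an assumption chosen to line up with the $t_1$ to $t_6$ labels used for the tensor.

```cpp
#include <array>

// Hypothetical names for illustration; not taken from the project code.
struct Gradient { float x, y, t; };

// Six distinct entries of the symmetric 3x3 matrix g * g^T,
// packed in the assumed order: xx, yy, tt, xy, xt, yt.
using Tensor6 = std::array<float, 6>;

Tensor6 outer_product(const Gradient& g) {
    return {g.x * g.x, g.y * g.y, g.t * g.t, g.x * g.y, g.x * g.t, g.y * g.t};
}

// Expand the packed form back into the full symmetric matrix.
std::array<std::array<float, 3>, 3> to_matrix(const Tensor6& o) {
    std::array<std::array<float, 3>, 3> m{};
    m[0] = {o[0], o[3], o[4]};
    m[1] = {o[3], o[1], o[5]};
    m[2] = {o[4], o[5], o[2]};
    return m;
}
```

Storing six values instead of nine shrinks every downstream FIFO and blur stage by a third, which is presumably why the packed form is worth the extra indexing.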

Figure 3: The six images produced after tensor weighting. The bottom row, from left to right, is $t_1$, $t_2$, and $t_3$. The top row, from left to right, is $t_4$, $t_5$, and $t_6$.

The final step computes the flow vector from the gradient tensor T. Optical flow can be represented by $(v_x, v_y)^T$, measured in pixels per frame. It can be extended to $v = (v_x, v_y, 1)^T$, the 3D spatio-temporal vector. The idea behind the optical flow calculation is that a moving object with only translational motion and no noise would result in $v^T T v = 0$. Real images, however, contain noise and rotation, so the objective is instead to minimize $v^T T v$, since it will not be exactly zero. The x and y components, $v_x$ and $v_y$, can be calculated from the entries of the gradient tensor T:

$$v_x = \frac{t_6 t_4 - t_5 t_2}{t_1 t_2 - t_4^2}, \qquad v_y = \frac{t_5 t_4 - t_6 t_1}{t_1 t_2 - t_4^2}$$

3 Implementation

The optical flow estimation engine was implemented in C++ as a hardware/software codesign targeting the ZC706 board. All five main stages of the optical flow pipeline are implemented in hardware. The software on the ARM core is responsible for loading the input images, converting them to grayscale, passing them to the hardware, and reading the outputs from a memory buffer. The host program then saves this output buffer and computes the overall error.

Most real applications will run some additional processing on top of the flow calculation, implemented either in hardware or in software. Since we have no additional processing, the bulk of the computation is completed in the hardware accelerator. For our project, we simply display the estimated optical flow as a colored image. This image is generated from the output of our flow algorithm using open source code acquired with the MPI Sintel Dataset [4].

The accelerator design consists of a top-level function and 8 submodules, implemented as individual functions. The overall datapath for the accelerator design is shown in Figure 4.
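The closed-form expressions for $v_x$ and $v_y$ given in the Techniques section amount to a 2x2 linear solve per pixel, which is why the flow stage needs no neighborhood data. A minimal sketch follows. The names `Tensor6`, `Flow`, and `solve_flow` are hypothetical, the packing order of the tensor is an assumption, and the guard against a near-zero denominator is our addition rather than something the report describes.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Assumed packing: t1..t6 = xx, yy, tt, xy, xt, yt entries of the tensor.
using Tensor6 = std::array<float, 6>;

struct Flow {
    float vx, vy;
    bool valid;  // false when the 2x2 system is (nearly) singular
};

// Minimize v^T T v over v = (vx, vy, 1) by solving the 2x2 normal equations.
Flow solve_flow(const Tensor6& t, float eps = 1e-6f) {
    assert(eps > 0.0f);
    const float t1 = t[0], t2 = t[1], t4 = t[3], t5 = t[4], t6 = t[5];
    const float denom = t1 * t2 - t4 * t4;
    if (std::fabs(denom) < eps) {
        return {0.0f, 0.0f, false};  // flat region: flow is undetermined
    }
    return {(t6 * t4 - t5 * t2) / denom, (t5 * t4 - t6 * t1) / denom, true};
}
```

Note that $t_3$ does not appear in the solution: it only shifts the value of $v^T T v$ and has no effect on where the minimum lies.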
There is one submodule for each block in the figure, except that the two weighting steps are each split into an X and a Y weighting. Additionally, the gradient calculation for the Z direction is split into its own function. The Z gradient is fundamentally different from the X and Y gradients: it is a time-domain gradient which requires 5 frames of data to compute. The X and Y gradients, on the other hand, only look at a single frame, but need line buffers to look at the neighborhood around a pixel.

Each submodule function uses two-dimensional arrays to pass inputs and outputs, which allows each stage to be written as a transformation over a framebuffer rather than as a streaming algorithm. With a dataflow optimization, SDSoC implements these arrays as FIFO buffers which connect the stages together. This allows the entire design to be pipelined to an Initiation Interval (II) of 1, that is, one pixel per cycle, provided all the submodules achieve this II.

Figure 4: Diagram of algorithm implementation

The outer product and final optical flow calculation steps are straightforward to pipeline, since they only look at a single pixel and not at any neighborhood. These functions need only a simple pipeline directive, and their inputs and outputs are automatically implemented as FIFOs.

The XY gradient calculation, gradient weighting, and tensor weighting steps, however, are all essentially convolutions, which compute over a neighborhood of pixels around each individual pixel. For example, the gradient calculation has a kernel size of 5, which means that the pixels in a 5x5 grid must be accessed to compute the gradient for any pixel. To store this data, we use two data structures: a line buffer and a window. These data structures are classes supplied by Vivado HLS and SDSoC. The line buffer is used to store entire rows of the image on chip using BRAM resources.
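To make the two buffers concrete, here is a plain C++ software model of a line buffer feeding a window. It is only a sketch: it does not use the vendor-supplied classes, the names `LineBuffer`, `Window`, and `window_sums` are hypothetical, and a simple neighborhood sum stands in for the real convolution kernels.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Software model of a line buffer: ROWS image rows of COLS pixels each.
template <std::size_t ROWS, std::size_t COLS>
struct LineBuffer {
    std::array<std::array<float, COLS>, ROWS> rows{};
    // Shift one column up by a row and store the new pixel in the bottom row.
    // Only a single column is touched, i.e. one entry per stored row.
    void push(std::size_t col, float pixel) {
        for (std::size_t r = 0; r + 1 < ROWS; ++r) rows[r][col] = rows[r + 1][col];
        rows[ROWS - 1][col] = pixel;
    }
};

// Software model of a KxK window whose entries are all readable at once.
template <std::size_t K>
struct Window {
    std::array<std::array<float, K>, K> v{};
    // Shift every row left by one and load a new right-hand column.
    void push_column(const std::array<float, K>& column) {
        for (std::size_t r = 0; r < K; ++r) {
            for (std::size_t c = 0; c + 1 < K; ++c) v[r][c] = v[r][c + 1];
            v[r][K - 1] = column[r];
        }
    }
};

// Stream an image one pixel per iteration and sum each KxK neighborhood.
// out[y][x] holds the sum of the window whose bottom-right corner is (y, x);
// positions that do not yet have a full window are left at zero.
template <std::size_t K, std::size_t W>
std::vector<std::vector<float>> window_sums(const std::vector<std::vector<float>>& img) {
    LineBuffer<K, W> lines;
    Window<K> win;
    std::vector<std::vector<float>> out(img.size(), std::vector<float>(W, 0.0f));
    for (std::size_t y = 0; y < img.size(); ++y) {
        assert(img[y].size() == W);
        for (std::size_t x = 0; x < W; ++x) {
            lines.push(x, img[y][x]);
            std::array<float, K> column{};
            for (std::size_t r = 0; r < K; ++r) column[r] = lines.rows[r][x];
            win.push_column(column);
            if (y + 1 >= K && x + 1 >= K) {
                float sum = 0.0f;
                for (const auto& row : win.v)
                    for (float p : row) sum += p;
                out[y][x] = sum;
            }
        }
    }
    return out;
}
```

Each loop iteration reads one new pixel, touches one column of the line buffer, and then reads the whole window, which mirrors the access pattern that lets the hardware loop reach an II of 1.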
This makes it possible to store all of the rows within the convolutional window in on-chip memory. However, the BRAMs only have enough read ports to permit one read per row per cycle, which means that a convolution cannot read all of its pixels directly from the line buffer. The window data structure permits reads to every pixel within the window, and is filled with data from the line buffer. The window stores the local 5x5 values within the convolutional window and is implemented with registers, so that all entries can be accessed within the same cycle.

The process for the tensor and gradient weighting is similar to the gradient calculation. However, because the convolution is separable, the X and Y steps of the weighting are performed separately, so each module requires only a line buffer or only a window. The Y weighting steps use line buffers, since they require just one read per row. These line buffers are sized at 7 rows for gradient weighting and 3 rows for tensor weighting. The X weighting steps use windows instead, since they read all of their data from a single row. These store 7 pixels of data for gradient weighting and 3 for tensor weighting.

One challenge we encountered while implementing this pipeline was the connection between gradient calculation and gradient weighting. Since gradient calculation was split into two modules, both of which needed the pixel value from the same frame, a small block was added in front of gradient calculation to split the incoming image stream. However, the XY gradient calculation needed to consume 2 full rows of image data before it would begin producing any output data, because it performs a convolution. The Z gradient calculation had no such restriction and would begin outputting data immediately. This caused our pipeline to deadlock when we first loaded it onto hardware, since there was no additional buffering between the Z gradient calculation and gradient weighting. Because gradient weighting pulled data from XY and Z at the same time, Z had to stall until XY began outputting data. This in turn stalled the distributor module in front of XY and Z, which then stalled XY, deadlocking the whole system. To fix this, we increased the size of the output FIFO of the Z gradient calculation so that it can hold 3 full rows of pixel data, which is enough to cover the period in which no data comes from the XY gradient calculation.

4 Evaluation

To evaluate our design, we loaded it onto a Xilinx ZC706 board and observed the execution time for evaluating one frame of optical flow data, which involves reading 5 frames of input data. Our evaluation was primarily based on processing a sequence of 1024x436 images from the Sintel dataset [4]. The 3rd frame of the evaluation sequence is shown in Figure 4. Accuracy was evaluated by comparing the results against ground truth data from the Sintel dataset. The ground truth for this sequence is shown in Figure 5.

Accuracy is given as the mean absolute error between the angle reported by our design and the angle of the ground truth data. The magnitude of the result was not directly considered when computing accuracy; however, we did use a threshold on magnitude to determine whether the algorithm had produced a valid output value. If the algorithm reported a flow vector with magnitude greater than this threshold (vel_x^2 + vel_y^2 > 25 in our implementation), the pixel was discarded and not considered for accuracy, since the algorithm had reported a velocity faster than it could measure. These pixels were all on surfaces moving too fast in the input sequence to be tracked by our design.
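The accuracy metric described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's evaluation code: the names `Flow` and `mean_angle_error_deg` are hypothetical, and wrapping the angle difference into [-180, 180] degrees is our assumption, since the report does not say how wraparound was handled.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Flow { float vx, vy; };  // hypothetical per-pixel flow vector

// Mean absolute angular error, in degrees, between estimated and true flow.
// Pixels whose estimated squared magnitude exceeds max_mag_sq are discarded.
// Returns a negative value if every pixel was discarded.
double mean_angle_error_deg(const std::vector<Flow>& est,
                            const std::vector<Flow>& truth,
                            float max_mag_sq = 25.0f) {
    assert(est.size() == truth.size());
    const double kPi = 3.14159265358979323846;
    double total = 0.0;
    std::size_t count = 0;
    for (std::size_t i = 0; i < est.size(); ++i) {
        const Flow& e = est[i];
        if (e.vx * e.vx + e.vy * e.vy > max_mag_sq) continue;  // too fast to trust
        double diff = std::atan2(e.vy, e.vx) - std::atan2(truth[i].vy, truth[i].vx);
        // Wrap into [-pi, pi] so that 350 degrees vs 10 degrees counts as 20.
        while (diff > kPi) diff -= 2.0 * kPi;
        while (diff < -kPi) diff += 2.0 * kPi;
        total += std::fabs(diff) * 180.0 / kPi;
        ++count;
    }
    return count ? total / static_cast<double>(count) : -1.0;
}
```

Because the threshold is applied to the estimate and not to the ground truth, the metric says nothing about how many fast-moving pixels were dropped; reporting the discarded fraction alongside the error would make results easier to compare.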
Figure 6 is a visualization of our algorithm's output, where discarded regions are shown in black.

Figure 4: Input image from the test sequence

Figure 5: Ground truth image from the test sequence

Figure 6: Output of our design for the test sequence

Freq        BRAMs  DSPs  LUTs  Flip-Flops  Latency  Pixel II
142.… MHz   15%    50%   29%   19%         … cyc    1

Table 1: Synthesis results for our design

Table 1 summarizes our synthesized hardware design. The hardware resource usage is taken from the final synthesis report, which means it includes the resources used to interface our accelerator with the host processor. The clock frequency of the accelerator was chosen to match the frequency of the rest of the data motion network. Our DSP usage is fairly high because floating point numbers are used throughout the entire design. In theory, our design could be converted to use fixed-point numbers instead. Doing this should drastically reduce the DSP usage, and should also reduce BRAM utilization by narrowing all the FIFOs and line buffers in the design.

Our design's latency, however, would not be affected much by the change. Since our design is pipelined to an II of 1, the latency is primarily controlled by the trip count of our longest loop. The longest loop in our design is in the XY gradient calculation stage, which iterates 2 additional steps beyond the extents of the x and y dimensions. The loop therefore effectively processes a 1026x438 image, meaning it runs for 449,388 iterations. Comparing this with the latency of the pipeline as a whole shows that the latency of a single pixel is only 430 cycles. While converting to fixed point would likely decrease this single-pixel latency significantly, any such improvement would be insignificant compared to the latency of the design operating on an entire image.

Software Time  Hardware Time  Speedup
533 ms         49 ms          10.9x

Table 2: Performance comparison between software baseline and hardware implementation

Table 2 summarizes our hardware design's performance compared to our software baseline. Since both designs used floating point numbers in their calculations, and both used the same algorithms, both came out to the exact same accuracy value of 20.1 degrees. Our hardware design reaches almost 11 times the speed of the software implementation. This speedup matters because it puts the algorithm in a reasonable performance range for real-time applications. When connected to a continuous source of pixels, the startup overhead of the host processor configuring and launching the accelerator would be amortized, allowing our design to process a video stream at full speed. Theoretically, assuming the memory system provides pixels fast enough, our design should be able to process this image in 3.2 ms. This implies that about 45 ms of our hardware implementation's runtime is spent setting up the module and transferring data in and out.
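The figures quoted in this section can be cross-checked with a little arithmetic. The frame rates in the last two lines are derived here from the reported times as a rough guide; they were not measured in this project.

```latex
\begin{align*}
\text{loop iterations} &= 1026 \times 438 = 449{,}388 \\
\text{speedup} &= \frac{533\ \text{ms}}{49\ \text{ms}} \approx 10.9 \\
\text{setup and transfer overhead} &\approx 49\ \text{ms} - 3.2\ \text{ms} = 45.8\ \text{ms} \\
\text{frame rate, measured run} &\approx \frac{1}{49\ \text{ms}} \approx 20\ \text{frames/s} \\
\text{frame rate, streaming bound} &\approx \frac{1}{3.2\ \text{ms}} \approx 312\ \text{frames/s}
\end{align*}
```

The gap between the last two lines shows how much of the measured time is overhead rather than computation.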
If this design were set up in a full video pipeline, this setup and transfer overhead would be removed, making the speedup over the software implementation much higher.

5 Project Management

At the outset of this project, we had planned to have each person write a different stage of the algorithm. However, due to limitations in our schedules, we had to forgo that plan and instead implement components whenever people were available. We started by having all three of us do a literature search for FPGA implementations of optical flow algorithms. Once we settled on a paper and algorithm, our next target was a software implementation of the algorithm. Judy began this by working on a MATLAB implementation. However, we later decided to switch to a C++ implementation so that it could be easily adapted for use with Vivado HLS. Joe authored the majority of this software implementation, and later focused on making it synthesizable. Ian and Judy assisted in debugging the implementation and optimizing it to maximize accuracy. Ian focused primarily on integrating the HLS module into the rest of the system using SDSoC, and led the effort to implement the design on real hardware. Alongside this, Judy and Joe explored converting our design to fixed point; however, we were not able to get the error rate down to a reasonable number in time for this project.

6 Conclusion

Our objective was to implement a real-time optical flow estimation algorithm on an FPGA. Although we did not implement a way to supply a continuous stream of images to the design, our design was able to compute optical flow for a single sequence of images, and did so in such a way that it could handle a real-time stream of images. Future work could extend this project to calculate optical flow in real time on a video streamed into the FPGA board.
Our algorithm was divided into five key stages, where the output of one stage is passed to the input of the following stage. A sequence of five input images is sent to the first stage, which calculates the spatio-temporal gradient of the image. This gradient is then smoothed in the second stage using a Gaussian blur kernel. The third stage computes the outer product of these gradients, which is itself smoothed in the fourth stage using another Gaussian blur kernel. The tensor produced by blurring the outer product is used by the fifth and final stage to perform the actual flow calculation.

Our implementation achieved a 20 degree average error in flow heading and was successfully run on a Xilinx ZC706 FPGA board. Our design was sufficiently pipelined to allow it to process a video stream in real time. While we used floating-point calculations in our design, a good addition to this project would be to adjust it to use fixed-point numbers, which would significantly reduce resource consumption and potentially allow it to run on a smaller board such as a ZC702.

This project gave us a much deeper understanding of the intricacies involved in making a complex design work using high-level synthesis. Through it, we learned a lot about how to structure an algorithm so that it works well with high-level synthesis, and how to debug such designs when they don't work.

We think that the course project could be improved by pushing more of the final project's experience into the labs. In particular, exposure to SDSoC before the project would be good, since it would give a better idea of what sort of interactions with the host processor are possible and efficient to implement. Additionally, it would be nice to have more resources available to help debug cases where the synthesized design's results don't match the csim results, or where the synthesized design simply deadlocks. A guide to common pitfalls and mistakes which can cause these mismatches would be particularly helpful. That said, this project was a very valuable learning experience, which will be useful in future endeavors involving high-level synthesis.

7 Appendix

Code located at:

8 References

[1] Z. Wei, D. Lee, B. Nelson, FPGA-Based Real-Time Optical Flow Algorithm Design and Implementation, Journal of Multimedia, Vol. 2, No. 5, Sept.
[2] J. Diaz, E. Ross, F. Pelayo, E. Ortigosa, S. Mota, FPGA-Based Real-Time Optical-Flow System, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 16, No. 2, Feb.
[3] J. Porter, M. Thomson, A. Wahab, Lucas-Kanade Optical Flow Accelerator, MIT 6.375, Spring
[4] D. J. Butler, J. Wulff, G. B. Stanley, M. J. Black, A Naturalistic Open Source Movie for Optical Flow Evaluation, European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, Oct. 2012.
