Bridging the Gap between FPGAs and Multi-Processor Architectures: A Video Processing Perspective


Ben Cope 1, Peter Y.K. Cheung 1 and Wayne Luk 2
1 Department of Electrical & Electronic Engineering, Imperial College London
2 Department of Computing, Imperial College London
{benjamin.cope, p.cheung, w.luk}@imperial.ac.uk

Abstract

This work explores how the graphics processing unit (GPU) pipeline model can influence future multi-core architectures which include reconfigurable logic cores. The design challenges of implementing five algorithms on two field programmable gate arrays (FPGAs) and two GPUs are explained and the performance results contrasted. Explored algorithm features include data dependence, flexible data reuse patterns and histogram generation. A customisable SystemC model is implemented which demonstrates that features of the GPU pipeline can be transferred to a general multi-core architecture. The customisations are: choice of processing unit (PU); processing pattern; and on-chip memory organisation. Example tradeoffs are: the choice of processing pattern for histogram equalisation; the choice of number of PUs; and memory sizing for motion vector estimation. It is shown that a multi-core architecture can be optimised for video processing by combining a GPU pipeline with cores that support reconfigurable datapath operations.

1. Introduction

Whilst dual-core CPUs are becoming commonplace, NVIDIA has released the 128-core GeForce 8800 GTX GPU. GPU developers overcome the design and programming issues of multi-core scalability through a constrained pipeline model. This puts the GPU in direct competition with the FPGA as an accelerator in a video processing system. This work explores how the pipeline model of the GPU can influence future multi-core video processing architectures to improve performance and reduce design effort. The core of such an architecture can be a processor or FPGA logic.

The contributions of this work are: 1) the presentation of FPGA and GPU design challenges and performance results for five algorithms; 2) a customisable multi-core SystemC model; 3) an exploration of model customisations; and 4) a summary of lessons for future multi-core architectures.

The paper is organised as follows: section 2 outlines related literature; key GPU pipeline features are described in section 3; section 4 describes FPGA and GPU design challenges for five algorithms; FPGA and GPU performance results are analysed in section 5; section 6 explores the customisable model; and conclusions are presented in section 7.

2. Related Work

FPGA architectures can be used to exploit regular data access patterns and parallelism in video processing algorithms. Sonic-on-Chip [1] is an example multi-core FPGA architecture which exploits these factors. One application of Sonic-on-Chip is 3-step non-full-search motion vector estimation (NFS-MVE); a four channel (core) implementation achieves a target throughput of 6.8 million pixels per second (MP/s) [1].

The GPU is another multi-core architecture capable of exploiting data access locality and parallelism in algorithms. The literature shows examples of one to two orders of magnitude performance improvement for the GPU over the CPU [2, 3, 4]. The work presented here combines the GPU pipeline with the flexibility of reconfigurable logic cores.

Previous work by the authors on comparing the FPGA and GPU focused on an algorithm's match to the GPU instruction set and the effect of changing arithmetic intensity [2].
The arithmetic intensity of an algorithm is the ratio of the number of arithmetic operations to the number of memory accesses. The results in section 5 additionally consider algorithms which require data dependence, memory gather and flexible data reuse. Memory gather in this work means that an entire video frame is required to produce a small number of output values, for example a cumulative histogram. These demonstrate higher level characteristics of the FPGA and GPU architectures.

3. GPU Architecture

The GPU cores of the NVIDIA GeForce 6800 GT and 7800 GTX graphics platforms are considered in this work. Key GPU architecture features, relevant to this work, are explained below. The focus is on the fragment processing stage of the GPU pipeline; for further details and background on the entire GPU pipeline refer to [6].

The GPU pipeline exploits iteration level parallelism through multiple fragment processors organised in groups of four (known as quads). The number of quads varies with the GPU model, as shown in Table 1. The GPU is scalable to multiple quads because of two constraints: no result sharing between fragment processors, and a feed-through pipeline. Without these constraints, multi-core scalability would be restricted by synchronisation requirements.

The feed-through pipeline of the GPU ensures that an external memory location can only be read from or written to, but not both, during each processing pass. This means that some algorithms require a multi-pass implementation [3]. Two examples of this are described in section 4. A further feature which simplifies the GPU programming model, and promotes scalability, is that all fragment processors execute the same program code. This code is commonly referred to as a fragment shader.

To support the above features of the GPU pipeline, an efficient memory hierarchy is required. The components of the hierarchy are off-chip memory, GPU caches and output buffering. The memory requirements of graphics rendering demand the large memory bandwidth between the GPU and off-chip memory shown in Table 1. A shared cache is used to exploit data reuse between off-chip memory and the fragment processors. The literature suggests a 4-way associative 16 KByte cache [7].

| Model (Release Year) | Fragment Processors | Memory Bandwidth | Core Clock |
|----------------------|---------------------|------------------|------------|
| GF 7800 GTX (2005)   | 24 (6 quads)        | 38.4 GBytes/sec  | 430 MHz    |
| GF 6800 GT (2004)    | 16 (4 quads)        | 32 GBytes/sec    | 350 MHz    |

Table 1. Specifications of the NVIDIA GeForce 6800 GT and 7800 GTX [5]. (The fragment processor counts follow from the quoted quad counts; the bandwidth and clock values, lost in extraction, are restored from NVIDIA's published specifications.)

4. Benchmark Algorithms

Table 2 characterises five algorithms which represent features common to video processing. Characteristics which present interesting design challenges for FPGA and GPU implementation are described below.

| Algorithm | Memory Accesses (per pixel) | Reuse Potential | Memory Access Pattern | Arithmetic Intensity |
|-----------|-----------------------------|-----------------|------------------------|----------------------|
| Bi-cubic Interpolation | 16 | low-medium | predictable | low |
| Histogram Equalisation (HE): bin calculation (of cumulative histogram) | 1 (gather) | null | predictable | medium |
| HE: application | 1 + 1 (lookup table) | null | predictable | low |
| 3-step non-full-search motion vector estimation (NFS-MVE) [1] | – | high | locally random | high |
| Primary Colour Correction (PCCR) [2] | 1 | null | predictable | high |
| 2D Convolution (size n×n) [2] | n² | high | predictable | low |

Table 2. Characterisation of a representative set of video processing algorithms

Bi-cubic interpolation can be used to resize a video frame to an arbitrary higher resolution. This flexibility is indicated by the low-to-medium reuse potential in Table 2. An FPGA architecture exploits data reuse by using on-chip buffers to store rows of a video frame; supporting arbitrary resizing therefore requires a complex heuristic to implement a flexible reuse strategy. GPU implementation of flexible reuse is straightforward because it is a feature supported by the GPU's memory hierarchy.
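To make the 16-accesses-per-pixel figure in Table 2 concrete, the following is a minimal C++ sketch of one bi-cubic output sample. The Catmull-Rom kernel, image layout and border clamping are illustrative assumptions, not details of the implementations benchmarked in this paper.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Catmull-Rom cubic kernel weight for an offset t in (-2, 2).
static float cubicWeight(float t) {
    t = std::fabs(t);
    if (t < 1.0f) return 1.5f * t * t * t - 2.5f * t * t + 1.0f;
    if (t < 2.0f) return -0.5f * t * t * t + 2.5f * t * t - 4.0f * t + 2.0f;
    return 0.0f;
}

// One output pixel at real-valued source position (x, y):
// a 4x4 neighbourhood gather, i.e. 16 memory accesses per pixel.
float bicubicSample(const std::vector<float>& img, int w, int h,
                    float x, float y) {
    int x0 = static_cast<int>(std::floor(x));
    int y0 = static_cast<int>(std::floor(y));
    float acc = 0.0f;
    for (int j = -1; j <= 2; ++j) {
        for (int i = -1; i <= 2; ++i) {
            int xi = std::clamp(x0 + i, 0, w - 1);  // clamp at borders
            int yj = std::clamp(y0 + j, 0, h - 1);
            float wgt = cubicWeight(x - float(x0 + i)) *
                        cubicWeight(y - float(y0 + j));
            acc += wgt * img[yj * w + xi];
        }
    }
    return acc;
}
```

Adjacent output pixels share most of their 4×4 input neighbourhood; the GPU cache captures this reuse automatically, whereas an FPGA design must capture it explicitly with line buffers sized according to the arbitrary resize factor.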
Histogram equalisation (HE) is a special case of memory access which requires a memory gather: HE calculation requires an entire video frame to be reduced to, in this example, a 256-bin cumulative histogram, matching the 8-bit resolution of video data. A multi-pass GPU implementation is required: four rendering passes are needed to calculate the histogram, and a final pass applies it to the video frame to equalise the intensity spectrum. The FPGA implementation requires a ROM decoder coupled with 256 parallel accumulators for HE calculation, and a lookup table for HE application.

NFS-MVE is a three-step algorithm which, on each step and for one search window, computes nine (a 3×3 grid of) sum of absolute difference (SAD) calculations with overlapping windows in a reference frame. The reference frame window locations in steps 2 and 3 depend on a comparison of the SAD results from previous steps. The access pattern for NFS-MVE is termed locally random because the memory access in steps 2 and 3 is data dependent within a spatially local area. GPU implementation of NFS-MVE also requires a multi-pass approach: each of the three steps requires three GPU passes, for a total of nine rendering passes. The FPGA implementation of NFS-MVE is taken from work by Sedcole [1]; implementation highlights were mentioned in section 2.

The primary colour correction and 2D convolution algorithms are examples of high arithmetic intensity and variable memory access requirements respectively [2]. For full implementation details and code for all the algorithms in Table 2, please contact the primary author.
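As a concrete reference for the locally random access pattern described above, here is a hedged C++ sketch of one step of the 3-step search. The 16×16 block size and frame layout are assumptions for illustration (bounds checks omitted), not details of the Sedcole implementation [1].

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Sum of absolute differences between a current-frame block and a
// candidate block in the reference frame.
uint32_t sad16x16(const std::vector<uint8_t>& cur,
                  const std::vector<uint8_t>& ref,
                  int stride, int cx, int cy, int rx, int ry) {
    uint32_t sum = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x)
            sum += std::abs(int(cur[(cy + y) * stride + (cx + x)]) -
                            int(ref[(ry + y) * stride + (rx + x)]));
    return sum;
}

// One step of the 3-step search: evaluate a 3x3 grid of candidates
// spaced `step` pixels around the current best offset. The winning
// offset determines where the next step reads, hence the data-dependent,
// "locally random" access pattern.
void searchStep(const std::vector<uint8_t>& cur,
                const std::vector<uint8_t>& ref, int stride,
                int cx, int cy, int step, int& bestDx, int& bestDy) {
    uint32_t best = UINT32_MAX;
    int baseDx = bestDx, baseDy = bestDy;
    for (int j = -1; j <= 1; ++j)
        for (int i = -1; i <= 1; ++i) {
            int dx = baseDx + i * step, dy = baseDy + j * step;
            uint32_t s = sad16x16(cur, ref, stride, cx, cy, cx + dx, cy + dy);
            if (s < best) { best = s; bestDx = dx; bestDy = dy; }
        }
}
```

The winning offset of each step becomes the centre of the next, so the addresses touched in steps 2 and 3 cannot be known until the previous step's SAD comparison completes.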

Figure 1. Throughput, in millions of pixels per second (MP/s), of the benchmark algorithms on the FPGA (Virtex 2, Virtex 4) and GPU (GeForce 6800 GT, 7800 GTX): interpolation (variants (A), (B) and (C) denote three different resize resolutions), 2D convolution (5×5, 9×9, 11×11), primary colour correction, histogram equalisation and NFS-MVE.

5. Benchmarking the FPGA and GPU

Performance results for the FPGA and GPU are shown in Figure 1. The FPGA implementations are coded in VHDL and synthesised using the Xilinx ISE 8.2i design suite; post place-and-route speeds are shown. GPU code is compiled using the NVIDIA Cg compiler version 1.5. The GLSLProgram library, created at Imperial College London, is used to create a layer of abstraction between C++ code and the OpenGL API. CPU cycle counts are used to measure GPU throughput. All benchmarks are implemented by the primary author, with the exception of NFS-MVE on the FPGA [1]. The primary colour correction and 2D convolution results are updates of work published elsewhere [2].

An interesting result in Figure 1 is that the GeForce 7800 GTX implementation of primary colour correction outperforms the Virtex 4. This is the opposite of the result shown in prior work [2] comparing the then state-of-the-art GeForce 6800 GT and Virtex 2, and demonstrates the rapid performance improvement of modern GPUs.

For bi-cubic interpolation, the GeForce 7800 GTX shows superior performance of up to three times that of the FPGA. GPUs are well suited to algorithms with arbitrary reuse patterns over a small window size (4×4 for bi-cubic interpolation) because this mimics graphics rendering. The FPGA speed is limited by the complex heuristic required for the reuse pattern control logic. This is exemplified further by comparing the throughput of the FPGA interpolation and 2D convolution (size 5×5) implementations: 2D convolution has higher throughput despite the larger window size, because of the complex data reuse control logic required to implement arbitrary bi-cubic interpolation on the FPGA.

For histogram equalisation (HE) the FPGA outperforms the GPU by over three times. This is due to the five-pass algorithm, described in section 4, required for GPU implementation. The four passes required for cumulative histogram generation take 92% of the execution time. This highlights a limitation of the GPU when the designer desires to gather data from an entire frame, for example to generate a cumulative histogram. It is a result of the restriction that fragment processors cannot share computation results, and additionally that a fragment processor cannot hold computation results between the pixels it processes. These limitations are explored further in section 6.4.

Data dependence for NFS-MVE is considered by using two video test sources provided by the project sponsor. GPU performance results are indistinguishable for any degree of video motion. This is a product of the GPU's computer graphics rendering heritage, which requires high performance for locally random memory accesses. NFS-MVE appears slow on the FPGA because the architecture is targeted at a desired throughput rate rather than maximal performance [1]. The performance of the GPU implementation of NFS-MVE is also low relative to the other algorithms, because of the large memory access requirements shown in Table 2.

A multi-pass GPU implementation is not always a problem. For HE a more intuitive implementation can be made on the FPGA.
However, for NFS-MVE the GPU approach was shown to have the highest performance despite a nine-pass implementation. The requirement to implement a multi-pass approach is an acceptable sacrifice for the design and scalability benefits of the fixed GPU pipeline. A further interesting benefit of the GPU pipeline is its support for arbitrary data reuse patterns.
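The feed-through constraint can be made concrete with a CPU-side sketch of the GPU's multi-pass idiom: each pass reads one buffer and writes another, so a frame-wide gather must be decomposed into repeated ping-pong passes. This is a simplified model of the rendering passes, not the shader code used in the benchmarks.

```cpp
#include <vector>

// Emulate a GPU reduction under the feed-through constraint: each pass
// reads only from `src` and writes only to `dst` (no in-place update,
// no result sharing between "fragments"). Each pass here halves the
// element count by summing pairs, so log2(N) passes reduce a frame to
// one value; this is why frame-wide gathers need several rendering passes.
float reduceMultiPass(std::vector<float> src) {
    std::vector<float> dst;
    while (src.size() > 1) {                 // one "rendering pass"
        dst.resize((src.size() + 1) / 2);
        for (size_t i = 0; i < dst.size(); ++i) {
            float a = src[2 * i];
            float b = (2 * i + 1 < src.size()) ? src[2 * i + 1] : 0.0f;
            dst[i] = a + b;                  // each output reads two inputs
        }
        src.swap(dst);                       // ping-pong the buffers
    }
    return src.empty() ? 0.0f : src[0];
}
```

In a real shader each pass reduces a 2D region by a larger factor per output, which is consistent with a 256-bin histogram being gathered in the four passes described in section 4.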

Figure 2. A multi-core model (PU = processing unit, MMU = memory management unit). A pattern generation module feeds n processing groups, each containing m PUs behind a local MMU; the groups share on-chip memory and a global memory I/O to off-chip memory, and write results to an output buffer.

Figure 3. Cycle estimates (×10^6) for benchmarks implemented on the multi-core model with m = 4 and n = 4, 6, 8, plotted alongside the GeForce 6800 GT, for PCCR, interpolation (variants A, B, C) and 2D convolution (5×5, 9×9, 11×11), for a fixed pixel frame size.

6. A Multi-Core Architecture

This section explores the application of the GPU pipeline to a customisable multi-core architecture. In section 5 the GPU architecture was shown to be beneficial for implementing algorithms which require arbitrary data reuse, but a limitation when generating a cumulative histogram. It is desired to include a subset of GPU pipeline features in a model of the customisable multi-core architecture. Two GPU pipeline features, identified in section 3 as key to the model's scalability, are a feed-through pipeline and no data sharing between fragment processors. The model's customisable options are: number of PUs (scalability); on-chip memory; processing pattern; and choice of processing unit (PU).

Section 6.1 describes the model. Three algorithms are used to verify the model and show scalability in section 6.2. In section 6.3 on-chip memory choice is explored through NFS-MVE. HE is used to explore the effect of a controlled processing pattern in section 6.4. Section 6.5 shows options for the choice of processing unit, and general lessons for future video processing architectures are presented in section 6.6.

6.1. The Model and Initial Setup

The IEEE SystemC class library is chosen as the implementation platform. The flexibility of SystemC allows a high-level customisable multi-core model to be created. Figure 2 shows a high-level diagram of the model, which is a modified and simplified version of the fragment processing stage of the GPU pipeline [6]. The model and its differences from the GPU pipeline are described below.

The multi-core hierarchy has m processing units (PUs) implemented in each of the n processing groups. Each processing group processes a block of m×m pixels concurrently. For m = 4 and n = 4 this is similar to the GeForce 6800 GT fragment processor set-up.

The feed-through pipeline is formed as follows. The pattern generation module supplies each processing group with the order in which to process pixels. The memory hierarchy is restricted such that data can only be fetched from one memory location and written to another. Outputs from the processing groups are combined in an output buffer.

The pixel processing order is produced by the pattern generation module and can be set to any arbitrary sequence. This is in contrast to the GPU, where the pixel processing order is determined by the output of the vertex processing and rasterisation pipeline stages.

Processing units (PUs) connect to memory through a local memory management unit (MMU). Each MMU accesses pixels from the on-chip memory on behalf of m PUs. If the pixels are not available in on-chip memory, the MMU arbitrates through the global I/O for off-chip memory access. Concurrent memory accesses by PUs are handled by the MMU; for conflicts between MMUs, concurrent accesses are handled by the global I/O. A round-robin scheme is used for arbitration, similar to the method used for the texture unit (MMU) and fragment processors (PUs) in GPUs [8].
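A minimal C++ sketch of round-robin arbitration between requesters follows; the request/grant interface is an assumption made for illustration rather than the model's actual SystemC interface.

```cpp
#include <vector>

// Round-robin arbiter: grants one of the pending requesters per cycle,
// starting the search one position after the last grant so that every
// MMU (or PU) is served fairly. Returns -1 when nothing is requested.
class RoundRobinArbiter {
public:
    explicit RoundRobinArbiter(int nPorts) : n_(nPorts), last_(nPorts - 1) {}

    int grant(const std::vector<bool>& request) {
        for (int i = 1; i <= n_; ++i) {
            int port = (last_ + i) % n_;
            if (request[port]) { last_ = port; return port; }
        }
        return -1;  // no requester this cycle
    }

private:
    int n_;     // number of ports
    int last_;  // port granted on the previous cycle
};
```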
The initial on-chip memory setup is a 4-way associative 16 KByte cache with 64-pixel cache lines. For verification purposes the PU model is a simple linear model of the GPU fragment processor, which models the execution time and memory access pattern of an algorithm. This allows high-level architecture exploration whilst keeping the data level parallelism within each PU fixed. The initial processing order is chosen to be an iterative z-pattern [9], a commonly used memory addressing pattern which exploits the spatial locality of neighbouring pixels.
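The iterative z-pattern [9] corresponds to a Morton-order traversal; assuming that interpretation, a compact C++ sketch of the index-to-coordinate mapping a pattern generator could use is given below.

```cpp
#include <cstdint>

// Extract the even-numbered bits of a Morton (z-order) index.
// De-interleaving the index bits gives the (x, y) of the i-th pixel in
// the z-pattern, which keeps successive pixels spatially close and so
// improves on-chip cache hit rates.
static uint32_t evenBits(uint32_t v) {
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

// The i-th position along the z-order curve: x from the even bits of i,
// y from the odd bits.
void zOrderToXY(uint32_t i, uint32_t& x, uint32_t& y) {
    x = evenBits(i);
    y = evenBits(i >> 1);
}
```

Successive indices map to spatially neighbouring pixels, so consecutive PU requests tend to hit the same cache lines; for a non-square frame the curve would be applied per power-of-two tile.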

6.2. Model Verification and Scalability

In this section the model described in section 6.1 is verified and its scalability demonstrated. This is achieved by implementing three algorithms from Table 2 on the model. Figure 3 shows results for m = 4, n = 4, 6, 8 and a fixed frame size, with the trend for the GeForce 6800 GT performance plotted alongside. It can be seen that the trend of the model with m, n = 4 and the GeForce 6800 GT are similar.

The model with m, n = 4 has a higher cycle count than the GeForce 6800 GT for interpolation and 2D convolution. The reason is that the GPU fragment processors are highly multi-threaded to hide memory access latency. This is a limitation of the model, which does not include multi-threading. However, the aim is to model the high-level architecture features of the GPU and not the benefits of multi-threading.

Model variations with n = 6, 8 are shown to demonstrate scalability. Primary colour correction has high arithmetic intensity, and therefore its cycle count for n = 8 is almost half that for n = 4. The cycle count ratio for interpolation is 1.8 times, due to the behaviour of the on-chip memory (cache) with increasing n: as n increases, the situation arises where one PU may evict a data item from the cache while another PU still needs to reuse it. 2D convolution also scales by less than two times for a doubling of processing groups, and this effect becomes more prominent as the convolution kernel size increases.

6.3. On-Chip Memory Flexibility

The NFS-MVE benchmark requires nine passes for implementation on the GPU. This nine-pass GPU method is used to implement NFS-MVE on the model for m = 4 and n = 1, 2, 4. The results are shown in row one of Table 3. NFS-MVE is a memory intensive algorithm, therefore on-chip memory (cache) size is key to performance. Results for two- and four-fold increases in cache size, keeping all other factors constant, are also shown. The linear processing unit model from section 6.1 is maintained for all results.

| Cache Changes (size) | Model (m=4, n=1) | Model (m=4, n=2) | Model (m=4, n=4) |
|----------------------|------------------|------------------|------------------|
| Original (16k)       | 4.8 (17k)        | 3.6 (36k)        | 2.1 (70k)        |
| ×2 (32k)             | 4.6 (12k)        | 2.5 (17k)        | 1.4 (30k)        |
| ×4 (64k)             | 4.3 (9k)         | 2.2 (15k)        | 1.1 (16k)        |

Table 3. Model performance in clock cycles ×10^6 and (off-chip memory reads) for nine-pass NFS-MVE over a video frame

Table 3 demonstrates a key feature of the shared memory structure of the architecture of Figure 2: the effect the processing units have on each other by fetching new pixels into the cache. For memory intensive algorithms, increasing the number of PUs by a factor of two will not achieve a two-fold performance improvement; in some cases the improvement can be much less. This is demonstrated by the differences in model performance with varying n above. As the memory size is increased, the scaling of performance improvement with n also increases, because each processing unit now evicts less of the data that other PUs require. The off-chip memory read counts verify these observations: they reduce in proportion to the performance improvement between cache sizes.

6.4. A Controlled Processing Pattern

The benchmark implementations above were implemented in the same manner as on the GPU. A benefit of the controllable processing pattern of the multi-core model is now shown. The performance of histogram equalisation (HE) on the GPU was shown in Figure 1 to be inferior to the FPGA, due to the GPU's inflexibility in implementing a memory gather. Improved performance can be achieved on the model of Figure 2, over the GPU, because a PU can hold computation results between processed pixels. This is possible because the processing order is controllable. Consider distributing the histogram calculation across multiple PUs in Figure 2.
For n, m = 4 this equates to 16 accumulators in each processing unit to generate the 256-bin cumulative histogram. Each PU must now access the entire frame of pixels for the full histogram to be computed, which can be achieved through the correct choice of processing order (a code sketch of this arrangement follows at the end of this subsection). The result for HE calculation is shown in Table 4, which also shows the implementation of the original four-pass GPU technique on the model for comparison. For a fair comparison, the linear PU model of section 6.1 is maintained for both model implementations; the improved method therefore performs the accumulations in series.

Table 4. Model performance in clock cycles ×10^7 for HE calculation over a frame: the original four-pass method and the improved single-pass method, on the model and on the GeForce 6800 GT. (The numeric entries of this table did not survive extraction.)

To verify the improved model implementation, a frame sixteen times larger than the target frame size is rendered on the GeForce 6800 GT. This is used to model memory access and performance by modelling the compute latency of 16 accumulators on the GPU; the result is shown in Table 4. The improved implementation on the multi-core architecture provides a two-fold improvement over the original method. This is achieved by removing the multiple passes of the original HE design, keeping all other factors constant.
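A minimal C++ sketch of the improved single-pass arrangement described above, with the 16 PUs modelled serially (matching the linear PU model); the data types and loop structure are illustrative assumptions.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Improved single-pass histogram: with 16 PUs, PU p owns the 16 bins
// [16p, 16p + 15]. Every PU walks the entire frame (the controllable
// processing pattern guarantees this) but keeps its partial counts in
// local accumulators between pixels -- the ability a GPU fragment
// processor lacks.
std::array<uint32_t, 256> histogram16PUs(const std::vector<uint8_t>& frame) {
    std::array<uint32_t, 256> bins{};        // zero-initialised
    for (int pu = 0; pu < 16; ++pu) {        // 16 PUs, modelled serially
        int lo = pu * 16;
        for (uint8_t v : frame) {            // this PU's full-frame scan
            if (v >= lo && v < lo + 16)      // pixel in this PU's bin slice?
                ++bins[v];
        }
    }
    return bins;
}
```

A prefix sum over the 256 bins then yields the cumulative histogram used as the equalisation lookup table.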

6.5. Processing Unit (PU) Choice

The above analysis uses a fixed PU model to maintain an equal level of parallelism for each PU. Different PU options are now discussed by comparing the choice of a fragment processor PU with a reconfigurable datapath PU. A fragment processor can perform at peak eight (two 4-vector) FLOPs per cycle [6]; it exploits no inter-pixel data reuse and has a high core clock rate, as shown in Table 1. For reconfigurable datapaths all of these features are variables. The resource usages of four example FPGA (reconfigurable datapath) implementations, taken from section 5, are shown in Table 5. The number of BRAMs indicates the degree of data reuse; slice usage indicates the amount of parallelism. The clock rates of the FPGA implementations are lower than that of a fragment processor (the FPGA clock rate equals the FPGA throughput in Figure 1).

The first observation is a tradeoff between the high clock rate and fixed parallelism of a fragment processor against the arbitrary data reuse and parallelism of a reconfigurable datapath. These are characteristics of a hardware-software tradeoff.

| Benchmark | Slices | BRAMs |
|-----------|--------|-------|
| Bi-cubic Interpolation | 2.7k | 24 |
| 2D Convolution (5×5) | – | 12 |
| 2D Convolution (9×9) | – | 24 |
| 2D Convolution (11×11) | – | 30 |
| Primary Colour Correction | 3.6k | 0 |
| Histogram Equalisation (HE) | 10.6k | 0 |

Table 5. Slice count and Block RAM usage for the benchmark algorithms on the FPGA. (The convolution kernel sizes are inferred from Figure 1; their slice counts did not survive extraction.)

Fragment processors are designed to be scalable, as outlined in section 3. For a reconfigurable datapath PU, scalability depends on whether and how data reuse is exploited. The implementations of primary colour correction and HE on a reconfigurable datapath exploit no data reuse, which makes them inherently scalable. For 2D convolution and interpolation the data reuse strategy must be considered: for scalability, subsequent frames (or regions of a frame) must be assigned to different PUs, which can be implemented by controlling the processing pattern.

The design effort to implement the interpolation and 2D convolution algorithms in a reconfigurable datapath must be traded against their data reuse and parallelism benefits. The requirement for a complex heuristic to implement flexible data reuse in a reconfigurable datapath makes a fragment processor PU the suitable option for interpolation. The extra parallelism available from a reconfigurable datapath makes it the most desirable option for HE. For 2D convolution a regular data reuse pattern can be exploited, which also makes the reconfigurable datapath PU the superior option.

6.6. Generalisations from Model Changes

It has been shown that a subset of GPU pipeline features can be applied to a customisable multi-core model. A feed-through pipeline and no data sharing between processing units (PUs) are features key to multi-core scalability. Customisation of the number of PUs, the memory organisation and the processing pattern are necessary variables for a multi-core architecture. The generality of the multi-core architecture means that the option of a reconfigurable datapath PU can be considered. The choice between a fragment processor and a reconfigurable datapath PU is a tradeoff between data reuse (which influences scalability) and parallelism.

7. Conclusion

Five video processing algorithms have been implemented on the FPGA and GPU. Algorithms which require a regular data reuse pattern perform well on the FPGA; for irregular reuse patterns the GPU outperforms the FPGA. Histogram equalisation (HE) and NFS-MVE are two algorithms which require multi-pass GPU implementations. Although the GPU performance is 50 MP/s, an FPGA implementation of HE outperforms the GPU by three times.

A customisable multi-core model based on the GPU pipeline has been implemented using the SystemC library. Features of the GPU pipeline and model customisations are explored through the model. A feed-through pipeline and no data sharing between PUs are two GPU pipeline features which are key to multi-core scalability.
It is concluded that a multi-core architecture can be optimised for video processing by combining a GPU pipeline with cores that support reconfigurable datapath operations. Future work will involve the exploration of other multi-core architecture features, such as those of the STI Cell Broadband Engine and the GeForce 8800 GTX.

We gratefully acknowledge the support provided by the UK Engineering and Physical Sciences Research Council (EP/C549481/1) and Sony Broadcast & Professional Europe, and thank Jay Cornwall and Lee Howes of Imperial College London for providing GPU programming advice and the GLSLProgram class library.

References

[1] P. Sedcole. Reconfigurable Platform-Based Design in FPGAs for Video Image Processing. PhD thesis, University of London, Jan. 2006.
[2] B. Cope, P.Y.K. Cheung, W. Luk, and S. Witt. Have GPUs made FPGAs redundant in the field of video processing? In Proc. Field Programmable Technology, Dec. 2005.
[3] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A.E. Lefohn, and T.J. Purcell. A survey of general-purpose computation on graphics hardware. In Eurographics 2005, State of the Art Reports, Aug. 2005.
[4] P. Colantoni, N. Boukala, and J.D. Rugna. Fast and accurate color image processing using 3D graphics cards. In Proc. Vision, Modeling and Visualization, 2003.
[5] NVIDIA Corporation.
[6] M. Pharr and R. Fernando. GPU Gems 2. Addison Wesley, 2005.
[7] V. Moya, C. Gonzalez, J. Roca, and A. Fernandez. Shader performance analysis on a modern GPU architecture. In 38th Annual IEEE/ACM International Symposium on Microarchitecture, 2005.
[8] J. Lindholm, J. Nickolls, S. Moy, and B. Coon. Register based queuing for texture requests. United States Patent No. US 7,027,062 B2, 2006.
[9] C. Priem, G. Solanki, and D. Kirk. Texture cache for a computer graphics accelerator. United States Patent No. US 7,136,068 B1, 2006.


High Speed Special Function Unit for Graphics Processing Unit High Speed Special Function Unit for Graphics Processing Unit Abd-Elrahman G. Qoutb 1, Abdullah M. El-Gunidy 1, Mohammed F. Tolba 1, and Magdy A. El-Moursy 2 1 Electrical Engineering Department, Fayoum

More information

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit

More information

Aiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR.

Aiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR. 2015; 2(2): 201-209 IJMRD 2015; 2(2): 201-209 www.allsubjectjournal.com Received: 07-01-2015 Accepted: 10-02-2015 E-ISSN: 2349-4182 P-ISSN: 2349-5979 Impact factor: 3.762 Aiyar, Mani Laxman Dept. Of ECE,

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel

More information

Graphics Processing Unit Architecture (GPU Arch)

Graphics Processing Unit Architecture (GPU Arch) Graphics Processing Unit Architecture (GPU Arch) With a focus on NVIDIA GeForce 6800 GPU 1 What is a GPU From Wikipedia : A specialized processor efficient at manipulating and displaying computer graphics

More information

Specializing Hardware for Image Processing

Specializing Hardware for Image Processing Lecture 6: Specializing Hardware for Image Processing Visual Computing Systems So far, the discussion in this class has focused on generating efficient code for multi-core processors such as CPUs and GPUs.

More information

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing) ME 290-R: General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing) Sara McMains Spring 2009 Lecture 7 Outline Last time Visibility Shading Texturing Today Texturing continued

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information