A Pipelined Fast 2D-DCT Accelerator for FPGA-based SoCs

Antonino Tumeo, Matteo Monchiero, Gianluca Palermo, Fabrizio Ferrandi, Donatella Sciuto
Politecnico di Milano, Dipartimento di Elettronica e Informazione
Via Ponzio 34/5, 20133 Milano, Italy
E-mails: {tumeo, monchier, gpalermo, ferrandi, sciuto}@elet.polimi.it

Abstract

Multimedia applications, and in particular the encoding and decoding of standard image and video formats, are a typical target for Systems-on-Chip (SoC). The bi-dimensional Discrete Cosine Transform (2D-DCT) is a frequency transformation commonly used in graphic compression algorithms. Many hardware implementations, adopting disparate algorithms, have been proposed for Field Programmable Gate Arrays (FPGA). These designs focus either on performance or on area, and often do not succeed in balancing the two aspects. In this paper, we present the design of a fast 2D-DCT hardware accelerator for an FPGA-based SoC. The accelerator uses a single seven-stage 1D-DCT pipeline that alternates, in every cycle, the computation of the even and odd coefficients. In addition, it uses special memories to perform the transpose operations. Our hardware takes 80 clock cycles at 107 MHz to generate a complete 8x8 2D-DCT, from the writing of the first input sample to the reading of the last result (including the overhead of the interface logic). We show that this architecture provides an optimal performance/area ratio with respect to several alternative designs.

1 Introduction

Reconfigurable platforms have recently emerged as an important alternative to ASIC design, featuring significant flexibility and time-to-market improvements with respect to the conventional digital design flow [1]. In this context, several toolchains for the design and prototyping of Systems-on-Chip (SoC) have been presented [2, 3].
These tools make it possible to rapidly create systems composed of hard and soft core processors and a set of standard IP-cores interfacing with internal and external peripherals. In addition, the system can be tailored to the target application by including ad hoc coprocessors to accelerate the critical kernels.

This paper presents a novel hardware architecture for a fast 2D Discrete Cosine Transform accelerator. The basic idea is to exploit the symmetries of the algorithm to save area while still ensuring high performance. The architecture is targeted to work as a hardware accelerator for the Xilinx MicroBlaze soft core processor, and builds on the specifications of the connection with the processor to further optimize its operation. This design is a component of a complete HW/SW implementation of the JPEG encoding algorithm. The 2D-DCT is one of the most computationally intensive phases of the encoding process, and its acceleration noticeably reduces the overall execution time of the application.

The structure of this paper is the following. Section 2 discusses related work. The 2D-DCT and the Fast DCT algorithm are briefly reviewed in Section 3. The proposed architecture is described in Section 4. Results are discussed in Section 5. Finally, Section 6 concludes the paper.

F(u, v) = (Λ(u)Λ(v)/4) Σ_{i=0}^{7} Σ_{j=0}^{7} f(i, j) cos[(2i+1)uπ/16] cos[(2j+1)vπ/16]   (1)

Λ(k) = 1/√2 if k = 0, 1 otherwise   (2)

Figure 1. Equations for the 2D-DCT

2 Related Work

Several works proposing the architecture and high-level design of 2D-DCT cores have appeared. Xilinx [4] and Altera [5] offer, in their libraries, specific cores optimized for their programmable devices in terms of occupation. Nevertheless, these cores feature relatively low performance and, furthermore, they are not easy to integrate into System-on-Chip designs realized with the vendors' own toolchains. Many custom designs for FPGA have also been presented. Among them, Trainor et al. [6] propose an architecture with distributed arithmetic that exploits parallelism and pipelining. Agostini et al. [7] propose a 2D-DCT architecture based on the previous work of Kovac et al. [8]. The authors decompose the transform into two 1D-DCT calculations with a transpose buffer, thanks to the separability property. Their design is based on the Fast DCT algorithm and uses a six-stage Wallace tree multiplier, which decomposes each multiplication into shift and add operations. Nevertheless, since multipliers are nowadays embedded in FPGAs, this approach is no longer effective at reducing occupation. The global 2D-DCT latency is 160 clock cycles, and a complete 8x8 matrix is processed every 64 clock cycles. Our proposal is loosely inspired by this work; nevertheless, we propose several optimizations that achieve important advantages in terms of area and performance. In addition, Agostini's design is conceived for a fully HW implementation of the JPEG encoder, whereas our work targets a mixed HW/SW design, stressing the role of the interfaces to/from the processor. Yusof et al. [9] present a similar DCT architecture, integrated in a complex SoC targeted at image encoding. Finally, Bukhari et al. [10] present an architecture implementing a modified Loeffler algorithm (resulting in a faster but significantly larger implementation w.r.t. our proposal).
In addition, the authors show how the occupation of the accelerators can vary greatly when implemented on FPGAs from different vendors.

3 2D-DCT Overview

The DCT is a frequency transformation commonly adopted in compression algorithms, since it concentrates most of the information in a few low-frequency coefficients. Slightly different definitions of the transform exist; the bi-dimensional version, in its most widely used form for an 8x8 block of input samples, is shown in Figure 1. This equation has a high computational complexity: an 8x8 block requires 4096 multiplications and 4096 additions. Many optimizations have been proposed and, among them, in the field of image compression algorithms, the Fast DCT has been widely adopted. According to the Fast DCT algorithm, since the cosines depend only on the position of the samples in the 8x8 block, their values can be precomputed and the transform can be rewritten as a matrix multiplication, where the last matrix is the transpose of the first: T = C X C^T, where C is the matrix of the cosine values and X is the 8x8 input block. In addition, since the 2D-DCT is a separable operation, it can be computed by applying a 1D-DCT in one dimension (row-wise) and then applying another 1D-DCT to the results in the other dimension (column-wise). This decomposition reduces the complexity of the calculation by a factor of four. Applying both the 1D decomposition and the Fast DCT algorithm, only 80 multiplications and 464 additions are needed to compute the 2D-DCT of an 8x8 block, since each 1D-DCT on a vector of 8 elements requires 29 additions or subtractions and 5 multiplications. It is important to stress that the result of the Fast DCT algorithm is scaled; in the JPEG algorithm, for example, the correction can be folded into the quantization phase and performed in a single step together with quantization itself.
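The separability property described above can be checked numerically. The following sketch (an illustrative software model in Python with NumPy, not the hardware design; the function names are ours) computes the 2D-DCT of an 8x8 block both directly from Equation (1) and as two passes of a 1D-DCT with a transpose in between:

```python
import numpy as np

def dct_1d(v):
    """8-point 1D-DCT from the definition: F(u) = Λ(u)/2 · Σ f(i)·cos((2i+1)uπ/16)."""
    out = np.zeros(8)
    for u in range(8):
        lam = 1 / np.sqrt(2) if u == 0 else 1.0
        out[u] = lam / 2 * sum(v[i] * np.cos((2 * i + 1) * u * np.pi / 16)
                               for i in range(8))
    return out

def dct_2d_direct(f):
    """2D-DCT straight from Equation (1): 64 outputs x 64 terms = 4096 multiply-adds."""
    F = np.zeros((8, 8))
    for u in range(8):
        for v in range(8):
            lu = 1 / np.sqrt(2) if u == 0 else 1.0
            lv = 1 / np.sqrt(2) if v == 0 else 1.0
            F[u, v] = lu * lv / 4 * sum(
                f[i, j] * np.cos((2 * i + 1) * u * np.pi / 16)
                        * np.cos((2 * j + 1) * v * np.pi / 16)
                for i in range(8) for j in range(8))
    return F

def dct_2d_separable(f):
    """Separable form: 1D-DCT of each row, then 1D-DCT of each resulting column."""
    rows = np.array([dct_1d(r) for r in f])          # row-wise pass
    return np.array([dct_1d(c) for c in rows.T]).T   # column-wise pass

block = np.arange(64, dtype=float).reshape(8, 8)
assert np.allclose(dct_2d_direct(block), dct_2d_separable(block))
```

The separable form performs 16 one-dimensional transforms of 8 points each instead of 64 inner products of 64 terms, which is the factor-of-four reduction mentioned above.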
4 Architecture

The decomposition into two 1D computations leads to an architecture composed of two pipelined 1D units and an intermediate buffer for the transposition, as proposed in [7]. Nevertheless, this solution is not area efficient, since each 1D pipeline performs exactly the same operations. In addition, to allow the use of a global 2D-DCT pipeline, a special transpose buffer must be designed, since the first DCT produces row results while the second DCT needs column values as input. This memory should have ping pong 1 features, so that the first 1D unit can write new values while the second 1D unit reads the previous ones. This leads to even more space occupation on the FPGA. In particular, if latency is critical, these memories cannot be implemented with internal BRAMs and must be implemented as registers, which consumes many logic cells. The solution proposed in [7], which uses BRAMs, incurs a latency of 64 cycles to generate a full transposed matrix. BRAMs can also become a limiting factor, in particular if the 2D-DCT architecture needs to be integrated in a System-on-Chip with soft core processors, which need the BRAMs as fast data and instruction memories.

Our architecture has been designed considering that the resulting accelerator is to be connected to a soft core processor, the MicroBlaze [11] from Xilinx, our DCT module being part of a complete System-on-Chip for image encoding. The MicroBlaze, thanks to the Fast Simplex Links (FSL) [12], makes it possible to connect application-specific hardware accelerators using a point-to-point communication protocol via master/slave ports. Each communication primitive can transmit 32 bits from the register file of the MicroBlaze to the accelerator and vice versa. Since the values of the input samples in image compression are constrained to a range representable in 8 bits, a single FSL command can transmit up to 4 values per cycle.
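Since each FSL word is 32 bits wide and each sample fits in 8 bits, four samples travel per transfer. A minimal software sketch of this packing follows; note that the byte ordering chosen here is our own assumption for illustration, not one mandated by the FSL specification:

```python
def pack_samples(samples):
    """Pack four 8-bit samples into one 32-bit word for a single FSL transfer.
    Byte ordering (sample k in byte k, little-endian style) is assumed."""
    assert len(samples) == 4 and all(0 <= s < 256 for s in samples)
    word = 0
    for k, s in enumerate(samples):
        word |= s << (8 * k)   # place sample k in byte k of the word
    return word

def unpack_samples(word):
    """Recover the four 8-bit samples from a 32-bit FSL word."""
    return [(word >> (8 * k)) & 0xFF for k in range(4)]
```

With this scheme, an 8x8 block of 64 samples requires only 16 FSL writes from the processor.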
The next section provides more details on the architecture implementation.

1 We call ping pong memory a memory interposed between two blocks (A and B) that can alternately be written by A and read by B, or written by B and read by A.

Figure 2. The 2D-DCT architecture with a single 1D-DCT component

4.1 Implementation

We decided to implement an architecture that uses a single 1D-DCT pipeline, fed by a master FSL port, and a transpose memory that, as soon as the first one-dimensional transformation has been completed, feeds the transposed results back into the same pipeline. Giving up the option of a global 2D-DCT pipeline (as in [7]), we could implement this memory as a simple memory that is written row-wise and read column-wise. The second 1D-DCT is then performed, and the final results are stored in a secondary buffer before being transposed again and output to the slave FSL. Figure 2 shows an overview of the architecture.

As explained before, a single pipeline would require the execution of 29 additions/subtractions and 5 multiplications. Observing that the odd and even coefficients of the resulting 8-sample transformed vector require different types of computation, we organized the pipeline in seven stages. In this way, we reduced the number of adders/subtractors to 19 and the number of multipliers to 4: the pipeline alternates, each cycle, between the values needed to compute the odd and the even coefficients of the resulting vector. The organization of our seven-stage pipeline is shown in Figure 3. The FSL connection can feed four 8-bit values per cycle, and all 8 input samples are needed for both the odd and the even output samples. For these reasons, we implemented a pseudo ping-pong buffer (now at the input) partitioned in two parts of four values, in order to maintain the same values for two consecutive clock cycles. It is also important to stress that the DCT extends the range of the output values.
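The dataflow just described, one 1D-DCT unit used twice with a transpose memory written row-wise and read column-wise in between, can be sketched as a behavioral Python model (`dct_1d` is a stand-in for the hardware pipeline, not its implementation):

```python
import numpy as np

def dct_2d_single_pipeline(block, dct_1d):
    """Behavioral model of the single-pipeline organization: the same 1D-DCT
    function is applied twice, with a transpose memory in between."""
    transpose_mem = np.zeros((8, 8))
    # First pass: each row of the input goes through the pipeline;
    # results are written into the transpose memory row-wise.
    for r in range(8):
        transpose_mem[r, :] = dct_1d(block[r, :])
    # Second pass: the memory is read column-wise and fed back to the
    # same pipeline; results land column-wise in the output buffer.
    result = np.zeros((8, 8))
    for c in range(8):
        result[:, c] = dct_1d(transpose_mem[:, c])
    return result

# Sanity check against the matrix form T = C X C^T from Section 3.
C = np.array([[(np.sqrt(0.5) if u == 0 else 1.0) / 2
               * np.cos((2 * i + 1) * u * np.pi / 16)
               for i in range(8)] for u in range(8)])
X = np.arange(64, dtype=float).reshape(8, 8)
assert np.allclose(dct_2d_single_pipeline(X, lambda v: C @ v), C @ X @ C.T)
```

Reading the memory column-wise performs the transposition implicitly, which is what lets one 1D unit replace the two-unit arrangement of [7].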
Figure 3. The seven stages of the 1D-DCT pipeline, with 19 adders/subtractors and 4 multipliers. Note that the latches between stages are not drawn, in order to show how the different functional units are connected

Because the DCT extends the dynamic range, the initial 8-bit values become, at the end of a 1D-DCT, values valid on 16 bits. In order not to lose precision across the multiple passes of a 2D-DCT, it is important to represent the intermediate results between the first and the second 1D-DCT in a fixed-point format with at least 24 bits (8 bits for the fractional part). Our 1D-DCT pipeline accounts for this: each computation is performed at 24-bit precision, and the transpose memory stores 24-bit values. The final results buffer, instead, saves only the integer part of the numbers, in 16-bit format. Therefore, the effective output rate of the complete 2D-DCT is two 16-bit values per clock cycle.

4.2 Interfaces

The input logic starts receiving data from the processor master port and feeds the ping-pong buffer, and then the pipeline, as soon as the first group of four samples is available. The output logic waits until the full 8x8 block has completed the two 1D phases and the result has been stored to the memory. It then starts sending the results, grouped as two 16-bit values each, to the processor. The MicroBlaze, which, after sending the input samples, waits for a block to receive (MicroBlaze blocking read), finally starts reading the results.

Table 1. Resource utilization of the Optimized Fast 2D-DCT hardware accelerator on the Xilinx XC2VP30 FPGA

Resource          Used   Available  Utilization
Slices            2823   13696      20%
Slice Flip Flops  3431   27392      12%
4-input LUTs      2618   27392      9%

Starting from the loading of the first group of four input samples to the reading of the last group of two results, the IP core takes 80 cycles: 48 cycles are used to manage the interfaces and the ping-pong buffer, while 32 cycles are used for the actual computation.

5 Evaluation

In Table 1, we show the occupation of our 2D-DCT accelerator on a Xilinx XC2VP30 device.
With Xilinx ISE 8.2, our IP core is synthesized at 107 MHz. Compared to the Xilinx [4] solution, our core has an occupation around 2.5 times higher, but the Xilinx IP core does not include input and output logic for a standard bus and is much slower, since it has an initial latency of 92 cycles and then produces just one sample every cycle. This is due to the fact that it is realized by combining 8 FIR filters to produce a single sample. Moreover, its area values refer to a standard, not-customized core, and thus to an 8-bit input and 9-bit output range, clearly not ready for JPEG encoding.

Compared to Agostini's [7] architecture, which uses full Fast 1D-DCT components, our solution uses fewer multipliers and adders/subtractors at the cost of a single additional pipeline stage (seven compared to their six). In addition, they adopt a solution with two 1D-DCT elements, while our IP core has one that gets reused. They try to use less area by implementing the multiplications with a Wallace tree but, since modern FPGAs have embedded multipliers, this is no longer an attractive solution; it can even lead to higher occupation. Moreover, each stage of their pipeline needs eight clock cycles to complete, so the initial latency is 48 cycles for a single 1D-DCT. The transpose memory requires 64 more cycles to complete the transpose operation, which leads to a global latency of 160 cycles. After the pipeline has been filled, however, each 8x8 block comes out at a full 64-cycle rate.

Finally, Bukhari's [10] IP core uses fewer adders/subtractors but many more multipliers (11) than our solution for a single 1D-DCT element, due to the adoption of the Loeffler algorithm. A single 1D-DCT is computed for 8 input samples in a single clock cycle, so the full 2D-DCT needs 16 cycles to complete. The complexity of each stage, however, does not allow more than 54 MHz in synthesis, and the occupied area, even without the logic to interface to a standard processor bus, is already higher.

Figure 4 shows the area/delay scatter plot for the four solutions, normalized with respect to the standard Xilinx IP core. It can be seen that the Xilinx solution, our Optimized Fast 2D-DCT architecture, and Bukhari's solution are Pareto-optimal, lying on the same constant area/delay curve.
Nevertheless, our proposal balances area and delay well, unlike the Xilinx and Bukhari solutions. Agostini's architecture, which uses an organization similar to ours, features larger delay and area; our work effectively optimizes this organization for both area and delay.

Figure 4. Area/Delay comparison of the four IP cores

Table 2 reports the results obtained by executing the full JPEG encoding algorithm (including the reading of the input file and the saving of the output) on two different architectures, for a 160x120 pixel image. The first solution executes the encoding completely in software; it is easy to see that the DCT calculation, performed with a Fast DCT software implementation, accounts for almost 20% of the application. The second architecture instead uses our Optimized 2D-DCT core to execute the transform. The numbers show that the 2D-DCT hardware accelerator is two orders of magnitude faster than the software implementation, giving a speedup of 138.4. It is also interesting to note that, with the MicroBlaze architecture and the JPEG implementation adopted, the DCT phase is only the second most computationally intensive phase of the algorithm; since this work focuses on the 2D-DCT hardware accelerator, we did not optimize the RGB to YUV phase. The inclusion of the IP core nullifies the weight of the DCT phase in the application, giving a global speedup of 1.2.

6 Conclusions

In this paper we presented a novel architecture for the Fast 2D-DCT algorithm. The proposed solution is optimized from the area/performance point of view: it exploits the symmetries of the algorithm to minimize the number of functional units. Furthermore, the core has been designed to act as an application-specific IP for the MicroBlaze soft core processor and, taking into account the features and the limitations of its communication system, the architecture has been even more
optimized.

Table 2. Comparison, in clock cycles, of the JPEG algorithm executed on a MicroBlaze architecture with and without the Optimized Fast 2D-DCT hardware accelerator

Phase             Full SW          HW/SW
File reading      133,375,241      137,566,414
RGB to YUV        1,575,687,380    1,586,965,423
Exp & Downsample  2,013,185        2,013,435
Set quant. table  74,711           98,242
DCT               585,084,357      4,227,699
Quantization      354,084,692      339,500,870
Entropic coding   461,738,243      465,292,474
Total             3,112,057,809    2,535,664,559

Our Fast 2D-DCT hardware accelerator adopts a single 1D-DCT element with a seven-stage pipeline that encompasses 19 adders/subtractors and 4 multipliers. Compared to other designs in the literature, it satisfies the requirement of low occupation without sacrificing performance. When introduced in a complete System-on-Chip architecture, it executes two orders of magnitude faster than a software implementation. Overall, it makes the execution of the full JPEG encoding algorithm 20% faster on a standard MicroBlaze system, with a reduced impact on occupation.

References

[1] Frank Vahid. The softening of hardware. Computer, 36(4):27-34, 2003.
[2] Altera System-on-a-Programmable-Chip (SOPC) Builder. Altera Corporation.
[3] Xilinx Embedded Development Kit (EDK). Xilinx Corporation.
[4] Xilinx XAPP610: Video Compression Using DCT, application note. Xilinx Corporation, available at http://www.xilinx.com.
[5] Altera MegaCore Digital Library. Altera Corporation.
[6] D.W. Trainor, J.P. Heron, and R.F. Woods. Implementation of the 2D DCT using a Xilinx XC6264 FPGA. In Signal Processing Systems, 1997 (SIPS 97 - Design and Implementation), 1997 IEEE Workshop on, pages 541-550, Leicester, UK, November 1997.
[7] L.V. Agostini, I.S. Silva, and S. Bampi. Pipelined fast 2D DCT architecture for JPEG image compression. In Integrated Circuits and Systems Design, 2001, 14th Symposium on, pages 226-231, Pirenopolis, Brazil, 2001.
[8] M. Kovac and N. Ranganathan. JAGUAR: a fully pipelined VLSI architecture for JPEG image-compression standard. Proceedings of the IEEE, 83(2):247-258, February 1995.
[9] Z.M. Yusof, Z. Aspar, and I. Suleiman. Field programmable gate array (FPGA) based baseline JPEG decoder. In TENCON 2000 Proceedings, volume 3, pages 218-220, Kuala Lumpur, Malaysia, 2000.
[10] K.Z. Bukhari, G.K. Kuzmanov, and S. Vassiliadis. DCT and IDCT implementations on different FPGA technologies. In Proceedings of ProRISC 2002, pages 232-235, November 2002.
[11] MicroBlaze Processor Reference Guide. Xilinx Corporation.
[12] Fast Simplex Link (FSL) Bus (v2.00a), Reference Guide. Xilinx Corporation.