DISCRETE COSINE TRANSFORM (DCT) is a widely

Size: px

Start display at page:

Download "DISCRETE COSINE TRANSFORM (DCT) is a widely"

Magdalene Matthews
6 years ago
Views:

1 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL 20, NO 4, APRIL A High Performance Video Transform Engine by Using Space-Time Scheduling Strategy Yuan-Ho Chen, Student Member, IEEE, and Tsin-Yuan Chang, Member, IEEE Abstract In this paper, a spatial and time scheduling strategy, called the space-time scheduling (STS) strategy, that achieves high image resolutions in real-time systems is proposed The proposed spatial scheduling strategy includes the ability to choose the distributed arithmetic (DA)-precision bit length, a hardware sharing architecture that reduces the hardware cost, and the proposed time scheduling strategy arranges different dimensional computations in that it can calculate first-dimensional and second-dimensional transformations simultaneously in single 1-D discrete cosine transform (DCT) core to reach a hardware utilization of 100% The DA-precision bit length is chosen as 9 bits instead of the traditional 12 bits based on test image simulations In addition, the proposed hardware sharing architecture employs a binary signed-digit DA architecture that enables the arithmetic resources to be shared during the four time slots For this reason, the proposed 2-D DCT core achieves high accuracy with a small area and a high throughput rate and is verified using a TSMC 018- m 1P6M CMOS process chip implementation Measurement results show that the core has a latency of 84 clock cycles with a 52 db peak-signal-to-noise-ratio and is operated at 167 MHz with 158 K gate counts Index Terms Binary signed-digit (BSD), discrete cosine transform (DCT), distributed arithmetic (DA)-based, space-time scheduling (STS) I INTRODUCTION DISCRETE COSINE TRANSFORM (DCT) is a widely used transform engine for image and video compression applications [1] In recent years, the development of visual media has been progressed towards high-resolution specifications, such as high definition television (HDTV) Consequently, a high-accuracy and high-throughput rate component is needed to meet future specifications In addition, in order to reduce the manufacturing costs of the integrated circuit (IC), a low hardware cost design is also required Therefore, a high performance video transform engine that utilizes high accuracy, a small area, and a high-throughput rate is desired for VLSI designs The 2-D DCT core design has often been implemented using either direct [2] [4] or indirect [5] [13] methods The direct Manuscript received February 22, 2010; revised August 09, 2010 and December 30, 2010; accepted January 24, 2011 Date of publication February 28, 2011; date of current version March 12, 2012 This work was supported in part by the Chip implementation Center and National Science Council under project number CIC T18-98C-12a and NSC E , respectively The authors are with the Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan ( yhchen@larceenthu edutw) Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TVLSI methods include fast algorithms that reduce the computation complexity by mapping and rotating the DCT transform into a complex number [3] However, the structure of the DCT is not as regular as that of the fast Fourier transform (FFT) A regular 2-D DCT core using the direct method that derives the 2-D shifted FFT (SFFT) from the 2-D DCT algorithm format, and shares the hardware in FFT/IFFT/2-D DCT computation is implemented in [4] On the other hand, the 2-D DCT cores using the indirect method are implemented based on transpose memory and have the following two structures: 1) two 1-D DCT cores and one transpose memory (TMEM) and 2) a single 1-D DCT core and one TMEM In the first structure, the 2-D DCT core has a high throughput rate, because the two 1-D DCT cores compute the transformation simultaneously [5] [9] In order to reduce the area overhead, a single 1-D DCT core is applied in the second structure [10] [13], thereby saving hardware costs Hsia et al [10] present a 2-D DCT with a single 1-D DCT core and one custom transposed memory that can transpose data sequences using serial-input and parallel-output and requires only a small area, but the speed is reduced because the single DCT core performs the first-dimensional (1st-D) and second-dimensional (2nd-D) transformations at different times Therefore, Guo et al use two parallel paths to deal with the reduction of throughput rate in the second structure [11] However, the additional input buffers are needed in order to temporarily store the input data during the 2nd-D transformation Madisetti et al present a hardware sharing technique to implement the 2-D DCT core using a single 1-D DCT core [13] The 1-D DCT core can calculate the 1st-D and 2nd-D DCT computations simultaneously, and the throughput achieves 100 Mpels/s However, multiplier-based processing element requires a large amount of circuit area in [13] As a result, it is obvious that a tradeoff is required between the hardware cost and the speed Tumeo et al find a balance point between the area required and the speed for 2-D DCT designs [14] and present a multiplier-based pipeline fast 2-D DCT accelerator implemented using a field-programmable gate arrays (FPGA) platform [14] Additionally, several 2-D DCT designs are implemented based on FPGA for fast verification [14] [17] In this paper, an D DCT core that consists of a single 1-D DCT core and one TMEM is proposed using a strategy known as space-time scheduling (STS) Due to the accuracy simulations in DA-based binary signed-digit (BSD) expression, 9-bit DA precision is chosen in order to meet the requirements of the peak-signal-to-noise-ratio (PSNR) outlined in previous works [5] [7] Furthermore, the proposed DCT core is designed based on a hardware sharing architecture with BSD DA-based computation so as to reduce the area cost The arithmetics share the hardware resources during the four time slots in the DCT /$ IEEE

2 656 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL 20, NO 4, APRIL 2012 core design A 100% hardware utilization is also achieved by using the proposed time scheduling strategy, and the 1st-D and 2nd-D DCT computations can be calculated at the same time Therefore, a high performance transform engine with high accuracy, small area, and a high-throughput rate has been achieved in this research This paper is organized as follows In Section II, the mathematical derivation of the BSD format distributed arithmetic is given The proposed D DCT architecture that includes an analysis of the coefficient bits, the hardware sharing architecture, and the proposed timing scheduling strategy is discussed in Section III Comparisons and discussions are presented in Section IV, and conclusions are drawn in Section V where for, and for Consider the 1-D 8-point DCT first where for, for non-zero, and and denote the input data and the transform output, respectively By neglecting the scaling factor 1/2, the 1-D 8-point DCT in (4) can be divided into even and odd parts, and, as listed in (5) and (6), respectively (4) II MATHEMATICAL DERIVATION OF BSD FORMAT DISTRIBUTED ARITHMETIC The inner product for a general matrix multiplication-andaccumulation can be written as follows: (5) (1) (6) where is a fixed coefficient and is the input data Assume that the coefficient is an -bit BSD number Equation (1) can be expressed as follows: where and (7) (8) where,, and The matrix is called the DA coefficient matrix In (2), can be calculated by adding or subtracting the with, and then the transform output can be obtained by shifting and adding every non-zero Thus, the inner product computation in (1) can be implemented by using shifters and adders instead of multipliers Therefore, a smaller area can be achieved by using BSD DA-based architecture III PROPOSED D DCT CORE DESIGN This section introduces the proposed 8 implementation The 2-D DCT is defined as (2) 8 2-D DCT core For the 2-D DCT core implementation, the three main strategies for increasing the hardware and time utilization, called the space-time scheduling (STS) strategy, are proposed and listed as follows: find the DA-precision bit length for the BSD representation to achieve system accuracy requirements; share the hardware resource in time to reduce area cost; plan an 100% hardware utilization by using time scheduling strategy A Analysis of the Coefficient Bits The seven internal coefficients from to for the 2-D DCT transformation are expressed as BSD representations in order to save computation time and hardware cost, as well as to achieve the requirements of PSNR The system PSNR is defined [1] as follows: (9) (10) (3) where the is the mean-square-error between the original image and the reconstructed image for each pixel

CHEN AND CHANG: HIGH PERFORMANCE VIDEO TRANSFORM ENGINE 657 Fig 1 Simulation for system PSNR in different BSD bit length expressions TABLE I NORMALIZED HW COST IN DIFFERENT BSD BIT LENGTH EXPRESSIONS

PSNR After inputting the original test image pixels to the forward and inverse DCT transform, the system PSNR vs the different BSD bit length expressions is illustrated in Fig 1(a), and the

for the different BSD bit length expressions is shown in Table I The HW cost is synthesized in area for each BSD bit length using a Synopsys Design Compiler with the Artisan TSMC 018- m Standard cell

3 CHEN AND CHANG: HIGH PERFORMANCE VIDEO TRANSFORM ENGINE 657 Fig 1 Simulation for system PSNR in different BSD bit length expressions TABLE I NORMALIZED HW COST IN DIFFERENT BSD BIT LENGTH EXPRESSIONS TABLE II 9-BIT BSD EXPRESSIONS FOR DCT COEFFICIENTS First, the BSD coefficients for different bit length expressions use the popular test images, Lena, Peppers, and Baboon, to simulate the system PSNR After inputting the original test image pixels to the forward and inverse DCT transform, the system PSNR vs the different BSD bit length expressions is illustrated in Fig 1(a), and the reconstructed images, ie, that have passed through the forward and inverse DCT for different BSD bit lengths, are also shown in Fig 1(b) The normalized hardware (HW) cost of processing element (PE) for the different BSD bit length expressions is shown in Table I The HW cost is synthesized in area for each BSD bit length using a Synopsys Design Compiler with the Artisan TSMC 018- m Standard cell library The gap in accuracy between the 8-bit and the 9-bit BSD expressions is shown in Fig 1(a), and the HW cost increases when the BSD bit length increases Consequently, the 9-bit BSD expression coefficient is chosen due to the tradeoff between the hardware cost and the system accuracy The BSD expressions and the accuracy of the DCT coefficients are listed in Table II, where the coefficient in Table II indicates the BSD number 1 The coefficient accuracy is defined as the signal-to-quantization-noise ratio (SQNR) shown in (11) (11) where and are the power of the desired floating-point signal and the quantization noise in this case, respectively B Hardware Sharing Strategy All modules in the 1-D DCT core, including the modified two-input butterfly (MBF2), the pre-reorder, the process element even (PEE), the process element odd (PEO), and the postreorder share the hardware resources in order to reduce the area cost Moreover, the D DCT core is implemented using a single 1-D DCT core and one TMEM The architecture of the 2-D DCT core is described in the following sections 1) Modified Butterfly Module: Equation (9) is easily implemented using a two-input butterfly module [18] called BF2 In general, the BF2 has a hardware utilization rate in the adder and subtracter of 50% In order to enable the hardware resources to be shared, additional multiplexers and Reorder Registers are added to the proposed MBF2 module as illustrated in Fig 2 The Reorder Registers consist of four word registers that use the control signals to select the input data and use enable signals to output or hold the data Similar to BF2, the operation of the proposed MBF2 has an eight-clock-cycle period In the first four cycles, the 1st-D input data shift into Reorder Registers 1, and the 2nd-D input data execute the operation of addition and substraction In the next four cycles, the operations of 1st-D and 2nd-D will be changed The 1st-D data calculate and in (9) by using adder and subtracter In the second four cycles, the results are fed into next stage, and the

658 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL 20, NO 4, APRIL 2012 TABLE III COEFFICIENTS AND INPUT DATA FROM MBF2 FOR THE FOUR SEPARATE TIME SLOT OUTPUTS TABLE IV EVEN

same time, the 2nd-D input data shift to Reorder Registers 2 Therefore, the adder and subtracter in MBF2 employ the operations of 1st-D and 2nd-D in turns to achieve 100% hardware utilization 2)

transform output can be calculated using four separate time slots in order to share the hardware where the time slot, the coefficient elements,, and the transform inputs from the MBF2 stage and

4 658 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL 20, NO 4, APRIL 2012 TABLE III COEFFICIENTS AND INPUT DATA FROM MBF2 FOR THE FOUR SEPARATE TIME SLOT OUTPUTS TABLE IV EVEN PART TRANSFORMATION TABLE Fig 2 Proposed MBF2 architecture resources Hence, (12) and (13) can be expressed as (14) and (15), respectively results shift to Reorder Registers 1 in each cycle At the same time, the 2nd-D input data shift to Reorder Registers 2 Therefore, the adder and subtracter in MBF2 employ the operations of 1st-D and 2nd-D in turns to achieve 100% hardware utilization 2) Pre-Reorder Module: In (5), the even part transform output can be modified as follows: (14) (15) (12) Also, the odd part transform output in (6) can be rewritten as (13) From (12) and (13), the transform output can be calculated using four separate time slots in order to share the hardware where the time slot, the coefficient elements,, and the transform inputs from the MBF2 stage and Equations (14) and (15) for the four time slots are summarized in Table III The proposed Pre-Reorder module follows the order of the inputs, and ) depending on each time slot indicated in Table III The hardware resources can be shared by reordering the inputs during the four separate time slots 3) Process Element Module: For DA-based computation, the even part and odd part transformations can be implemented using PEE and PEO, respectively From Table III, the even part transformation can be expanded for the DA-based computation formats and can share the hardware resources at the bit level The coefficient vector has two combinations of and

CHEN AND CHANG: HIGH PERFORMANCE VIDEO TRANSFORM ENGINE 659 Fig 4 Proposed post-reorder architecture tency of 52 cycles Based on the time scheduling strategy described in Section III-C, a hardware

data,,, and, the transform output needs only three adders (to compute,, and ), a Data-Distributed module, and one even part adder-tree (EAT) to obtain the result for during the four time slots The

5 CHEN AND CHANG: HIGH PERFORMANCE VIDEO TRANSFORM ENGINE 659 Fig 4 Proposed post-reorder architecture tency of 52 cycles Based on the time scheduling strategy described in Section III-C, a hardware utilization of 100% can be achieved Fig 3 Architecture for the proposed PEE and PEO for and transform outputs, respectively Table IV expresses (14) in the bit level formulation Using given input data,,, and, the transform output needs only three adders (to compute,, and ), a Data-Distributed module, and one even part adder-tree (EAT) to obtain the result for during the four time slots The Data-Distributed module consists of seven two-input multiplexers that are used to appoint different non-zero values for each weighted input during the four separate time slots, as shown in Table IV, and the EAT sums the different weighted values in tree-like adders to complete the transform output Similarly, the odd part transformation in (15) can be implemented in a DA-based format using four adders and one odd part adder-tree (OAT) The EAT and OAT can be implemented using the error-compensated adder tree [19] to improve the computation accuracy Moreover, there are three pipeline stages (two in both the EAT and the OAT) in each PEE and PEO module that enable high speed computation to be achieved The proposed architecture for both PEE and PEO is shown in Fig 3 4) Post-Reorder Module: For the last stage of the 1-D DCT computation, the data sequence after the PEE and PEO must be merged and repermuted in the Post-Reorder module The order of the original 8-point data sequence from the PEE and the PEO is After the Post-Reorder module permutes the data order, the data sequence is ordered as Fig 4 shows the proposed Post-Reorder architecture Two multiplexers select the data that is fed into the different Reorder Registers in order to permute the output order is the transform output for the 1st-D DCT that will input into the TMEM In addition, the 2nd-D DCT transform output is completed after the permutation by Reorder Registers 4 5) 2-D DCT Core Architecture: To save hardware costs, the proposed 2-D DCT core, as shown in Fig 5, is implemented using a single 1-D DCT core and one TMEM The 1-D DCT core includes an MBF2, a Pre-Reorder module, a PEE, a PEO, a Post-Reorder module, and one TMEM The TMEM is implemented using 64-word 12-bit dual-port registers and has a la- C Time Scheduling Strategy As a result of the time scheduling strategy, the 1st-D and 2nd-D transforms are able to be computed simultaneously, which increases hardware utilization be 100% The timing flow chart for the proposed strategy is illustrated in Fig 6 1) 1st-D Data Computation: In the first four cycles, the first four-point input data shift into the Reorder Register 1 During cycles 5 8, the MBF2 performs additions and subtractions using the first eight-point input data In the 9th cycle, the even-part of Pre-Reorder module obtains and reorders the input data output to the PEE based on Table III During cycles 10 13, the is calculated in the PEE module Based on the latency of the three pipeline stages in the PEE, is completed during cycles Also, the is performed in the PEO module during cycles and is finished in cycles Therefore, after the 16th cycle, the Post-Reorder module permutes the 1st-D transform results and inputs them into the TMEM 2) 2nd-D Data Computation: From the 17th cycle, the 1st-D transform data is input into the TMEM Due to the latency of the TMEM 52 cycles, the 2nd-D computation data transposed by the TMEM is sent to the input of Reorder Registers 2 in MBF2 at the 69th cycle Then, the adder and subtracter are operated during cycles to compute the first 2nd-D 8-point data sequences, and the first 2nd-D data is permuted by Pre-Reorder module during cycles The PEE and PEO calculate the first 2nd-D data during cycles and 82 85, respectively The Post-Reorder module finishes the 2nd-D transform results from the 84th cycle In addition, the 2-D DCT transform output data is obtained at the end of 84th cycle 3) Hardware Utilization: In Fig 6, the adder and subtracter in MBF2 is at 50% utilization during cycles 1 68 However, after the 68th cycle, the MBF2 achieves a 100% hardware utilization as a result of the 2nd-D data being input At the same time, the utilization in the PEE and PEO also reaches 100% from 74th and 78th cycles, respectively The Post-Reorder module does not work between cycles 1 12, but after the 16th cycle the 1st-D transform output are completed using Reorder Registers 3 in the Post-Reorder Similarly, the 2nd-D transform result is finished using Reorder Registers 4 in the Post-Reorder from the 84th cycle At the end of 84th cycle, the 1st-D and 2nd-D transform outputs and are obtained simultaneously Hence, the hardware utilization increases to 100% after the 84th

660 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL 20, NO 4, APRIL 2012 Fig 5 Proposed 2-D DCT architecture Fig 6 Timing scheduling for the proposed 2-D DCT core cycle, and the

straightforward In summary, following an analysis of the accuracy analysis of the coefficients, the DA-precision bit length was chosen as a 9-bit BSD expression In this way, the hardware cost is

6 660 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL 20, NO 4, APRIL 2012 Fig 5 Proposed 2-D DCT architecture Fig 6 Timing scheduling for the proposed 2-D DCT core cycle, and the latency of the proposed D DCT core is 84 clock cycles Based on the proposed time scheduling strategy, a 100% hardware utilization is achieved, and computation of the data sequence is straightforward In summary, following an analysis of the accuracy analysis of the coefficients, the DA-precision bit length was chosen as a 9-bit BSD expression In this way, the hardware cost is reduced by using the BSD 9-bit DA-precision instead of the 12-bit used in previous works Using the proposed hardware sharing strategy, the DCT computation shares the hardware resources during the four time slots, thereby reducing the area cost In addition, a 100% hardware utilization can be achieved based on the proposed time scheduling strategy, effectively achieving a highthroughput-rate design Therefore, a high-accuracy, small-area, and high-throughput-rate 2-D DCT core has been obtained by applying the proposed STS strategy IV DISCUSSION AND COMPARISONS In this section, the system accuracy test for the proposed 2-D DCT core is discussed Then, a comparison of the proposed 2-D DCT core with previous works is also presented Finally, the characteristics of the implemented chip and FPGA are described A System Accuracy Test Seven test images used to check system accuracy are all pixels in size, with each pixel being represented by 8-bit 256 gray level data After the original test image pixels are fed

CHEN AND CHANG: HIGH PERFORMANCE VIDEO TRANSFORM ENGINE 661 Fig 7 Accuracy verification flow for proposed DCT core TABLE V PSNR PERFORMANCE FOR EACH TEST PATTERNS TABLE VI PSNR VERSUS QF FOR THE

double-precision operations The verification flow follows the precedent of previous works ([4], [7], and [20]) test flow illustrated as Fig 7 The PSNR values of each test images verified by the test

7 CHEN AND CHANG: HIGH PERFORMANCE VIDEO TRANSFORM ENGINE 661 Fig 7 Accuracy verification flow for proposed DCT core TABLE V PSNR PERFORMANCE FOR EACH TEST PATTERNS TABLE VI PSNR VERSUS QF FOR THE PROPOSED 2-D DCT APPLYING TO JPEG COMPRESSION FLOW into the proposed 2-D DCT core, the transform output data is captured and passed to MATLAB tool in order to compute the inverse DCT using 64-bit double-precision operations The verification flow follows the precedent of previous works ([4], [7], and [20]) test flow illustrated as Fig 7 The PSNR values of each test images verified by the test flow are listed in Table V, and the average PSNR closes to 52 db using the proposed D DCT core Furthermore, the proposed 2-D DCT core was introduced to enable utilization of JPEG compression flow [21], [22] to evaluate the proposed 2-D DCT applied in real compression systems The quality factor (QF) dominates the compression quality and data size, and therefore QF affects the system PSNR of the JPEG compression Table VI lists the PSNR versus QF for the proposed 2-D DCT core applying to JPEG, and Fig 8 illustrates both the original and the compressed versions of the test image Lena using the JPEG compression flow The quality of the compressed image is excellent B Comparison With Other 2-D DCT Architectures Table VII compares the proposed D DCT core with previous works In [4], Lin et al derived an SFFT format from a 2-D DCT and shared the hardware resources for the FFT/ IFFT/2-D DCT computation in a triple-mode processor design The 2-D DCT core is implemented in a direct 2-D method based on a pipeline radix- single delay feedback path architecture In [7], a 2-D DCT core used NEDA architecture with two 1-D cores and one TMEM to achieve a low-cost design Speed limitations exist in the operations of serial shifting and addition in the NEDA operations Huang et al improved the throughput rate using adder-tree (AT) instead of shifting and addition operations for NEDA architecture [9] However, the parallel input and output in the NEDA architecture required a large amount of I/O resources The 2-D DCT core using a single 1-D core and one TMEM for a multiplier-based DCT implementation was addressed in [10] An area-saving serial-input and parallel-output transpose memory was also implemented in the 2-D DCT core to achieve a small-area design operated at frequency of 55 MHz However, the 1-D core could not compute 1st-D and 2nd-D transformation simultaneously Hence, the throughput rate was half of the operation frequency Therefore, a hardware sharing technique was addressed in the 2-D DCT core to deal with throughput rate reduction, and the 2-D DCT core achieved a throughput of 100 Mpels/s [13] The proposed 2-D DCT core employs a single 1-D core and one TMEM in order to reduce the core area Also, the 1-D core utilizes 9 adders and 2 adder-trees (AT) using the proposed hardware sharing architecture in order to reduce hardware cost Meaning that the throughput rate of the proposed 2-D DCT core is enhanced based on the time scheduling strategy In this way, the proposed 2-D core is able to compute 1st-D and 2nd-D transformation simultaneously, and a 167 MHz throughput rate is achieved Therefore, the proposed 2-D DCT core has the highest hardware efficiency, defined as follows: Hardware Efficiency pels/s-gate (16) A comparison with previous works is summarized in Table VII Noted that previous works presented only simulation results, while proposed work has chip implementation with design for test (DFT) and measured results For fair comparisons, the simulation results of proposed work are also shown (Remove DFT considerations in chip implement that needs extra area for memory built-in self-test circuit and scan flip-slops) The gate count of STS DCT without DFT insertion is 122 K and the hardware efficiency of STS DCT is equal to 1369 that is the largest in the comparison Table VII On the other hand, the proposed 2-D DCT achieves 52 db PSNR value, that is the medium accuracy compared with other pervious works, verified by the test flow in Fig 7 The power consumption is another issue in the circuit designs There are many different technologies to implement DCT core in previous works In order to have a fair comparison, the power is normalized to 018- m technology process Similar to [23] as listed in (17) and is also summarized in Table VII The proposed 2-D DCT core has medium power consumption between pervious works Normalized Power (17)

662 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL 20, NO 4, APRIL 2012 Fig 8 Compressed image using JPEG compression processing with the proposed 2-D DCT core (a) Original

is the chip measurement result with DFT insertion and is simulation result without DFT insertion The power result is estimated for 1-D DCT operation C Chip Implementation Implemented in a 18-V TSMC

8 662 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL 20, NO 4, APRIL 2012 Fig 8 Compressed image using JPEG compression processing with the proposed 2-D DCT core (a) Original image (b) Compressed image with QF = 99 TABLE VII COMPARISON OF DIFFERENT 2-D DCT ARCHITECTURES WITH THE PROPOSED ARCHITECTURE Four transistors per NAND2 gate for different technology 5=(): Where 5 is the chip measurement result with DFT insertion and is simulation result without DFT insertion The power result is estimated for 1-D DCT operation C Chip Implementation Implemented in a 18-V TSMC 018- m 1P6M CMOS process, the proposed D DCT core uses the Synopsys Design Compiler to synthesize the RTL code and the Cadence SoC Encounter for placement and routing (P&R) The proposed DCT core has a latency of 84 clock cycles and is operated at 167 MHz to meet HDTV specifications For the design for test (DFT) considerations, the proposed core uses the SynTest SRAMBIST to generate the memory built-in self-test (BIST) circuit and the Synopsys Design Compiler to synthesize scan flip-flops, and the total gate counts with DFT insertion of the proposed core is 158 K The photomicrograph of the proposed core with marking each module s boundary is shown in Fig 9 and the measured characteristics are listed in Table VIII D FPGA Implementation The proposed 2-D DCT core synthesized using Xilinx ISE 101 and Xilinx XC2VP30 FPGA can be operated at a clock frequency of 110 MHz Table IX shows a comparison of the proposed 2-D DCT core with previous FPGA implementations The multiplier-less coordinate rotation digital computer (CORDIC) architecture was presented in [15] Sun et al utilized 120 adders and 80 shifters performing a 2-D quantized discrete cosine integer transform (QDCIT) to achieve a high clock rate of 149 MHz, but with large FPGA area resources Also, a 2-D DCT core that was optimized in both delay and area was addressed by Tumeo et al [14], who presented a balance point between the delay and the area for multiplier-based pipeline DCT design with 19 adders/subtracters and 4 multipliers in their DCT core

CHEN AND CHANG: HIGH PERFORMANCE VIDEO TRANSFORM ENGINE 663 saving in area over the NEDA architecture for a DA-based DCT design Furthermore, the proposed time scheduling strategy arranges the

DCT core utilizing high-accuracy, a small-area, and a high-throughput rate has been achieved using the proposed STS strategy Finally, the proposed high performance 2-D DCT core is fabricated using a

OF 2-D DCT ARCHITECTURES IN FPGAS operated at clock frequency of 107 MHz The proposed DCT core has lower area resources and a medium operation frequency in the XC2VP30 FPGA implementations V

9 CHEN AND CHANG: HIGH PERFORMANCE VIDEO TRANSFORM ENGINE 663 saving in area over the NEDA architecture for a DA-based DCT design Furthermore, the proposed time scheduling strategy arranges the computation time for each process element The 1-D core can calculate the 1st-D and 2nd-D transformations simultaneously, thereby achieving a high throughput rate Therefore, a high performance D DCT core utilizing high-accuracy, a small-area, and a high-throughput rate has been achieved using the proposed STS strategy Finally, the proposed high performance 2-D DCT core is fabricated using a TSMC 018-1P6M CMOS process and implemented in the XC2VP30 FPGA Fig 9 Photomicrograph of the proposed DCT core TABLE VIII CHIP CHARACTERISTICS OF THE PROPOSED 2-D DCT ARCHITECTURE TABLE IX COMPARISONS OF 2-D DCT ARCHITECTURES IN FPGAS operated at clock frequency of 107 MHz The proposed DCT core has lower area resources and a medium operation frequency in the XC2VP30 FPGA implementations V CONCLUSION This paper proposes a high performance video transform engine using the STS strategy The proposed 2-D DCT core employs a single 1-D DCT core and one TMEM with a small area Based on the test image simulations, 9-bit DA-precision is chosen in order to meet the PSNR requirements, and the hardware sharing architecture enables the arithmetic resources to be shared based on time so as to reduce area cost Hence, the number of adders/substracters in 1-D DCT core allows a 74% REFERENCES [1] Y Wang, J Ostermann, and Y Zhang, Video Processing and Communications, 1st ed Englewood Cliffs, NJ: Prentice-Hall, 2002 [2] E Feig and S Winograd, Fast algorithms for the discrete cosine transform, IEEE Trans Signal Process, vol 40, no 9, pp , Sep 1992 [3] Y P Lee, T H Chen, L G Chen, M J Chen, and C W Ku, A cost-effective architecture for two-dimensional DCT/IDCT using direct method, IEEE Trans Circuits Syst Video Technol, vol 7, no 3, pp , Jun 1997 [4] C T Lin, Y C Yu, and L D Van, Cost-effective triple-mode reconfigurable pipeline FFT/IFFT/2-D DCT processor, IEEE Trans Very Large Scale Integr (VLSI) Syst, vol 16, no 8, pp , Aug 2008 [5] M Alam, W Badawy, and G Jullien, A new time distributed DCT architecture for MPEG-4 hardware reference model, IEEE Trans Circuits Syst Video Technol, vol 15, no 5, pp , May 2005 [6] T Xanthopoulos and A P Chandrakasan, A low-power DCT core using adaptive bitwidth and arithmetic activity exploiting signal correlations and quantization, IEEE J Solid-State Circuits, vol 35, no 5, pp , May 2000 [7] A M Shams, A Chidanandan, W Pan, and M A Bayoumi, NEDA: A low-power high-performance DCT architecture, IEEE Trans Signal Process, vol 54, no 3, pp , Mar 2006 [8] C Peng, X Cao, D Yu, and X Zhang, A 250 MHz optimized distributed architecture of 2D DCT, in Proc Int Conf ASIC, 2007, pp [9] C Y Huang, L F Chen, and Y K Lai, A high-speed 2-D transform architecture with unique kernel for multi-standard video applications, in Proc IEEE Int Symp Circuits Syst, 2008, pp [10] S C Hsia and S H Wang, Shift-register-based data transposition for cost-effective discrete cosine transform, IEEE Trans Very Large Scale Integr (VLSI) Syst, vol 15, no 6, pp , Jun 2007 [11] J I Guo, R C Ju, and J W Chen, An efficient 2-D DCT/IDCT core design using cyclic convolution and adder-based realization, IEEE Trans Circuits Syst Video Technol, vol 14, no 4, pp , Apr 2004 [12] H C Hsu, K B Lee, N Y Chang, and T S Chang, Architecture design of shape-adaptive discrete cosine transform and its inverse for MPEG-4 video coding, IEEE Trans Circuits Syst Video Technol, vol 18, no 3, pp , Mar 2008 [13] A Madisetti and A N Willson, A 100 MHz 2-D DCT/IDCT processor for HDTV applications, IEEE Trans Circuits Syst Video Technol, vol 5, no 2, pp , Apr 1995 [14] A Tumeo, M Monchiero, G Palermo, F Ferrandi, and D Sciuto, A pipelined fast 2D-DCT accelerator for FPGA-based SOCs, in Proc IEEE Comput Soc Annu Symp VLSI, 2007, pp [15] C C Sun, P Donner, and J Gotze, Low-complexity multi-purpose IP core for quantized discrete cosine and integer transform, in Proc IEEE Int Symp Circuits Syst, 2009, pp [16] S Ghosh, S Venigalla, and M Bayoumi, Design and implementaion of a 2D-DCT architecture using coefficient distributed arithmetic, in Proc IEEE Comput Soc Annu Symp VLSI, 2005, pp [17] L V Agostini, I S Silva, and S Bampi, Pipelined fast 2D DCT architecture for JPEG image compression, in Proc IEEE Symp Integr Circuits Syst Des, 2001, pp [18] C H Chang, C L Wang, and Y T Chang, Efficient VLSI architectures for fast computation of the discrete Fourier transform and its inverse, IEEE Trans Signal Process, vol 48, no 11, pp , Nov 2000

664 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL 20, NO 4, APRIL 2012 [19] Y H Chen, T Y Chang, and C Y Li, High throughput DA-based DCT with high accuracy error-compensated

2000, vol 3, pp 114 117 [21] Information Technology Digital Compression and Coding of Continuous-Tone Still Images: Requirements and Guidelines, ISO/IEC 10918-1, 1991 [22] W B Pennebaker and J L

10 664 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL 20, NO 4, APRIL 2012 [19] Y H Chen, T Y Chang, and C Y Li, High throughput DA-based DCT with high accuracy error-compensated adder tree, IEEE Trans Very Large Scale Integr (VLSI) Syst, 2010, to be published [20] W Pan, A fast 2-D DCT algorithm via distributed arithmetic optimization, in Proc IEEE Int Conf Image Process, 2000, vol 3, pp [21] Information Technology Digital Compression and Coding of Continuous-Tone Still Images: Requirements and Guidelines, ISO/IEC , 1991 [22] W B Pennebaker and J L Mitchell, JPEG Still Image Data Compression Standard New York: Van Nostrand Reinhold, 1992 [23] S N Tang, J W Tsai, and T Y Chang, A 24-Gs/s FFT processor for OFDM-based WPAN applications, IEEE Trans Circuits Syst II, Exp Briefs, vol 57, no 6, pp , Jun 2010 Tsin-Yuan Chang (S 87 M 90) received the BS degree in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 1982, and the MS and PhD degrees from Michigan State University, East Lansing, in 1987 and 1989, respectively, both in electrical engineering He is an Associate Professor with the Department of Electrical Engineering, National Tsing Hua University His research interests include IC design and testing, Computer arithmetic, and VLSI DSP Yuan-Ho Chen (S 10) received the BS degree from Chang Gung University, Taoyuan, Taiwan, in 2002, and the MS degree from the National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 2004, where he is currently pursuing the PhD degree in electrical engineering His research interests include video/image and digital signal processing, VLSI architecture design, VLSI implementation, computer arithmetic, and signal processing for wireless communication

DUE to the high computational complexity and real-time

DUE to the high computational complexity and real-time IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 3, MARCH 2005 445 A Memory-Efficient Realization of Cyclic Convolution and Its Application to Discrete Cosine Transform Hun-Chen