IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 4, APRIL 2012

A High Performance Video Transform Engine by Using Space-Time Scheduling Strategy

Yuan-Ho Chen, Student Member, IEEE, and Tsin-Yuan Chang, Member, IEEE

Abstract: In this paper, a spatial and time scheduling strategy, called the space-time scheduling (STS) strategy, that achieves high image resolutions in real-time systems is proposed. The spatial scheduling strategy comprises the choice of the distributed arithmetic (DA)-precision bit length and a hardware sharing architecture that reduces the hardware cost. The time scheduling strategy arranges the computations of the two dimensions so that a single 1-D discrete cosine transform (DCT) core calculates the first-dimensional and second-dimensional transformations simultaneously, reaching a hardware utilization of 100%. The DA-precision bit length is chosen as 9 bits instead of the traditional 12 bits based on test image simulations. In addition, the proposed hardware sharing architecture employs a binary signed-digit DA architecture that enables the arithmetic resources to be shared during the four time slots. For this reason, the proposed 2-D DCT core achieves high accuracy with a small area and a high throughput rate, and it is verified using a TSMC 0.18-µm 1P6M CMOS process chip implementation. Measurement results show that the core has a latency of 84 clock cycles with a 52 dB peak-signal-to-noise-ratio and is operated at 167 MHz with 158 K gate counts.

Index Terms: Binary signed-digit (BSD), discrete cosine transform (DCT), distributed arithmetic (DA)-based, space-time scheduling (STS).

Manuscript received February 22, 2010; revised August 09, 2010 and December 30, 2010; accepted January 24, 2011. Date of publication February 28, 2011; date of current version March 12, 2012. This work was supported in part by the Chip Implementation Center and National Science Council under project number CIC T18-98C-12a and NSC E , respectively. The authors are with the Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: yhchen@larc.ee.nthu.edu.tw). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier /TVLSI

I. INTRODUCTION

DISCRETE COSINE TRANSFORM (DCT) is a widely used transform engine for image and video compression applications [1]. In recent years, the development of visual media has progressed toward high-resolution specifications, such as high definition television (HDTV). Consequently, a high-accuracy and high-throughput-rate component is needed to meet future specifications. In addition, in order to reduce the manufacturing costs of the integrated circuit (IC), a low hardware cost design is also required. Therefore, a high performance video transform engine that combines high accuracy, a small area, and a high throughput rate is desired for VLSI designs.

The 2-D DCT core design has often been implemented using either direct [2]-[4] or indirect [5]-[13] methods. The direct methods include fast algorithms that reduce the computation complexity by mapping and rotating the DCT transform into a complex number [3]. However, the structure of the DCT is not as regular as that of the fast Fourier transform (FFT). A regular 2-D DCT core using the direct method, which derives the 2-D shifted FFT (SFFT) from the 2-D DCT algorithm format and shares the hardware in FFT/IFFT/2-D DCT computation, is implemented in [4].

On the other hand, the 2-D DCT cores using the indirect method are implemented based on transpose memory and have the following two structures: 1) two 1-D DCT cores and one transpose memory (TMEM) and 2) a single 1-D DCT core and one TMEM. In the first structure, the 2-D DCT core has a high throughput rate, because the two 1-D DCT cores compute the transformation simultaneously [5]-[9]. In order to reduce the area overhead, a single 1-D DCT core is applied in the second structure [10]-[13], thereby saving hardware costs. Hsia et al. [10] present a 2-D DCT with a single 1-D DCT core and one custom transposed memory that can transpose data sequences using serial input and parallel output. It requires only a small area, but the speed is reduced because the single DCT core performs the first-dimensional (1st-D) and second-dimensional (2nd-D) transformations at different times. Therefore, Guo et al. use two parallel paths to deal with the reduction of throughput rate in the second structure [11]. However, additional input buffers are needed in order to temporarily store the input data during the 2nd-D transformation. Madisetti et al. present a hardware sharing technique to implement the 2-D DCT core using a single 1-D DCT core [13]. The 1-D DCT core can calculate the 1st-D and 2nd-D DCT computations simultaneously, and the throughput achieves 100 Mpels/s. However, the multiplier-based processing element in [13] requires a large amount of circuit area.

As a result, a tradeoff is required between the hardware cost and the speed. Tumeo et al. find a balance point between the area required and the speed for 2-D DCT designs and present a multiplier-based pipeline fast 2-D DCT accelerator implemented using a field-programmable gate array (FPGA) platform [14]. Additionally, several 2-D DCT designs are implemented on FPGAs for fast verification [14]-[17].

In this paper, an 8×8 2-D DCT core that consists of a single 1-D DCT core and one TMEM is proposed using a strategy known as space-time scheduling (STS). Based on accuracy simulations of the DA-based binary signed-digit (BSD) expression, 9-bit DA precision is chosen in order to meet the requirements of the peak-signal-to-noise-ratio (PSNR) outlined in previous works [5]-[7]. Furthermore, the proposed DCT core is designed based on a hardware sharing architecture with BSD DA-based computation so as to reduce the area cost. The arithmetic units share the hardware resources during the four time slots in the DCT core design. A 100% hardware utilization is also achieved by using the proposed time scheduling strategy, and the 1st-D and 2nd-D DCT computations can be calculated at the same time. Therefore, a high performance transform engine with high accuracy, small area, and a high throughput rate has been achieved in this research.

This paper is organized as follows. In Section II, the mathematical derivation of the BSD format distributed arithmetic is given. The proposed 8×8 2-D DCT architecture, including an analysis of the coefficient bits, the hardware sharing architecture, and the proposed time scheduling strategy, is discussed in Section III. Comparisons and discussions are presented in Section IV, and conclusions are drawn in Section V.

II. MATHEMATICAL DERIVATION OF BSD FORMAT DISTRIBUTED ARITHMETIC

The inner product for a general matrix multiplication-and-accumulation can be written as follows:

$$Y = \mathbf{A}^{T}\mathbf{X} = \sum_{i=1}^{L} A_i X_i \qquad (1)$$

where $A_i$ is a fixed coefficient and $X_i$ is the input data. Assume that the coefficient $A_i$ is an $N$-bit BSD number. Equation (1) can be expressed as follows:

$$Y = \begin{bmatrix} 2^{0} & 2^{-1} & \cdots & 2^{-(N-1)} \end{bmatrix} \begin{bmatrix} A_{1,0} & A_{2,0} & \cdots & A_{L,0} \\ A_{1,1} & A_{2,1} & \cdots & A_{L,1} \\ \vdots & \vdots & \ddots & \vdots \\ A_{1,N-1} & A_{2,N-1} & \cdots & A_{L,N-1} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_L \end{bmatrix} = \begin{bmatrix} 2^{0} & 2^{-1} & \cdots & 2^{-(N-1)} \end{bmatrix} \begin{bmatrix} Y_0 \\ Y_1 \\ \vdots \\ Y_{N-1} \end{bmatrix} \qquad (2)$$

where $Y_j = \sum_{i=1}^{L} A_{i,j} X_i$, $A_{i,j} \in \{-1, 0, 1\}$, and $j = 0, 1, \ldots, N-1$. The matrix $[A_{i,j}]$ is called the DA coefficient matrix. In (2), $Y_j$ can be calculated by adding or subtracting the $X_i$ with $A_{i,j} \neq 0$, and then the transform output $Y$ can be obtained by shifting and adding every non-zero $Y_j$. Thus, the inner product computation in (1) can be implemented by using shifters and adders instead of multipliers. Therefore, a smaller area can be achieved by using the BSD DA-based architecture.

III. PROPOSED 8×8 2-D DCT CORE DESIGN

This section introduces the proposed 8×8 2-D DCT core implementation. The 2-D DCT is defined as

$$Z_{uv} = \frac{1}{4} k_u k_v \sum_{i=0}^{7} \sum_{j=0}^{7} x_{ij} \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16}, \quad 0 \le u, v \le 7 \qquad (3)$$

where $k_u = k_v = 1/\sqrt{2}$ for $u, v = 0$, and $k_u = k_v = 1$ for $u, v \neq 0$. Consider the 1-D 8-point DCT first:

$$Z_n = \frac{1}{2} k_n \sum_{m=0}^{7} x_m \cos\frac{(2m+1)n\pi}{16}, \quad 0 \le n \le 7 \qquad (4)$$

where $k_n = 1/\sqrt{2}$ for $n = 0$, $k_n = 1$ for non-zero $n$, and $x_m$ and $Z_n$ denote the input data and the transform output, respectively. By neglecting the scaling factor 1/2, the 1-D 8-point DCT in (4) can be divided into even and odd parts, $\mathbf{Z}_e$ and $\mathbf{Z}_o$, as listed in (5) and (6), respectively:

$$\mathbf{Z}_e = \begin{bmatrix} Z_0 \\ Z_2 \\ Z_4 \\ Z_6 \end{bmatrix} = \mathbf{C}_e \mathbf{a} \qquad (5)$$

$$\mathbf{Z}_o = \begin{bmatrix} Z_1 \\ Z_3 \\ Z_5 \\ Z_7 \end{bmatrix} = \mathbf{C}_o \mathbf{b} \qquad (6)$$

where

$$\mathbf{C}_e = \begin{bmatrix} c_4 & c_4 & c_4 & c_4 \\ c_2 & c_6 & -c_6 & -c_2 \\ c_4 & -c_4 & -c_4 & c_4 \\ c_6 & -c_2 & c_2 & -c_6 \end{bmatrix} \qquad (7)$$

and

$$\mathbf{C}_o = \begin{bmatrix} c_1 & c_3 & c_5 & c_7 \\ c_3 & -c_7 & -c_1 & -c_5 \\ c_5 & -c_1 & c_7 & c_3 \\ c_7 & -c_5 & c_3 & -c_1 \end{bmatrix} \qquad (8)$$

with $c_i = \cos(i\pi/16)$, and

$$a_i = x_i + x_{7-i}, \quad b_i = x_i - x_{7-i}, \quad i = 0, 1, 2, 3. \qquad (9)$$

For the 2-D DCT core implementation, three main strategies for increasing the hardware and time utilization, together called the space-time scheduling (STS) strategy, are proposed:

1) find the DA-precision bit length for the BSD representation that achieves the system accuracy requirements;
2) share the hardware resources in time to reduce the area cost;
3) plan a 100% hardware utilization by using the time scheduling strategy.

A. Analysis of the Coefficient Bits

The seven internal coefficients from $c_1$ to $c_7$ for the 2-D DCT transformation are expressed as BSD representations in order to save computation time and hardware cost, as well as to achieve the requirements of PSNR. The system PSNR is defined [1] as follows:

$$\mathrm{PSNR} = 10 \log_{10} \frac{255^{2}}{\mathrm{MSE}} \qquad (10)$$

where the MSE is the mean-square-error between the original image and the reconstructed image for each pixel.
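As a quick illustration of the PSNR metric just defined, the following minimal Python sketch computes the mean-square error and the PSNR for two equally sized pixel sequences. It assumes 8-bit gray-level data (peak value 255) and the common form 10 log10(peak^2 / MSE); the function names `mse` and `psnr` are hypothetical and are not taken from the paper.

```python
import math


def mse(original, reconstructed):
    """Mean-square error between two equally long sequences of pixel values."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def psnr(original, reconstructed, peak=255):
    """PSNR in dB; peak=255 assumes 8-bit gray-level pixels (an assumption)."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf  # identical images: no noise at all
    return 10 * math.log10(peak ** 2 / err)
```

For a 2-D image, the rows can simply be flattened into one sequence before calling `psnr`, since the MSE is averaged over all pixels.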

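The shift-and-add evaluation described in Section II, and the effect of the BSD bit length analyzed here, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the encoder below uses the non-adjacent form, which is one particular signed-digit encoding, so the digits it produces need not match the paper's Table II; the names `to_bsd`, `bsd_value`, and `shift_add_multiply` are hypothetical.

```python
import math


def to_bsd(value, n_digits):
    """Encode value as n_digits signed digits in {-1, 0, 1}, most significant
    first, with weights 2**0, 2**-1, ..., 2**-(n_digits - 1).
    Assumption: non-adjacent form is used as the signed-digit encoding."""
    k = round(value * 2 ** (n_digits - 1))  # fixed-point integer to encode
    digits = []  # least significant digit first
    while k != 0:
        if k % 2:
            d = 2 - (k % 4)  # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k //= 2
    if len(digits) > n_digits:
        raise ValueError("value does not fit in the requested number of digits")
    digits += [0] * (n_digits - len(digits))
    return digits[::-1]


def bsd_value(digits):
    """Numerical value represented by a digit list produced by to_bsd."""
    return sum(d * 2.0 ** -j for j, d in enumerate(digits))


def shift_add_multiply(x, digits):
    """Multiply the integer x by the BSD coefficient using only shifts,
    additions, and subtractions; the final division restores the scale."""
    n = len(digits)
    acc = 0
    for j, d in enumerate(digits):
        if d:
            acc += d * (x << (n - 1 - j))  # add or subtract a shifted copy of x
    return acc / 2 ** (n - 1)


c4 = math.cos(4 * math.pi / 16)
digits = to_bsd(c4, 9)
print(digits, bsd_value(digits), shift_add_multiply(100, digits))
```

Each non-zero digit costs one adder or subtracter, so a shorter digit string with fewer non-zero digits lowers the hardware cost at the price of a larger coefficient error, which is the tradeoff examined in this section.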
Fig. 1. Simulation for system PSNR in different BSD bit length expressions.

TABLE I. NORMALIZED HW COST IN DIFFERENT BSD BIT LENGTH EXPRESSIONS

TABLE II. 9-BIT BSD EXPRESSIONS FOR DCT COEFFICIENTS

First, the system PSNR is simulated for BSD coefficients of different bit lengths using the popular test images Lena, Peppers, and Baboon. After the original test image pixels are passed through the forward and inverse DCT, the system PSNR versus the BSD bit length is illustrated in Fig. 1(a), and the reconstructed images, i.e., those that have passed through the forward and inverse DCT for different BSD bit lengths, are shown in Fig. 1(b). The normalized hardware (HW) cost of the processing element (PE) for the different BSD bit lengths is shown in Table I. The HW cost is the synthesized area for each BSD bit length using Synopsys Design Compiler with the Artisan TSMC 0.18-µm standard cell library. The gap in accuracy between the 8-bit and the 9-bit BSD expressions is shown in Fig. 1(a), and the HW cost increases when the BSD bit length increases. Consequently, the 9-bit BSD expression is chosen as the tradeoff between the hardware cost and the system accuracy. The BSD expressions and the accuracy of the DCT coefficients are listed in Table II, where the symbol $\bar{1}$ indicates the BSD digit $-1$. The coefficient accuracy is defined as the signal-to-quantization-noise ratio (SQNR) shown in (11):

$$\mathrm{SQNR} = 10 \log_{10} \frac{P_s}{P_q} \qquad (11)$$

where $P_s$ and $P_q$ are the power of the desired floating-point signal and of the quantization noise, respectively.

B. Hardware Sharing Strategy

All modules in the 1-D DCT core, including the modified two-input butterfly (MBF2), the pre-reorder, the process element even (PEE), the process element odd (PEO), and the post-reorder, share the hardware resources in order to reduce the area cost. Moreover, the 8×8 2-D DCT core is implemented using a single 1-D DCT core and one TMEM. The architecture of the 2-D DCT core is described in the following sections.

1) Modified Butterfly Module: Equation (9) is easily implemented using a two-input butterfly module [18] called BF2. In general, the adder and subtracter in the BF2 have a hardware utilization rate of 50%. In order to enable the hardware resources to be shared, additional multiplexers and Reorder Registers are added to the proposed MBF2 module, as illustrated in Fig. 2. The Reorder Registers consist of four word registers that use the control signals to select the input data and use enable signals to output or hold the data. Similar to BF2, the operation of the proposed MBF2 has an eight-clock-cycle period. In the first four cycles, the 1st-D input data shift into Reorder Registers 1, and the 2nd-D input data undergo addition and subtraction. In the next four cycles, the roles of the 1st-D and 2nd-D data are exchanged: the 1st-D data are used to calculate $a_i$ and $b_i$ in (9) with the adder and subtracter. In these second four cycles, one set of results is fed into the next stage and the other shifts into Reorder Registers 1 in each cycle. At the same time, the 2nd-D input data shift into Reorder Registers 2. Therefore, the adder and subtracter in MBF2 serve the 1st-D and 2nd-D operations in turn to achieve 100% hardware utilization.

Fig. 2. Proposed MBF2 architecture.

2) Pre-Reorder Module: The even part transform output in (5) can be modified as in (12), and the odd part transform output in (6) can be rewritten as in (13).

(12)

(13)

From (12) and (13), the transform output can be calculated using four separate time slots in order to share the hardware resources. Hence, (12) and (13) can be expressed as (14) and (15), respectively, in terms of the time slot, the coefficient elements, and the transform inputs from the MBF2 stage.

(14)

(15)

Equations (14) and (15) for the four time slots are summarized in Table III. The proposed Pre-Reorder module orders the inputs according to the time slot, as indicated in Table III. The hardware resources can be shared by reordering the inputs during the four separate time slots.

TABLE III. COEFFICIENTS AND INPUT DATA FROM MBF2 FOR THE FOUR SEPARATE TIME SLOT OUTPUTS

TABLE IV. EVEN PART TRANSFORMATION TABLE

3) Process Element Module: For DA-based computation, the even part and odd part transformations can be implemented using PEE and PEO, respectively. From Table III, the even part transformation can be expanded into the DA-based computation format and can share the hardware resources at the bit level. The coefficient vector takes two combinations, depending on the transform output being computed. Table IV expresses (14) in the bit-level formulation. Given the input data, the even part transform output needs only three adders, a Data-Distributed module, and one even part adder-tree (EAT) to obtain the result during the four time slots. The Data-Distributed module consists of seven two-input multiplexers that assign the different non-zero values to each weighted input during the four separate time slots, as shown in Table IV, and the EAT sums the weighted values in a tree of adders to complete the transform output. Similarly, the odd part transformation in (15) can be implemented in a DA-based format using four adders and one odd part adder-tree (OAT). The EAT and OAT can be implemented using the error-compensated adder tree [19] to improve the computation accuracy. Moreover, there are three pipeline stages (two in both the EAT and the OAT) in each PEE and PEO module, which enable high speed computation. The proposed architecture for both PEE and PEO is shown in Fig. 3.

Fig. 3. Architecture for the proposed PEE and PEO.

4) Post-Reorder Module: For the last stage of the 1-D DCT computation, the data sequence after the PEE and PEO must be merged and repermuted in the Post-Reorder module, which permutes the 8-point data sequence delivered by the PEE and the PEO into the output order. Fig. 4 shows the proposed Post-Reorder architecture. Two multiplexers select the data that is fed into the different Reorder Registers in order to permute the output order. The transform output of the 1st-D DCT is input into the TMEM. In addition, the 2nd-D DCT transform output is completed after the permutation by Reorder Registers 4.

Fig. 4. Proposed post-reorder architecture.

5) 2-D DCT Core Architecture: To save hardware costs, the proposed 2-D DCT core, as shown in Fig. 5, is implemented using a single 1-D DCT core and one TMEM. The design comprises an MBF2, a Pre-Reorder module, a PEE, a PEO, a Post-Reorder module, and one TMEM. The TMEM is implemented using 64-word 12-bit dual-port registers and has a latency of 52 cycles. Based on the time scheduling strategy described in Section III-C, a hardware utilization of 100% can be achieved.

Fig. 5. Proposed 2-D DCT architecture.

C. Time Scheduling Strategy

As a result of the time scheduling strategy, the 1st-D and 2nd-D transforms can be computed simultaneously, which increases the hardware utilization to 100%. The timing flow chart for the proposed strategy is illustrated in Fig. 6.

Fig. 6. Timing scheduling for the proposed 2-D DCT core.

1) 1st-D Data Computation: In the first four cycles, the first four-point input data shift into Reorder Registers 1. During cycles 5 to 8, the MBF2 performs additions and subtractions using the first eight-point input data. In the 9th cycle, the even part of the Pre-Reorder module obtains the input data and reorders them for output to the PEE according to Table III. During cycles 10 to 13, the even part transform output is calculated in the PEE module, and it is completed after the latency of the three pipeline stages in the PEE. The odd part transform output is likewise calculated in the PEO module. Therefore, after the 16th cycle, the Post-Reorder module permutes the 1st-D transform results and inputs them into the TMEM.

2) 2nd-D Data Computation: From the 17th cycle, the 1st-D transform data is input into the TMEM. Because the latency of the TMEM is 52 cycles, the 2nd-D computation data transposed by the TMEM is sent to the input of Reorder Registers 2 in MBF2 at the 69th cycle. The adder and subtracter then compute the first 2nd-D 8-point data sequence, which the Pre-Reorder module subsequently permutes. The PEE and PEO then calculate the first 2nd-D data, the PEO during cycles 82 to 85. The Post-Reorder module finishes the 2nd-D transform results from the 84th cycle, and the 2-D DCT transform output data is obtained at the end of the 84th cycle.

3) Hardware Utilization: In Fig. 6, the adder and subtracter in MBF2 are at 50% utilization during cycles 1 to 68. After the 68th cycle, however, the MBF2 achieves 100% hardware utilization because the 2nd-D data are being input. Likewise, the utilization of the PEE and PEO reaches 100% from the 74th and 78th cycles, respectively. The Post-Reorder module does not work during cycles 1 to 12, but after the 16th cycle the 1st-D transform outputs are completed using Reorder Registers 3 in the Post-Reorder module. Similarly, the 2nd-D transform result is finished using Reorder Registers 4 in the Post-Reorder module from the 84th cycle. At the end of the 84th cycle, the 1st-D and 2nd-D transform outputs are obtained simultaneously. Hence, the hardware utilization increases to 100% after the 84th cycle, and the latency of the proposed 8×8 2-D DCT core is 84 clock cycles. Based on the proposed time scheduling strategy, a 100% hardware utilization is achieved, and computation of the data sequence is straightforward.

In summary, following an accuracy analysis of the coefficients, the DA-precision bit length was chosen as a 9-bit BSD expression. In this way, the hardware cost is reduced by using the 9-bit BSD DA precision instead of the 12 bits used in previous works. Using the proposed hardware sharing strategy, the DCT computation shares the hardware resources during the four time slots, thereby reducing the area cost. In addition, a 100% hardware utilization can be achieved based on the proposed time scheduling strategy, yielding a high-throughput-rate design. Therefore, a high-accuracy, small-area, and high-throughput-rate 2-D DCT core has been obtained by applying the proposed STS strategy.

IV. DISCUSSION AND COMPARISONS

In this section, the system accuracy test for the proposed 2-D DCT core is discussed. Then, a comparison of the proposed 2-D DCT core with previous works is presented. Finally, the characteristics of the implemented chip and FPGA are described.

A. System Accuracy Test

The seven test images used to check system accuracy all have the same size in pixels, with each pixel represented by 8-bit, 256-gray-level data. After the original test image pixels are fed into the proposed 2-D DCT core, the transform output data is captured and passed to the MATLAB tool in order to compute the inverse DCT using 64-bit double-precision operations. The verification flow follows the test flow of previous works ([4], [7], and [20]), as illustrated in Fig. 7. The PSNR values of the test images verified by the test flow are listed in Table V, and the average PSNR is close to 52 dB using the proposed 8×8 2-D DCT core.

Fig. 7. Accuracy verification flow for proposed DCT core.

TABLE V. PSNR PERFORMANCE FOR EACH TEST PATTERN

TABLE VI. PSNR VERSUS QF FOR THE PROPOSED 2-D DCT APPLIED TO THE JPEG COMPRESSION FLOW

Furthermore, the proposed 2-D DCT core was inserted into the JPEG compression flow [21], [22] in order to evaluate it in a real compression system. The quality factor (QF) dominates the compression quality and data size, and therefore the QF affects the system PSNR of the JPEG compression. Table VI lists the PSNR versus QF for the proposed 2-D DCT core applied to JPEG, and Fig. 8 illustrates both the original and the compressed versions of the test image Lena using the JPEG compression flow. The quality of the compressed image is excellent.

B. Comparison With Other 2-D DCT Architectures

Table VII compares the proposed 8×8 2-D DCT core with previous works. In [4], Lin et al. derived an SFFT format from a 2-D DCT and shared the hardware resources for the FFT/IFFT/2-D DCT computation in a triple-mode processor design. The 2-D DCT core is implemented in a direct 2-D method based on a pipelined radix single-delay-feedback path architecture. In [7], a 2-D DCT core used the NEDA architecture with two 1-D cores and one TMEM to achieve a low-cost design. Speed limitations exist in the serial shifting and addition of the NEDA operations. Huang et al. improved the throughput rate using an adder-tree (AT) instead of shifting and addition operations for the NEDA architecture [9]. However, the parallel input and output in the NEDA architecture required a large amount of I/O resources.

The 2-D DCT core using a single 1-D core and one TMEM for a multiplier-based DCT implementation was addressed in [10]. An area-saving serial-input and parallel-output transpose memory was also implemented in the 2-D DCT core to achieve a small-area design operated at a frequency of 55 MHz. However, the 1-D core could not compute the 1st-D and 2nd-D transformations simultaneously. Hence, the throughput rate was half of the operation frequency. A hardware sharing technique was therefore used in the 2-D DCT core of [13] to deal with the throughput rate reduction, and that core achieved a throughput of 100 Mpels/s.

The proposed 2-D DCT core employs a single 1-D core and one TMEM in order to reduce the core area. Also, with the proposed hardware sharing architecture, the 1-D core utilizes only 9 adders and 2 adder-trees (AT), which reduces the hardware cost. The throughput rate of the proposed 2-D DCT core is enhanced by the time scheduling strategy: the proposed 2-D core is able to compute the 1st-D and 2nd-D transformations simultaneously, and a throughput rate of 167 Mpels/s is achieved at 167 MHz. Therefore, the proposed 2-D DCT core has the highest hardware efficiency, defined as follows:

$$\text{Hardware Efficiency} = \frac{\text{Throughput rate}}{\text{Gate count}} \quad (\text{pels/s-gate}) \qquad (16)$$

A comparison with previous works is summarized in Table VII. Note that previous works presented only simulation results, while the proposed work has a chip implementation with design for test (DFT) and measured results. For a fair comparison, the simulation results of the proposed work are also shown, that is, without the DFT circuitry of the chip implementation, which needs extra area for the memory built-in self-test circuit and the scan flip-flops. The gate count of the STS DCT without DFT insertion is 122 K, and its hardware efficiency is 1369 pels/s-gate, the largest in Table VII. On the other hand, the proposed 2-D DCT achieves a PSNR of 52 dB, verified by the test flow in Fig. 7, which is a medium accuracy compared with previous works.

The power consumption is another issue in circuit design. Previous works implement the DCT core in many different technologies. For a fair comparison, the power is normalized to the 0.18-µm technology process in a manner similar to [23], as listed in (17), and the result is also summarized in Table VII. The power consumption of the proposed 2-D DCT core lies in the middle of the range of previous works.

Normalized Power (17)
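The hardware-efficiency figure of merit in (16) can be checked numerically with the short Python sketch below. It assumes that the efficiency is the throughput rate divided by the gate count, and that the throughput equals 167 Mpels/s (one pixel per clock cycle at 167 MHz); both are inferences from the surrounding text, and the function name `hardware_efficiency` is hypothetical.

```python
def hardware_efficiency(throughput_pels_per_s, gate_count):
    """Throughput per gate in pels/s-gate (assumed form of the figure of merit)."""
    if gate_count <= 0:
        raise ValueError("gate_count must be positive")
    return throughput_pels_per_s / gate_count


# Gate counts quoted in the text: 122 K without DFT, 158 K with DFT insertion.
without_dft = hardware_efficiency(167e6, 122e3)
with_dft = hardware_efficiency(167e6, 158e3)
print(round(without_dft), round(with_dft))
```

With the 122 K gate count, the rounded result reproduces the value of 1369 pels/s-gate quoted for the design without DFT insertion.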

Fig. 8. Compressed image using JPEG compression processing with the proposed 2-D DCT core. (a) Original image. (b) Compressed image with QF = 99.

TABLE VII. COMPARISON OF DIFFERENT 2-D DCT ARCHITECTURES WITH THE PROPOSED ARCHITECTURE

Notes to Table VII: four transistors per NAND2 gate are assumed for the different technologies. Where two values are given for the proposed design, the first is the chip measurement result with DFT insertion and the value in parentheses is the simulation result without DFT insertion. The power result is estimated for 1-D DCT operation.

C. Chip Implementation

Implemented in a 1.8-V TSMC 0.18-µm 1P6M CMOS process, the proposed 8×8 2-D DCT core uses Synopsys Design Compiler to synthesize the RTL code and Cadence SoC Encounter for placement and routing (P&R). The proposed DCT core has a latency of 84 clock cycles and is operated at 167 MHz to meet HDTV specifications. For the design for test (DFT) considerations, the proposed core uses SynTest SRAMBIST to generate the memory built-in self-test (BIST) circuit and Synopsys Design Compiler to synthesize the scan flip-flops, and the total gate count of the proposed core with DFT insertion is 158 K. The photomicrograph of the proposed core, with the boundary of each module marked, is shown in Fig. 9, and the measured characteristics are listed in Table VIII.

Fig. 9. Photomicrograph of the proposed DCT core.

TABLE VIII. CHIP CHARACTERISTICS OF THE PROPOSED 2-D DCT ARCHITECTURE

D. FPGA Implementation

The proposed 2-D DCT core, synthesized using Xilinx ISE 10.1 for the Xilinx XC2VP30 FPGA, can be operated at a clock frequency of 110 MHz. Table IX shows a comparison of the proposed 2-D DCT core with previous FPGA implementations. The multiplier-less coordinate rotation digital computer (CORDIC) architecture was presented in [15]. Sun et al. utilized 120 adders and 80 shifters performing a 2-D quantized discrete cosine integer transform (QDCIT) to achieve a high clock rate of 149 MHz, but with large FPGA area resources. Also, a 2-D DCT core that was optimized in both delay and area was addressed by Tumeo et al. [14], who presented a balance point between the delay and the area for a multiplier-based pipeline DCT design with 19 adders/subtracters and 4 multipliers in their DCT core, operated at a clock frequency of 107 MHz. The proposed DCT core has lower area resources and a medium operation frequency among the XC2VP30 FPGA implementations.

TABLE IX. COMPARISONS OF 2-D DCT ARCHITECTURES IN FPGAS

V. CONCLUSION

This paper proposes a high performance video transform engine using the STS strategy. The proposed 2-D DCT core employs a single 1-D DCT core and one TMEM with a small area. Based on the test image simulations, 9-bit DA precision is chosen in order to meet the PSNR requirements, and the hardware sharing architecture enables the arithmetic resources to be shared in time so as to reduce the area cost. Hence, the number of adders/subtracters in the 1-D DCT core allows a 74% saving in area over the NEDA architecture for a DA-based DCT design. Furthermore, the proposed time scheduling strategy arranges the computation time for each process element. The 1-D core can calculate the 1st-D and 2nd-D transformations simultaneously, thereby achieving a high throughput rate. Therefore, a high performance 8×8 2-D DCT core combining high accuracy, a small area, and a high throughput rate has been achieved using the proposed STS strategy. Finally, the proposed high performance 2-D DCT core is fabricated using a TSMC 0.18-µm 1P6M CMOS process and implemented in the XC2VP30 FPGA.

REFERENCES

[1] Y. Wang, J. Ostermann, and Y. Zhang, Video Processing and Communications, 1st ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.
[2] E. Feig and S. Winograd, "Fast algorithms for the discrete cosine transform," IEEE Trans. Signal Process., vol. 40, no. 9, Sep. 1992.
[3] Y. P. Lee, T. H. Chen, L. G. Chen, M. J. Chen, and C. W. Ku, "A cost-effective architecture for two-dimensional DCT/IDCT using direct method," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 3, Jun. 1997.
[4] C. T. Lin, Y. C. Yu, and L. D. Van, "Cost-effective triple-mode reconfigurable pipeline FFT/IFFT/2-D DCT processor," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, Aug. 2008.
[5] M. Alam, W. Badawy, and G. Jullien, "A new time distributed DCT architecture for MPEG-4 hardware reference model," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 5, May 2005.
[6] T. Xanthopoulos and A. P. Chandrakasan, "A low-power DCT core using adaptive bitwidth and arithmetic activity exploiting signal correlations and quantization," IEEE J. Solid-State Circuits, vol. 35, no. 5, May 2000.
[7] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," IEEE Trans. Signal Process., vol. 54, no. 3, Mar. 2006.
[8] C. Peng, X. Cao, D. Yu, and X. Zhang, "A 250 MHz optimized distributed architecture of 2D DCT," in Proc. Int. Conf. ASIC, 2007.
[9] C. Y. Huang, L. F. Chen, and Y. K. Lai, "A high-speed 2-D transform architecture with unique kernel for multi-standard video applications," in Proc. IEEE Int. Symp. Circuits Syst., 2008.
[10] S. C. Hsia and S. H. Wang, "Shift-register-based data transposition for cost-effective discrete cosine transform," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 6, Jun. 2007.
[11] J. I. Guo, R. C. Ju, and J. W. Chen, "An efficient 2-D DCT/IDCT core design using cyclic convolution and adder-based realization," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 4, Apr. 2004.
[12] H. C. Hsu, K. B. Lee, N. Y. Chang, and T. S. Chang, "Architecture design of shape-adaptive discrete cosine transform and its inverse for MPEG-4 video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 3, Mar. 2008.
[13] A. Madisetti and A. N. Willson, "A 100 MHz 2-D DCT/IDCT processor for HDTV applications," IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 2, Apr. 1995.
[14] A. Tumeo, M. Monchiero, G. Palermo, F. Ferrandi, and D. Sciuto, "A pipelined fast 2D-DCT accelerator for FPGA-based SOCs," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2007.
[15] C. C. Sun, P. Donner, and J. Gotze, "Low-complexity multi-purpose IP core for quantized discrete cosine and integer transform," in Proc. IEEE Int. Symp. Circuits Syst., 2009.
[16] S. Ghosh, S. Venigalla, and M. Bayoumi, "Design and implementation of a 2D-DCT architecture using coefficient distributed arithmetic," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2005.
[17] L. V. Agostini, I. S. Silva, and S. Bampi, "Pipelined fast 2D DCT architecture for JPEG image compression," in Proc. IEEE Symp. Integr. Circuits Syst. Des., 2001.
[18] C. H. Chang, C. L. Wang, and Y. T. Chang, "Efficient VLSI architectures for fast computation of the discrete Fourier transform and its inverse," IEEE Trans. Signal Process., vol. 48, no. 11, Nov. 2000.
[19] Y. H. Chen, T. Y. Chang, and C. Y. Li, "High throughput DA-based DCT with high accuracy error-compensated adder tree," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2010, to be published.
[20] W. Pan, "A fast 2-D DCT algorithm via distributed arithmetic optimization," in Proc. IEEE Int. Conf. Image Process., 2000, vol. 3.
[21] Information Technology - Digital Compression and Coding of Continuous-Tone Still Images: Requirements and Guidelines, ISO/IEC, 1991.
[22] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard. New York: Van Nostrand Reinhold, 1992.
[23] S. N. Tang, J. W. Tsai, and T. Y. Chang, "A 2.4-GS/s FFT processor for OFDM-based WPAN applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 6, Jun. 2010.

Yuan-Ho Chen (S'10) received the B.S. degree from Chang Gung University, Taoyuan, Taiwan, in 2002, and the M.S. degree from the National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 2004, where he is currently pursuing the Ph.D. degree in electrical engineering. His research interests include video/image and digital signal processing, VLSI architecture design, VLSI implementation, computer arithmetic, and signal processing for wireless communication.

Tsin-Yuan Chang (S'87-M'90) received the B.S. degree in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 1982, and the M.S. and Ph.D. degrees from Michigan State University, East Lansing, in 1987 and 1989, respectively, both in electrical engineering. He is an Associate Professor with the Department of Electrical Engineering, National Tsing Hua University. His research interests include IC design and testing, computer arithmetic, and VLSI DSP.


More information

FPGA Based Low Area Motion Estimation with BISCD Architecture

FPGA Based Low Area Motion Estimation with BISCD Architecture www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 10 October, 2014 Page No. 8610-8614 FPGA Based Low Area Motion Estimation with BISCD Architecture R.Pragathi,

More information

A High-Speed FPGA Implementation of an RSD-Based ECC Processor

A High-Speed FPGA Implementation of an RSD-Based ECC Processor RESEARCH ARTICLE International Journal of Engineering and Techniques - Volume 4 Issue 1, Jan Feb 2018 A High-Speed FPGA Implementation of an RSD-Based ECC Processor 1 K Durga Prasad, 2 M.Suresh kumar 1

More information

Design and Implementation of CVNS Based Low Power 64-Bit Adder

Design and Implementation of CVNS Based Low Power 64-Bit Adder Design and Implementation of CVNS Based Low Power 64-Bit Adder Ch.Vijay Kumar Department of ECE Embedded Systems & VLSI Design Vishakhapatnam, India Sri.Sagara Pandu Department of ECE Embedded Systems

More information

Area And Power Efficient LMS Adaptive Filter With Low Adaptation Delay

Area And Power Efficient LMS Adaptive Filter With Low Adaptation Delay e-issn: 2349-9745 p-issn: 2393-8161 Scientific Journal Impact Factor (SJIF): 1.711 International Journal of Modern Trends in Engineering and Research www.ijmter.com Area And Power Efficient LMS Adaptive

More information

A 4096-Point Radix-4 Memory-Based FFT Using DSP Slices

A 4096-Point Radix-4 Memory-Based FFT Using DSP Slices A 4096-Point Radix-4 Memory-Based FFT Using DSP Slices Mario Garrido Gálvez, Miguel Angel Sanchez, Maria Luisa Lopez-Vallejo and Jesus Grajal Journal Article N.B.: When citing this work, cite the original

More information

LOW-POWER SPLIT-RADIX FFT PROCESSORS

LOW-POWER SPLIT-RADIX FFT PROCESSORS LOW-POWER SPLIT-RADIX FFT PROCESSORS Avinash 1, Manjunath Managuli 2, Suresh Babu D 3 ABSTRACT To design a split radix fast Fourier transform is an ideal person for the implementing of a low-power FFT

More information

An Efficient Hardware Architecture for H.264 Transform and Quantization Algorithms

An Efficient Hardware Architecture for H.264 Transform and Quantization Algorithms IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.6, June 2008 167 An Efficient Hardware Architecture for H.264 Transform and Quantization Algorithms Logashanmugam.E*, Ramachandran.R**

More information