CHAPTER 4. DIGITAL DOWNCONVERTER FOR WiMAX SYSTEM

CHAPTER 4

IMPLEMENTATION OF DIGITAL UPCONVERTER AND DIGITAL DOWNCONVERTER FOR WiMAX SYSTEM

4.1 Introduction

FPGAs provide an ideal implementation platform for developing broadband wireless systems such as WCDMA and WiMAX. To accelerate the performance of these broadband systems, state-of-the-art high-performance FPGAs are used. FPGAs have gained rapid acceptance and growth over the past decade because they can be applied to a very wide range of applications. Using logic blocks and programmable routing resources, FPGAs can be configured to implement custom hardware functionality. Because FPGAs are completely reconfigurable, they can be reprogrammed for new applications. The development of high-level design tools such as System Generator and DSP Builder has shortened the design cycle. As FPGAs are truly parallel in nature, different processing operations do not have to compete for the same resources. Each independent processing task is assigned to a dedicated section of the chip and can function autonomously without any influence from other logic blocks. FPGAs intended for dedicated DSP applications are also available. Thus the same filtering operations currently implemented in custom VLSI devices can now be implemented in an FPGA device (Sun, M.T. et al., 1989).

Distributed Arithmetic (DA) can be explored to save resources in FPGA implementation of DSP functions. DA can be used to replace combinational elements with memory, resulting in a low-cost look-up table (LUT) based FPGA implementation. The designer can also select a serial or parallel DA implementation to trade off speed against resource utilization (Stanley A. White, 1989).

In this chapter, FPGA implementations of the DUC and DDC for a WiMAX system are proposed using DA. Different configurations for serial and parallel implementations are presented and compared. The resulting implementations are compared in terms of resource utilization for a Stratix II GX device. DSP Builder is used to implement pipelining and scaling of parameters. The basics of the DA architecture and methods to reduce the ROM requirement are presented in section 4.2. An overview of the architecture of the Stratix II GX device is presented in section 4.3. Serial and parallel implementations of an FIR filter with the DA architecture are explored in section 4.4. Implementation of the DUC and DDC is presented in sections 4.5 and 4.6 respectively.

4.2 Distributed Arithmetic Architecture

DA is a very efficient mechanism to trade combinational logic for memory in high performance computation. DA can significantly help to save area in DSP hardware design. When the number of elements in a vector is nearly the same as the word size, DA is quite fast because it replaces the explicit multiplications by ROM look-ups, which is an efficient technique to implement on Field Programmable Gate Arrays (FPGAs) (Sun, M.T. 1989).

Figure 4.1: Basic Architecture of Distributed Arithmetic

In DA, multiplications are reordered and mixed in such a way that the arithmetic becomes

distributed through the structure rather than being lumped. With the advent of FPGA technology, DA plays a significant role in improving system implementations. The basic architecture for DA implementation is shown in figure 4.1. No multipliers are required for the DA implementation; only accumulators, registers and read only memories (ROMs) are used. The N-bit registers are used to store the input vectors. This is shown with the help of an example, in which the general sum of products (SOP) equation 4.1, which defines the response of linear, time invariant networks, is implemented with the DA architecture shown in figure 4.2.

$$y_n = \sum_{k=0}^{M-1} a_k\, b_k(n) \tag{4.1}$$

where $y_n$ is the response of the network at time n, $b_k(n)$ is the kth input variable at time n, and $a_k$ is the weighting factor of the kth input variable, which is constant for all n and so remains time invariant (Xilinx application note). Because the coefficients are constants, the partial sums can be precomputed. For any one bit position of the inputs, the sum over the M inputs can take only $2^M$ possible values, which can be stored in a ROM of $2^M$ words. The bit serial input data can be used to directly address the ROM contents, which are fed into an accumulator to obtain the inner sum. Additional control circuitry is required to handle subtraction when the sign bit addresses the ROM (Chung, J. C., et al., 1998). The accumulator output converges to the final result after N cycles.

To show this process, an FIR filter implemented using the DA architecture is shown in figure 4.2. The input vector X holds four elements of four bits each. The ROM contains all 16 combinations of the constant vector elements $A_i$. Each of the $X_i$ elements is delivered one bit at a time, with the MSB first. Every clock cycle, the register contains the sum of the left shifted version of the previous register value and the current ROM contents. $T_s$ is the sign bit to control

the addition/subtraction operation. When $T_s$ is high, the accumulator subtracts the current ROM contents from the left shifted version of the previous result, and when it is low, the accumulator adds the current ROM contents to the left shifted previous result. After four cycles, the register holds the final dot product.

Figure 4.2: FIR Filter using Distributed Arithmetic

The only problem is the increased size of the required ROM, which grows exponentially with each added input address line. There is one address line for each element in a vector, so K elements need K address lines and a ROM of $2^K$ words. This ROM size problem can be reduced by two methods (Ansari, Z.A. 2003). The first method is based on ROM decomposition, which is shown in figure 4.3. Here the memory is partitioned into smaller parts, and all ROM outputs are added using an additional adder. The amount of memory is reduced from $2^N$ words to $2 \cdot 2^{N/2}$

words, if the original memory is partitioned into two parts. For N = 8, the number of words to be stored is reduced from $2^8 = 256$ to $2 \cdot 2^4 = 32$. Hence, this approach reduces the memory significantly at the cost of an additional adder.

Figure 4.3: Reducing the memory using decomposition.

The second approach is based on a special coding of the ROM content. The memory size can be halved by using a scheme based on the identity

$$x = \frac{1}{2}\left[x - (-x)\right] \tag{4.2}$$

In two's complement representation, a negative number is obtained by inverting all bits and then adding a 1 to the least significant position of the original number. With $W_d$ denoting the data word length, the identity 4.2 can be rewritten as (Stanley A. White, 1989)

$$x = \frac{1}{2}\left[-x_0 + \sum_{k=1}^{W_d-1} x_k 2^{-k} - \left(-\bar{x}_0 + \sum_{k=1}^{W_d-1} \bar{x}_k 2^{-k} + 2^{-(W_d-1)}\right)\right] \tag{4.3}$$

$$x = -(x_0 - \bar{x}_0)\,2^{-1} + \sum_{k=1}^{W_d-1} (x_k - \bar{x}_k)\,2^{-k-1} - 2^{-W_d} \tag{4.4}$$

Notice that $(x_k - \bar{x}_k)$ can only take on the values -1 or +1. Using this expression in the FIR filter equation yields

$$y = \sum_{k=1}^{W_d-1} F_k(x_{1k}, x_{2k}, \ldots, x_{Nk})\,2^{-k-1} - F_0(x_{10}, x_{20}, \ldots, x_{N0})\,2^{-1} + F(0, 0, \ldots, 0)\,2^{-W_d} \tag{4.5}$$

where

$$F_k(x_{1k}, x_{2k}, \ldots, x_{Nk}) = \sum_{i=1}^{N} a_i\,(x_{ik} - \bar{x}_{ik})$$

The function $F_k$ is shown in Table 4.1 for N = 3.

Table 4.1: Address and Contents of ROM

x1  x2  x3    F_k               y1  y2    A/S
0   0   0     -a1 - a2 - a3     0   0     A
0   0   1     -a1 - a2 + a3     0   1     A
0   1   0     -a1 + a2 - a3     1   0     A
0   1   1     -a1 + a2 + a3     1   1     A
1   0   0     +a1 - a2 - a3     1   1     S
1   0   1     +a1 - a2 + a3     1   0     S
1   1   0     +a1 + a2 - a3     0   1     S
1   1   1     +a1 + a2 + a3     0   0     S

Notice that only half the values are needed, since the other half can be obtained by changing the signs. To exploit this redundancy, the address is modified as shown to the right in table 4.1, using equations 4.6 and 4.7.

$$y_1 = x_1 \oplus x_2 \tag{4.6}$$

$$y_2 = x_1 \oplus x_3 \tag{4.7}$$

Here, variable $x_1$ has been selected as the control signal. The add/sub control (i.e., $x_1$)

must also provide the correct addition/subtraction function when the sign bits are accumulated. Therefore, the following add/subtract control signal is used:

$$A/S = x_1 \oplus x_{\text{signbit}} \tag{4.8}$$

where the control signal $x_{\text{signbit}}$ is zero at all times except when the sign bit arrives. Figure 4.4 shows the resulting principle for distributed arithmetic with halved ROM. Only N - 1 variables are used to address the memory. The XOR gates used for halving the memory can be merged with the XOR gates used for inverting the function $F_k$.

Figure 4.4: Distributed arithmetic with smaller ROM

This technique for reducing the memory size can easily be implemented using a small modification of the shift accumulator.

4.3 General FPGA Architecture

Major FPGA specifications include the number of configurable logic blocks (CLBs), the number of fixed function logic blocks, such as multipliers, and the size of memory resources. Although there are many other parts of an FPGA chip, these are typically the most

important when selecting and comparing FPGAs. The configurable blocks of logic, such as slices or logic cells, are made up of two basic things: flip-flops and LUTs. Figure 4.5 shows the different parts of an FPGA.

Figure 4.5: Different Parts of an FPGA

Figure 4.6: Structure of an FPGA

The structure of an FPGA is array based, meaning that each chip comprises a two dimensional array of logic blocks that can be interconnected via horizontal and vertical

routing channels. An illustration of this type of architecture is shown in figure 4.6. The CLB is based on LUTs. A LUT is a small one bit wide memory array, where the address lines for the memory are inputs of the logic block and the one bit output from the memory is the LUT output. A LUT with K inputs therefore corresponds to a $2^K \times 1$ bit memory and can realize any logic function of its K inputs by programming the logic function's truth table directly into the memory.

4.3.1 Stratix II FPGAs

The Stratix II family of FPGAs is based on a 1.5 V, 0.13 μm, all layer copper SRAM process, with densities of up to 79,040 logic elements (LEs) and up to 7.5 Mbits of RAM (Altera publication, 2002). Stratix devices offer up to 22 digital signal processing (DSP) blocks with up to 176 (9-bit × 9-bit) embedded multipliers, optimized for DSP applications that enable efficient implementation of high performance filters. Stratix devices support various I/O standards and also offer a complete clock management solution with a hierarchical clock structure with up to 420 MHz performance.

Stratix devices contain a two dimensional row and column based architecture to implement custom logic. A series of column and row interconnects of varying length and speed provide signal interconnects between logic array blocks (LABs), memory block structures, and DSP blocks. The logic array consists of LABs, with 10 logic elements (LEs) in each LAB. An LE is a small unit of logic providing efficient implementation of user logic functions. LABs are grouped into rows and columns across the device.

M512 RAM blocks are simple dual port memory blocks with 512 bits. These blocks provide dedicated simple dual port or single port memory up to 18 bits wide. M512 blocks are grouped into columns across the device in between certain LABs. M4K RAM blocks are dual port memory blocks with 4K bits plus parity (4,608 bits). These blocks provide dedicated dual port, simple dual port, or single port memory up to 36 bits wide. These

blocks are grouped into columns across the device in between certain LABs. M-RAM blocks are dual port memory blocks with 512K bits. These blocks provide dedicated dual port, simple dual port, or single port memory up to 144 bits wide. Several M-RAM blocks are located individually or in pairs within the device's logic array.

DSP blocks can implement up to either eight full-precision 9 × 9-bit multipliers, four full-precision 18 × 18-bit multipliers, or one full-precision 36 × 36-bit multiplier with add or subtract features. These blocks also contain 18-bit input shift registers for digital signal processing applications, including FIR and infinite impulse response (IIR) filters. DSP blocks are grouped into two columns in each device (Altera publication, 2002).

Figure 4.7: Block Diagram of Stratix II FPGA

Each Stratix device I/O pin is fed by an I/O element (IOE) located at the end of LAB rows and columns around the periphery of the device. I/O pins support numerous single ended and differential I/O standards. Each IOE contains a bidirectional I/O buffer and six registers for registering input, output, and output enable signals. The number of M512 RAM, M4K RAM, and DSP blocks varies by device, along with the row and column numbers and the number of M-RAM blocks.

4.3.1.1 Logic Array Blocks (LABs)

The LAB local interconnect can drive LEs within the same LAB. The LAB local interconnect is driven by column and row interconnects and LE outputs within the same LAB (Altera publication, 2002).

Figure 4.8: Stratix LAB Structure

Neighbouring LABs, M512 RAM blocks, M4K RAM blocks, or DSP blocks from the left and right can also drive an LAB's local interconnect through the direct link connection. The direct link connection feature minimizes the use of row and column interconnects,

providing higher performance and flexibility. Each LE can drive 30 other LEs through fast local and direct link interconnects.

Each LAB contains dedicated logic for driving control signals to its LEs. The control signals include two clocks, two clock enables, two asynchronous clears, synchronous clear, asynchronous preset/load, synchronous load, and add/subtract control signals. This gives a maximum of 10 control signals at a time. Although synchronous load and clear signals are generally used when implementing counters, they can also be used with other functions. Each LAB's clock and clock enable signals are linked. If the LAB uses both the rising and falling edges of a clock, it also uses both LAB clock signals. Deasserting the clock enable signal turns off the LAB clock. Each LAB can use two asynchronous clear signals and an asynchronous load/preset signal. The asynchronous load acts as a preset when the asynchronous load data input is tied high. With the LAB addnsub control signal (see figure 4.9), a single LE can implement a one bit adder and subtractor. This saves LE resources and improves performance for logic functions such as DSP correlators and signed multipliers that alternate between addition and subtraction depending on data.

4.3.1.2 Logic Elements (LEs)

The smallest unit of logic in the Stratix architecture, the LE, is compact and provides advanced features with efficient logic utilization. Each LE contains a four-input LUT, which is a function generator that can implement any function of four variables (Altera publication, 2002). In addition, each LE contains a programmable register and carry chain with carry select capability. A single LE also supports dynamic single bit addition or subtraction mode selectable by an LAB-wide control signal. Each LE drives all types of interconnects: local, row, column, LUT chain, register chain, and direct link interconnects. Each LE's programmable register can be configured for D, T, JK or SR operation.
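The statement that a four-input LUT can implement any function of four variables can be illustrated with a small software model. The sketch below treats the LUT as a 16 × 1-bit memory addressed by its inputs; the helper names `make_lut` and `lut_read` and the two example functions are hypothetical and are not taken from the Altera documentation.

```python
def make_lut(func, k=4):
    """Build the 2**k-entry truth table of a k-input Boolean function."""
    table = []
    for addr in range(2 ** k):
        # Bit i of the address corresponds to LUT input i.
        bits = [(addr >> i) & 1 for i in range(k)]
        table.append(func(*bits) & 1)
    return table


def lut_read(table, *inputs):
    """Evaluate the LUT: the inputs form the address of a 1-bit wide memory."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return table[addr]


# Hypothetical example functions, chosen only for illustration.
parity_lut = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
and_or_lut = make_lut(lambda a, b, c, d: (a & b) | (c & d))

if __name__ == "__main__":
    print(len(parity_lut))                  # number of 1-bit memory words
    print(lut_read(and_or_lut, 1, 1, 0, 0))
```

Changing the function only changes the 16 stored bits, which is why the same LE hardware can realize any four-variable logic function.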

Figure 4.9: Block Diagram of Stratix LE

Each register has data, true asynchronous load data, clock, clock enable, clear, and asynchronous load/preset inputs. Global signals, general-purpose I/O pins, or any internal logic can drive the register's clock and clear control signals. Either general purpose I/O pins or internal logic can drive the clock enable, preset, asynchronous load, and asynchronous data. The asynchronous load data input comes from the data3 input of the LE. Each LE has three outputs that drive the local, row, and column routing resources. The LUT or register output can drive these three outputs independently. Two LE outputs drive column or row and direct link routing connections and one drives local interconnect resources. This allows the LUT to drive one output while the register drives another output. This improves device utilization because the device can use the register and the LUT for unrelated functions.

4.3.1.3 TriMatrix Memory

TriMatrix memory consists of three types of RAM blocks: M512, M4K, and M-RAM blocks (Altera publication, 2002). Although these memory blocks are different, they

all can implement various types of memory with or without parity, including true dual port, simple dual port, and single port RAM, ROM, and FIFO buffers. The largest TriMatrix memory block, the M-RAM block, is useful for applications where a large volume of data must be stored on-chip. The M-RAM block can be configured in true dual port RAM, simple dual port RAM, single port RAM and FIFO RAM mode. Only synchronous operation is supported in the M-RAM block. The memory address and output width can be configured as 64K × 8 bits, 32K × 16 bits, 16K × 32 bits, 8K × 64 bits, and 4K × 128 bits. Mixed width configurations are also possible, allowing different read and write widths.

4.3.1.4 Digital Signal Processing Block

The most commonly used DSP functions are finite impulse response (FIR) filters, complex FIR filters, infinite impulse response (IIR) filters, fast Fourier transform (FFT) functions and discrete cosine transform (DCT) functions. Additionally, some applications need specialized operations such as multiply-add and multiply-accumulate operations. Stratix devices provide DSP blocks to meet the arithmetic requirements of these functions. Each Stratix device has two columns of DSP blocks to efficiently implement DSP functions faster than LE-based implementations. Each DSP block can be configured to support up to eight 9 × 9-bit multipliers, four 18 × 18-bit multipliers or one 36 × 36-bit multiplier (Altera publication, 2002).

As indicated, the Stratix DSP block can support one 36 × 36-bit multiplier in a single DSP block. This is true for any matched sign multiplications, but the capabilities for dynamic and mixed sign multiplications are handled differently. The largest functions that can fit into a single DSP block are 36 × 36-bit unsigned by unsigned multiplication, 36 × 36-bit signed by signed multiplication, 35 × 36-bit unsigned by signed multiplication, 36 × 35-bit signed by unsigned multiplication, 36 × 35-bit signed by

dynamic sign multiplication, 35 × 36-bit dynamic sign by signed multiplication, 35 × 36-bit unsigned by dynamic sign multiplication, 36 × 35-bit dynamic sign by unsigned multiplication, 35 × 35-bit dynamic sign multiplication when the sign controls for each operand are different, or 36 × 36-bit dynamic sign multiplication when the same sign control is used for both operands.

DSP block multipliers can optionally feed an adder/subtractor or accumulator within the block depending on the configuration. This makes routing to LEs easier, saves LE routing resources, and increases performance, because all connections and blocks are within the DSP block. The DSP block registers can also be used efficiently to implement shift registers for FIR filter applications.

4.3.1.5 Modes of Operation

The adder, subtractor, and accumulate functions of a DSP block have simple multiplier, multiply accumulator and multipliers adder modes of operation. In simple multiplier mode, shown in figure 4.10, the DSP block drives the multiplier sub-block result directly to the output with or without an output register. Up to four 18 × 18-bit multipliers or eight 9 × 9-bit multipliers can drive their results directly out of one DSP block. DSP blocks can also implement one 36 × 36-bit multiplier in multiplier mode. DSP blocks use four 18 × 18-bit multipliers combined with dedicated adder and internal shift circuitry to achieve 36-bit multiplication.

In MAC mode, the DSP block drives multiplied results to the adder/subtractor/accumulator block configured as an accumulator, as shown in figure 4.11. Two multiply-accumulators of up to 18 × 18 bits can be implemented in one DSP block. The first and third multiplier sub-blocks are unused in this mode, because only one multiplier can feed one of the two accumulators. The multiply accumulator output can be up to 52 bits. The addnsub signal can set the accumulator for decimation and the overflow signal indicates an underflow condition (Altera publication, 2002).
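The composition of one 36-bit multiplication from four 18 × 18-bit multiplications plus shift and add stages can be sketched arithmetically. The example below is a minimal illustration for unsigned operands only; the function name `mul36_from_18x18` is hypothetical, and the code does not model the actual internal wiring of the DSP block, which the text describes only at this level of detail.

```python
MASK18 = (1 << 18) - 1


def mul36_from_18x18(a, b):
    """Multiply two unsigned 36-bit integers using four 18x18-bit products."""
    assert 0 <= a < (1 << 36) and 0 <= b < (1 << 36)
    a_hi, a_lo = a >> 18, a & MASK18
    b_hi, b_lo = b >> 18, b & MASK18
    # Four 18x18-bit multiplications, one per multiplier sub-block.
    p_ll = a_lo * b_lo
    p_lh = a_lo * b_hi
    p_hl = a_hi * b_lo
    p_hh = a_hi * b_hi
    # Shift-and-add stage: weight each partial product by its bit position.
    return p_ll + ((p_lh + p_hl) << 18) + (p_hh << 36)


if __name__ == "__main__":
    a, b = 0x123456789, 0xFEDCBA987
    print(mul36_from_18x18(a, b) == a * b)
```

Signed and mixed-sign operation would need sign handling on the upper halves, which is omitted here.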
For FIR filters, the DSP block combines the four multipliers adder mode with the shift register inputs.

Figure 4.10: Block Diagram of DSP block in Simple Multiplier Mode

Figure 4.11: Block Diagram of DSP block in Multiply Accumulate Mode

One set of shift inputs contains the filter data, while the other holds the coefficients loaded in serial or parallel. The input shift register eliminates the need for shift registers external to the DSP block. This architecture simplifies filter design since the DSP block implements all of the filter circuitry. One DSP block can implement an entire 18-bit FIR filter with up to four taps.

Figure 4.12: Block Diagram of DSP block in Four Multiplier Adder Mode

For filters with more taps, DSP blocks can be cascaded accordingly (Altera publication, 2002).

4.3.1.6 I/O Structure

The IOE in Stratix devices contains a bidirectional I/O buffer, six registers and a latch for a complete embedded bidirectional single data rate or DDR transfer. As shown in figure 4.13, the IOE contains two input registers with a latch, two output registers and two output enable registers. The design can use both input registers and the latch to capture DDR input and both output registers to drive DDR outputs.

Figure 4.13: Stratix IOE structure

Additionally, the design can use the output enable register for fast clock to output enable timing. The negative edge-clocked OE register is used for DDR SDRAM interfacing. The

Quartus II software automatically duplicates a single OE register that controls multiple output or bidirectional pins. The IOEs are located in I/O blocks around the periphery of the Stratix device. There are up to four IOEs per row I/O block and six IOEs per column I/O block. The row I/O blocks drive row, column, or direct link interconnects. The column I/O blocks drive column interconnects (Altera publication, 2002).

Although efficient use of the FPGA architecture already reduces resource usage, DA with a suitable structural implementation can improve the FPGA design further.

4.4 Distributed Arithmetic FIR Filter

As discussed in chapter 3, FIR filters have the advantages of linear phase, high stability, fewer finite precision errors and efficient implementation. However, they require a higher order, i.e. more coefficients, than a comparable IIR filter. This high order increases the hardware requirements, arithmetic operations, area usage and power consumption when designing and fabricating the filter. Reducing these parameters is therefore a major objective, which can be attained through efficient use of DA in the FPGA implementation. Mathematically, an FIR filter can be written as

$$y[n] = \sum_{k=0}^{N} a_k\, x[n-k] \tag{4.9}$$

In Equation 4.9, x[n] represents the input, y[n] represents the filter output and $a_k$ represents the filter coefficients. This filter is of Nth order and contains N+1 taps. Equation 4.9 can be implemented conventionally by using multipliers, adders and delay elements as shown in figure 4.14. The delay elements can be implemented using memory elements, and at any time only the N most recent inputs need to be stored (Chang, T. S. and Jen, C. W., 1999). However, implementing the FIR filter in this manner is expensive, because a filter of order N consumes N+1 MAC units, which is costly for high filter orders.
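The cost of the conventional structure can be made concrete with a short software model of Equation 4.9. The sketch below is illustrative only: the function name `fir_direct` and the three example coefficients are hypothetical, and the counter simply tallies one multiply-accumulate per tap per output sample.

```python
def fir_direct(coeffs, samples):
    """Direct-form FIR: y[n] = sum over k of a_k * x[n-k], with x[n] = 0 for n < 0.

    Returns the output samples and the number of multiply-accumulates used.
    """
    n_taps = len(coeffs)
    delay_line = [0.0] * n_taps          # holds x[n], x[n-1], ..., x[n-N]
    outputs = []
    mac_count = 0
    for x in samples:
        delay_line = [x] + delay_line[:-1]
        acc = 0.0
        for a, d in zip(coeffs, delay_line):
            acc += a * d                 # one multiply-accumulate per tap
            mac_count += 1
        outputs.append(acc)
    return outputs, mac_count


if __name__ == "__main__":
    # Hypothetical 3-tap filter (order N = 2) driven by a unit impulse.
    y, macs = fir_direct([0.25, 0.5, 0.25], [1.0, 0.0, 0.0, 0.0])
    print(y, macs)
```

In hardware the N+1 multiplications of each output are usually performed by N+1 parallel MAC units, which is the resource cost that DA is intended to avoid.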

Figure 4.14: Conventional method for FIR Filter Implementation

To overcome this problem of high MAC unit requirements, the DA architecture can be used, which is very efficient in implementing the sum of products (SOP) (Stanley A. White, 1989). DA implements MAC operations using LUTs/ROMs instead of dedicated multipliers. DA is bit serial in nature, and parallel implementations can be developed by using serial DA FIRs in parallel. Let the input variable x[n-k], which is in 2's complement fixed point fractional format, contain M bits and let |x[n-k]| < 1. It can then be expressed as

$$x[n-k] = -x_{k,0} + \sum_{m=1}^{M-1} x_{k,m}\, 2^{-m} \tag{4.10}$$

In Equation 4.10, $x_{k,0}$ is the Most Significant Bit (MSB) or sign bit and $x_{k,M-1}$ is the Least Significant Bit (LSB) of the M bit variable x[n-k]. It must be noted that the $x_{k,m}$ are binary variables and can only assume the values 0 or 1. Substituting Equation 4.10 in Equation 4.9, we get

$$y[n] = -\sum_{k=0}^{N} x_{k,0}\, a_k + \sum_{k=0}^{N} \sum_{m=1}^{M-1} x_{k,m}\, a_k\, 2^{-m} \tag{4.11}$$
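The bit expansion of Equation 4.10 can be checked numerically. The following sketch quantizes a value to an M-bit two's complement fraction and then rebuilds it from its bits; the helper names `to_bits` and `from_bits` are hypothetical and are used only for this illustration.

```python
def to_bits(value, m_bits):
    """Quantize -1 <= value < 1 to an M-bit two's complement fraction.

    Returns the bits MSB first, so bits[0] is the sign bit x_{k,0}.
    """
    scale = 1 << (m_bits - 1)
    code = int(round(value * scale))
    assert -scale <= code < scale, "value out of range"
    code &= (1 << m_bits) - 1            # wrap negative codes into two's complement
    return [(code >> (m_bits - 1 - m)) & 1 for m in range(m_bits)]


def from_bits(bits):
    """Evaluate x = -x_0 + sum over m = 1..M-1 of x_m * 2**-m."""
    return -bits[0] + sum(b * 2.0 ** -m for m, b in enumerate(bits) if m >= 1)


if __name__ == "__main__":
    bits = to_bits(-0.625, 4)
    print(bits, from_bits(bits))
```

The sign bit enters with a negative weight, which is why the first sum in Equation 4.11 is subtracted rather than added.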

Equation 4.11 can be expanded and rearranged as

$$
\begin{aligned}
y[n] = {} & -\left[x_{0,0}\,a_0 + x_{1,0}\,a_1 + x_{2,0}\,a_2 + \cdots + x_{N,0}\,a_N\right] \\
& + \left[x_{0,1}\,a_0 + x_{1,1}\,a_1 + x_{2,1}\,a_2 + \cdots + x_{N,1}\,a_N\right] 2^{-1} \\
& + \left[x_{0,2}\,a_0 + x_{1,2}\,a_1 + x_{2,2}\,a_2 + \cdots + x_{N,2}\,a_N\right] 2^{-2} \\
& + \cdots \\
& + \left[x_{0,M-1}\,a_0 + x_{1,M-1}\,a_1 + x_{2,M-1}\,a_2 + \cdots + x_{N,M-1}\,a_N\right] 2^{-(M-1)}
\end{aligned}
\tag{4.12}
$$

In Equation 4.12, each product inside the square brackets denotes a logical AND operation and each plus sign denotes arithmetic addition. The negative powers of 2, which appear outside the brackets, can be implemented simply by shifting the results of the computation to the right. So the MAC operations in Equation 4.9 are now converted to addition, subtraction, shifting and logical AND operations (Stanley A. White, 1989).

Bits of the input variable can be used to address the LUT. A serial DA FIR filter can be constructed using a single LUT and time sharing it to process all the bits. Input shift registers (ISR) are required to supply bits serially to the LUT in the serial DA FIR filter shown in figure 4.15. Bits are output from the ISR MSB first. To construct the parallel DA FIR filter shown in figure 4.16, M LUTs are required. The 1st bits of all the inputs are connected to the 1st LUT, the 2nd bits of all the inputs are connected to the 2nd LUT and so on (Tyler J. Moeller and David R. Martinez, 1999). The parallel filter produces one output every clock cycle whereas the serial filter produces one output every M clock cycles. The addresses and LUT contents are calculated from equation 4.13 and shown in table 4.2.

$$F = x_{0,0}\,a_0 + x_{1,0}\,a_1 + x_{2,0}\,a_2 \tag{4.13}$$
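The serial DA scheme described above can be sketched in software. The example below is a minimal model under stated assumptions: it uses integer (scaled) coefficients and M-bit two's complement integer inputs instead of fractions, so the accumulator shifts left as in the figure 4.2 description rather than scaling by negative powers of two. The names `build_lut`, `da_dot` and `da_fir` and the example coefficients are hypothetical.

```python
def build_lut(coeffs):
    """LUT contents as in equation 4.13: each address selects a sum of coefficients."""
    n = len(coeffs)
    lut = []
    for addr in range(1 << n):
        total = 0
        for k, a in enumerate(coeffs):
            if (addr >> (n - 1 - k)) & 1:    # x_k is the k-th address bit, MSB first
                total += a
        lut.append(total)
    return lut


def da_dot(coeffs, xs, m_bits):
    """Bit-serial DA inner product of coeffs with M-bit two's complement ints xs."""
    lut = build_lut(coeffs)
    mask = (1 << m_bits) - 1
    words = [x & mask for x in xs]
    acc = 0
    for bit in range(m_bits - 1, -1, -1):    # MSB (sign bit) is processed first
        addr = 0
        for w in words:
            addr = (addr << 1) | ((w >> bit) & 1)
        if bit == m_bits - 1:
            acc = (acc << 1) - lut[addr]     # sign-bit cycle: subtract
        else:
            acc = (acc << 1) + lut[addr]     # other cycles: shift and add
    return acc


def da_fir(coeffs, samples, m_bits):
    """Slide the DA inner product over the input stream (no multipliers used)."""
    history = [0] * len(coeffs)
    out = []
    for x in samples:
        history = [x] + history[:-1]
        out.append(da_dot(coeffs, history, m_bits))
    return out


if __name__ == "__main__":
    # Hypothetical 3-tap example with 4-bit inputs.
    print(da_dot([3, -2, 5], [7, -8, -1], 4))
```

One LUT lookup and one add or subtract are performed per input bit, so an output takes M cycles; a parallel structure would instead instantiate M LUTs and combine their outputs in one cycle.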

Table 4.2: Address and Contents of an LUT

x0,0  x1,0  x2,0    Contents
0     0     0       0
0     0     1       a2
0     1     0       a1
0     1     1       a2 + a1
1     0     0       a0
1     0     1       a0 + a2
1     1     0       a0 + a1
1     1     1       a0 + a1 + a2

Figure 4.15: Serial Distributed Arithmetic FIR Filter

Since all channels have the same filtering requirements, a multi channel DA FIR filter can be constructed by time sharing LUTs across data from multiple channels. For a multi channel DA FIR filter, the amount of memory required to store input variables is larger, since the input variables of multiple streams have to be stored, but the logic resources required to compute results are the same as for a single channel filter. Because a serial filter processes input data one bit at a time per clock cycle,

serial structures will require clock cycles equal to the input data width to calculate an output. In contrast, a parallel structure calculates the filter output in a single clock cycle, so parallel structures provide the highest speed performance at the expense of large area. Another option is a multibit serial structure, which combines several small serial FIR filters in parallel to generate the FIR output. This structure provides greater throughput than a standard serial structure while using less area than a fully parallel structure. Thus different architectures can be used depending upon the specific requirement in terms of area or speed.

Figure 4.16: Parallel Distributed Arithmetic FIR Filter

4.5 Design and Implementation of Proposed Digital Up Converter for WiMAX System

In this section, the design and implementation of the proposed DUC for a WiMAX system using DA is presented. Fully serial, multibit serial and fully parallel architectures are implemented and compared in order to choose the best architecture. The

interpolation filters are implemented using a Nyquist FIR design with a direct form polyphase structure. The input sample frequency, passband ripple and stopband attenuation are taken as 11.2 MHz, 0.015 dB and 60 dB respectively. The interpolation factor is taken as 8. The proposed DUC is implemented by cascading a pulse shaping single rate FIR filter, an interpolation by 2 filter and an interpolation by 4 filter. The design and implementation of these three filters are presented in the following subsections.

4.5.1 Design and Implementation of Pulse Shaping Single Rate FIR Filter

In the DUC, the pulse shaping filter is used to attenuate out of band power in order to meet the spectral mask requirement. A root raised cosine (RRC) filter is well suited to pulse shaping because its transition band response meets the Nyquist criterion. The pulse shaping single rate FIR filter is designed with a roll off factor of 0.25 and a stopband attenuation of 60 dB. The passband and stopband frequencies are taken as 4.65 MHz and 5.35 MHz respectively. The pulse shaping single rate FIR filter is designed and implemented for fully serial, multibit serial and fully parallel architectures. The resources utilized by the different architectures and their performance in terms of speed are shown in tables 4.3 and 4.4.

From table 4.3, it is concluded that for the DA fully serial architecture of the single rate filter, as the number of serial units is increased from 1 to 4, the number of logic cells increases from 3916 to 4051, i.e. an increase of about 3.4%, whereas the number of clock cycles required to process input data and to generate output data decreases from 16 to 4, i.e. the speed increases fourfold. The results for the fully parallel architecture are shown in table 4.4. From table 4.4, it is concluded that the DA fully parallel architecture with pipeline level 1 uses the fewest logic cells among the parallel architectures for the same number of clock cycles.
Comparing the results of tables 4.3 and 4.4 shows that the DA fully serial architecture with 4

Table 4.3: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Serial Interpolator Single Rate Filter with Different Numbers of Serial Units

FPGA Resources                          Serial Units = 1   Serial Units = 2   Serial Units = 4
Logic Cells                             3916               3941               4051
M512                                    1                  1                  1
M4K                                     0                  0                  0
Process Input Data (clock cycles)       16                 8                  4
Generate Output Data (clock cycles)     16                 8                  4

Table 4.4: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Parallel Interpolator Single Rate Filter with Different Levels of Pipelining

FPGA Resources                          Pipeline Level 1   Pipeline Level 2   Pipeline Level 3
Logic Cells                             5137               5749               6505
M512                                    1                  1                  1
M4K                                     0                  0                  0
Process Input Data (clock cycles)       1                  1                  1
Generate Output Data (clock cycles)     1                  1                  1

serial units requires 4051 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 5137 logic cells. The DA fully parallel architecture with pipeline level 1 requires 1 clock cycle to process input data and 1 clock cycle to generate output data, whereas the DA fully serial architecture with 4 serial units requires 4 clock cycles to process input data and 4 clock cycles to generate output data. Thus, compared to the DA fully serial architecture with 4 serial units, the speed of the DA fully parallel architecture with pipeline level 1 increases fourfold at the expense of only about 26.8% more logic cells. As the best results in terms of speed are obtained with the fully parallel architecture with pipeline level 1, this architecture is used for this filter design.

4.5.2 Design and Implementation of Interpolation by 2 FIR Filter

In the interpolation by 2 filter, the input sample rate is 11.2 Msps and the output sample rate is 22.4 Msps. The interpolation by 2 filter is therefore designed with an input sample rate of 11.2 Msps, a passband ripple of 0.015 dB, a stopband attenuation of 60 dB and an interpolation factor of 2. This filter is implemented for fully serial, multibit serial and fully parallel architectures. The resources utilized by the different architectures and their performance in terms of speed are shown in tables 4.5 and 4.6.

From table 4.5, it is concluded that for the DA fully serial architecture of the interpolation by 2 filter, as the number of serial units is increased from 1 to 4, the number of logic cells increases from 523 to 1021, i.e. an increase of approximately 95%, whereas the number of clock cycles required to process input data decreases from 32 to 8 and the number of clock cycles required to generate output data decreases from 16 to 4, i.e. the speed increases fourfold. Table 4.6 shows the results for the fully parallel architecture with pipeline levels 1, 2 and 3.
Pipeline level 1 gives the best results in terms of speed and uses the fewest resources among the fully

Table 4.5: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Serial Interpolation by 2 Filter with Different Numbers of Serial Units

| FPGA Resources | No. of Serial Units = 1 | No. of Serial Units = 2 | No. of Serial Units = 4 |
|---|---|---|---|
| Logic Cells | 523 | 697 | 1021 |
| M512 | 2 | 2 | 2 |
| M4K | 2 | 4 | 8 |
| Process Input Data (clock cycles) | 32 | 16 | 8 |
| Generate Output Data (clock cycles) | 16 | 8 | 4 |

Table 4.6: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Parallel Interpolation by 2 Filter with Different Levels of Pipelining

| Resources | Pipeline Level 1 | Pipeline Level 2 | Pipeline Level 3 |
|---|---|---|---|
| Logic Cells | 1890 | 2000 | 3716 |
| M512 | 2 | 2 | 2 |
| M4K | 18 | 18 | 18 |
| Process Input Data (clock cycles) | 2 | 2 | 2 |
| Generate Output Data (clock cycles) | 1 | 1 | 1 |

parallel architectures. On comparing the results of tables 4.5 and 4.6, it is concluded that the DA fully serial architecture with 4 serial units requires 1021 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 1890 logic cells. Also, the DA fully parallel architecture with pipeline level 1 requires 2 clock cycles to process input data and 1 clock cycle to generate output data, whereas the DA fully serial architecture with 4 serial units requires 8 clock cycles to process input data and 4 clock cycles to generate output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA fully parallel architecture with pipeline level 1 is four times faster at the expense of about 85% more logic cells.

4.5.3 Design and Implementation of Interpolation by 4 FIR Filter

In the DUC, after the signal has been interpolated by 2, it is interpolated by 4 to obtain the required overall interpolation factor of 8. The input sample rate for the interpolation by 4 filter is 22.4 Msps, the passband ripple is 0.015 dB and the stopband attenuation is 60 dB. This interpolation by 4 filter is designed and implemented for fully serial, multibit serial and fully parallel architectures. The resources utilized by the different architectures and their performance in terms of speed are shown in tables 4.7 and 4.8. From table 4.7, it is concluded that for the DA fully serial architecture of the interpolation by 4 filter, as the number of serial units is increased from 1 to 4, the number of logic cells increases from 584 to 818, i.e. an increase of approximately 40%. At the same time, the number of clock cycles required to process input data decreases from 64 to 16 and the number of clock cycles required to generate output data decreases from 16 to 4, i.e. the speed increases fourfold.
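The sample rates in the DUC cascade follow directly from the interpolation factors. As a quick check, the sketch below multiplies the 11.2 Msps input rate through the interpolation by 2 and interpolation by 4 stages; the helper name `rate_chain` is hypothetical, and exact fractions are used only to avoid floating point rounding in the comparison.

```python
from fractions import Fraction
from math import prod


def rate_chain(input_rate_msps: Fraction, factors: list[int]) -> list[Fraction]:
    """Return the sample rate at the input and after each interpolation stage."""
    rates = [input_rate_msps]
    for factor in factors:
        rates.append(rates[-1] * factor)
    return rates


# Interpolation by 2 followed by interpolation by 4, as described in the text.
stage_factors = [2, 4]
rates = rate_chain(Fraction("11.2"), stage_factors)
overall_factor = prod(stage_factors)

for stage, rate in enumerate(rates):
    print(f"after stage {stage}: {float(rate)} Msps")
print(f"overall interpolation factor: {overall_factor}")
```

The final rate of 89.6 Msps is the same as the input sample rate taken for the DDC in section 4.6, where the factors are applied in reverse as decimation.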
From table 4.8, it is concluded that for the DA fully parallel architecture of the interpolation by 4 filter, pipeline level 1 provides the best speed with the fewest resources among all pipeline levels. On comparing the results of tables 4.7 and 4.8, it is concluded that the DA fully serial

Table 4.7: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Serial Interpolation by 4 Filter with Different Numbers of Serial Units

| FPGA Resources | No. of Serial Units = 1 | No. of Serial Units = 2 | No. of Serial Units = 4 |
|---|---|---|---|
| Logic Cells | 584 | 654 | 818 |
| M512 | 1 | 1 | 1 |
| M4K | 1 | 1 | 1 |
| Process Input Data (clock cycles) | 64 | 32 | 16 |
| Generate Output Data (clock cycles) | 16 | 8 | 4 |

Table 4.8: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Parallel Interpolation by 4 Filter with Different Levels of Pipelining

| Resources | Pipeline Level 1 | Pipeline Level 2 | Pipeline Level 3 |
|---|---|---|---|
| Logic Cells | 1038 | 1232 | 2172 |
| M512 | 1 | 1 | 1 |
| M4K | 6 | 6 | 6 |
| Process Input Data (clock cycles) | 4 | 4 | 4 |
| Generate Output Data (clock cycles) | 1 | 1 | 1 |

architecture with 4 serial units requires 818 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 1038 logic cells. Also, the DA fully parallel architecture with pipeline level 1 requires 4 clock cycles to process input data and 1 clock cycle to generate output data, whereas the DA fully serial architecture with 4 serial units requires 16 clock cycles to process input data and 4 clock cycles to generate output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA fully parallel architecture with pipeline level 1 is four times faster at the expense of about 27% more logic cells.

Figure 4.17: Logic cells used by different stages of DUC with different number of serial units for fully serial DA architecture

The variation in the number of logic cells used by the pulse shaping, interpolation by 2 and interpolation by 4 filters is shown in figure 4.17 for the fully serial DA architecture with different numbers of serial units, and in figure 4.18 for the fully parallel DA architecture with different pipeline levels. From the above discussion, it is concluded that for implementing the different stages, the fully parallel DA architecture with pipeline level 1 provides high speed with a moderate area requirement. So, in the proposed design, the fully

parallel DA architecture with pipeline level 1 is used to implement all the interpolator stages of the DUC for the WiMAX system.

Figure 4.18: Logic cells used by different stages of DUC with different levels of pipelining for fully parallel DA architecture

4.6 Design and Implementation of Proposed Digital Down Converter for WiMAX System

In this section, the design and implementation of the proposed DDC for the WiMAX system using DA is presented. Fully serial, multibit serial and fully parallel architectures are implemented and compared in order to choose the best architecture. The decimation filters are implemented using a Nyquist FIR design with a direct form polyphase structure. The input sample rate, passband ripple and stopband attenuation are taken as 89.6 Msps, 0.015 dB and 60 dB respectively. The overall decimation factor is taken as 8. The proposed DDC is implemented by cascading the decimation by 4 filter, the decimation by 2 filter and the decimation channel filter. The design and implementation of these decimation by 4 filter,

decimation by 2 and channel filters are presented in the following subsections.

4.6.1 Design and Implementation of Decimation by 4 FIR Filter

The decimation by 4 filter reduces the sample rate from 89.6 Msps to 22.4 Msps. The design specifications are a stopband attenuation of 60 dB, a passband ripple of 0.015 dB and a decimation factor of 4. This decimation by 4 filter is designed and implemented for fully serial, multibit serial and fully parallel architectures. The resources utilized by the different architectures and their performance in terms of speed are shown in tables 4.9 and 4.10.

Table 4.9: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Serial Decimation by 4 Filter with Different Numbers of Serial Units

| FPGA Resources | No. of Serial Units = 1 | No. of Serial Units = 2 | No. of Serial Units = 4 |
|---|---|---|---|
| Logic Cells | 590 | 660 | 824 |
| M512 | 0 | 0 | 0 |
| M4K | 1 | 1 | 1 |
| Process Input Data (clock cycles) | 16 | 8 | 4 |
| Generate Output Data (clock cycles) | 64 | 32 | 16 |

From table 4.9, it is concluded that for the DA fully serial architecture of the decimation by 4 filter, as the number of serial units is increased from 1 to 4, the number of logic cells increases from 590 to 824, i.e. an increase of approximately 40%. However, the number of clock cycles required to process input data decreases from 16 to 4 and the number of clock cycles required to generate output data decreases from 64 to 16, i.e. the speed increases fourfold. From table 4.10, it is concluded that the DA fully parallel

architecture with pipeline level 1 outperforms the other pipeline levels.

Table 4.10: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Parallel Decimation by 4 Filter with Different Levels of Pipelining

| Resources | Pipeline Level 1 | Pipeline Level 2 | Pipeline Level 3 |
|---|---|---|---|
| Logic Cells | 1044 | 1238 | 2180 |
| M512 | 0 | 0 | 0 |
| M4K | 6 | 6 | 6 |
| Process Input Data (clock cycles) | 1 | 1 | 1 |
| Generate Output Data (clock cycles) | 4 | 4 | 4 |

On comparing the results of tables 4.9 and 4.10, it is concluded that the DA fully serial architecture with 4 serial units requires 824 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 1044 logic cells. Also, the DA fully parallel architecture with pipeline level 1 requires 1 clock cycle to process input data and 4 clock cycles to generate output data, whereas the DA fully serial architecture with 4 serial units requires 4 clock cycles to process input data and 16 clock cycles to generate output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA fully parallel architecture with pipeline level 1 is four times faster at the expense of about 27% more logic cells. So this filter is implemented with the DA fully parallel architecture with pipeline level 1.

4.6.2 Design and Implementation of Decimation by 2 FIR Filter

In the DDC, the decimation by 4 filter is followed by the decimation by 2 filter. Its function is to reduce the sample rate further by a factor of 2. So the input sample rate for

this filter will be 22.4 Msps and the output sample rate will be 11.2 Msps. The other design specifications, passband ripple and stopband attenuation, are taken as 0.015 dB and 60 dB respectively. This decimation by 2 filter is designed and implemented for fully serial, multibit serial and fully parallel architectures. The resources utilized by the different architectures and their performance in terms of speed are shown in tables 4.11 and 4.12. From table 4.11, it is concluded that for the DA fully serial architecture of the decimation by 2 filter, as the number of serial units is increased from 1 to 4, the number of logic cells increases from 526 to 1024, i.e. an increase of approximately 95%. At the same time, the number of clock cycles required to process input data decreases from 16 to 4 and the number of clock cycles required to generate output data decreases from 32 to 8, i.e. the speed increases fourfold.

Table 4.11: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Serial Decimation by 2 Filter with Different Numbers of Serial Units

| FPGA Resources | No. of Serial Units = 1 | No. of Serial Units = 2 | No. of Serial Units = 4 |
|---|---|---|---|
| Logic Cells | 526 | 700 | 1024 |
| M512 | 1 | 1 | 1 |
| M4K | 2 | 4 | 8 |
| Process Input Data (clock cycles) | 16 | 8 | 4 |
| Generate Output Data (clock cycles) | 32 | 16 | 8 |

From table 4.12, it can be seen that the DA fully parallel architecture with pipeline level 1 provides the best performance in terms of speed with fewer resources than the other parallel structures. On comparing the results of tables 4.11 and 4.12, it

is concluded that the DA fully serial architecture with 4 serial units requires 1024 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 1893 logic cells. Also, the DA fully parallel architecture with pipeline level 1 requires 1 clock cycle to process input data and 2 clock cycles to generate output data, whereas the DA fully serial architecture with 4 serial units requires 4 clock cycles to process input data and 8 clock cycles to generate output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA fully parallel architecture with pipeline level 1 is four times faster at the expense of about 85% more logic cells. So the decimation by 2 filter is designed with the fully parallel architecture with pipeline level 1.

Table 4.12: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Parallel Decimation by 2 Filter with Different Levels of Pipelining

| Resources | Pipeline Level 1 | Pipeline Level 2 | Pipeline Level 3 |
|---|---|---|---|
| Logic Cells | 1893 | 2003 | 3719 |
| M512 | 1 | 1 | 1 |
| M4K | 18 | 18 | 18 |
| Process Input Data (clock cycles) | 1 | 1 | 1 |
| Generate Output Data (clock cycles) | 2 | 2 | 2 |

4.6.3 Design and Implementation of Decimation Channel Filter

In the DDC, the channel filter is used after the decimation by 2 filter. The main function of this filter is to provide stopband attenuation to remove adjacent channel interference. In

addition, it also has to keep the passband ripple within range. For this filter, an RRC filter with Nyquist design is used, with a roll-off factor of 0.25 and a stopband attenuation of 60 dB. This decimation channel filter is designed and implemented for fully serial, multibit serial and fully parallel architectures. The resources utilized by the different architectures and their performance in terms of speed are shown in tables 4.13 and 4.14.

Table 4.13: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Serial Decimator Channel Filter with Different Numbers of Serial Units

| FPGA Resources | No. of Serial Units = 1 | No. of Serial Units = 2 | No. of Serial Units = 4 |
|---|---|---|---|
| Logic Cells | 2093 | 2147 | 2255 |
| M512 | 1 | 1 | 1 |
| M4K | 0 | 0 | 0 |
| Process Input Data (clock cycles) | 16 | 8 | 4 |
| Generate Output Data (clock cycles) | 16 | 8 | 4 |

From table 4.13, it is concluded that for the DA fully serial architecture of the single rate channel filter of the DDC, as the number of serial units is increased from 1 to 4, the number of logic cells increases from 2093 to 2255, i.e. an increase of approximately 8%. At the same time, the number of clock cycles required to process input data and to generate output data decreases from 16 to 4, i.e. the speed increases fourfold. From table 4.14, it is concluded that for the DA fully parallel architecture of the single rate channel filter, the pipeline level 1 parallel structure provides the best performance among the pipeline levels in

terms of speed with lesser area.

Table 4.14: Comparison of FPGA Resource Utilization by Distributed Arithmetic Fully Parallel Decimator Channel Filter with Different Levels of Pipelining

| Resources | Pipeline Level 1 | Pipeline Level 2 | Pipeline Level 3 |
|---|---|---|---|
| Logic Cells | 3148 | 3613 | 4319 |
| M512 | 1 | 1 | 1 |
| M4K | 0 | 0 | 0 |
| Process Input Data (clock cycles) | 1 | 1 | 1 |
| Generate Output Data (clock cycles) | 1 | 1 | 1 |

On comparing the results of tables 4.13 and 4.14, it is concluded that the DA fully serial architecture with 4 serial units requires 2255 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 3148 logic cells. Also, the DA fully parallel architecture with pipeline level 1 requires 1 clock cycle to process input data and 1 clock cycle to generate output data, whereas the DA fully serial architecture with 4 serial units requires 4 clock cycles to process input data and 4 clock cycles to generate output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA fully parallel architecture with pipeline level 1 is four times faster at the expense of about 40% more logic cells. So this filter is designed with the DA fully parallel architecture with pipeline level 1. The variation in the number of logic cells used by the decimation by 4, decimation by 2 and decimation channel filters for the fully serial DA architecture with different number of

serial units is shown in figure 4.19, and for the fully parallel DA architecture with different pipeline levels in figure 4.20.

Figure 4.19: Logic cells used by different stages of DDC with different number of serial units for fully serial DA architecture

Figure 4.20: Logic cells used by different stages of DDC with different levels of pipelining for fully parallel DA architecture

From these discussions, it is concluded that the fully parallel DA architecture with pipeline level 1 has high speed with

moderate area requirement. So, in the proposed design, the fully parallel DA architecture with pipeline level 1 is used to implement all decimator stages of the DDC for the WiMAX system. Overall, the fully parallel DA architecture with pipeline level 1 is therefore used to implement all interpolator and decimator stages of the DUC and DDC for the WiMAX system.

4.7 Conclusions

Owing to their high performance and their ability to implement DSP functions efficiently, FPGAs are a good choice for increasing the performance of broadband communication systems such as WiMAX. The availability of high level design tools also helps to reduce the design cycle for FPGA implementation. DA can be used to implement low cost LUT based DSP functions in either serial or parallel form. When the number of elements in a vector is the same as the word size, DA results in fast operation. This speed is achieved by replacing multiplications with ROM based LUTs. Decomposition and coding techniques are used to reduce the size of the ROM. FIR filters can be implemented using a serial or parallel DA architecture. A parallel DA FIR filter produces one output every clock cycle, whereas a serial DA FIR filter requires M clock cycles to produce an output. Thus the parallel architecture provides higher speed. The multibit serial architecture is another option, which combines several small serial FIR units in parallel. This architecture provides greater throughput than the standard serial architecture, but less than the parallel architecture. So, to improve the performance in terms of speed, the DA parallel architecture with pipeline level 1 is used for the proposed designs of the interpolation and decimation filters of the DUC and DDC for the WiMAX system.
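The trade-off between serial, multibit serial and parallel DA summarized above can be illustrated with a small software model. The Python sketch below is only a behavioural illustration, not the DSP Builder implementation used in this chapter: it precomputes a LUT of partial coefficient sums, then consumes `bits_per_cycle` bit planes of the input words per modelled clock cycle. The 16-bit word length, the coefficient values and all names are assumptions chosen for illustration; the word length is picked because it reproduces the 16, 8 and 4 cycle pattern of the single rate filter tables.

```python
def build_lut(coeffs: list[int]) -> list[int]:
    """LUT entry `addr` holds the sum of the coefficients whose bit is set in `addr`."""
    taps = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << taps)
    ]


def da_inner_product(coeffs, samples, word_bits=16, bits_per_cycle=1):
    """Compute sum(c * x) with distributed arithmetic.

    Returns (result, cycles). `bits_per_cycle` models the number of serial
    units: 1 is fully serial, `word_bits` is fully parallel.
    """
    if word_bits % bits_per_cycle != 0:
        raise ValueError("bits_per_cycle must divide word_bits")
    lut = build_lut(coeffs)
    mask = (1 << word_bits) - 1
    words = [s & mask for s in samples]  # two's complement encoding
    acc = 0
    cycles = 0
    for base in range(0, word_bits, bits_per_cycle):
        for bit in range(base, base + bits_per_cycle):
            # Gather bit `bit` of every input word into one LUT address.
            addr = 0
            for k, word in enumerate(words):
                addr |= ((word >> bit) & 1) << k
            term = lut[addr] << bit
            # The sign bit of a two's complement word has negative weight.
            acc += -term if bit == word_bits - 1 else term
        cycles += 1
    return acc, cycles


# Hypothetical 4-tap example with signed 16-bit samples.
coeffs = [3, -5, 7, 2]
samples = [100, -200, 300, -400]
direct = sum(c * x for c, x in zip(coeffs, samples))

for units in (1, 2, 4, 16):
    result, cycles = da_inner_product(coeffs, samples, bits_per_cycle=units)
    print(f"bits per cycle = {units:2d}: result = {result}, cycles = {cycles}")
```

Every configuration returns the same inner product as the direct multiply-accumulate; only the number of modelled clock cycles changes, at the cost of more LUT lookups and adders working in parallel.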