
Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete (Master's thesis)

Real-Time Multi-Dimensional Fast Fourier Transforms on FPGAs

Thesis carried out in Computer Engineering at the Institute of Technology at Linköping University
by Andreas Öhlin

LiTH-ISY-EX--15/4854--SE
Linköping 2015

Department of Electrical Engineering, Linköpings universitet, Linköping, Sweden


Real-Time Multi-Dimensional Fast Fourier Transforms on FPGAs

Master's thesis carried out in Computer Engineering at the Institute of Technology in Linköping
by Andreas Öhlin

LiTH-ISY-EX--15/4854--SE

Supervisor: Mario Garrido, ISY, Linköpings universitet
Examiner: Oscar Gustafsson, ISY, Linköpings universitet

Linköping, 12 June, 2015


Division, Department: Division of Computer Engineering, Department of Electrical Engineering, Linköpings universitet, Linköping, Sweden
Report category: Examensarbete (Master's thesis)
ISRN: LiTH-ISY-EX--15/4854--SE
Title: Multidimensionell realtids-FFT på FPGA / Real-Time Multi-Dimensional Fast Fourier Transforms on FPGAs
Author: Andreas Öhlin
Keywords: real-time, multi dimensional, 3D, FFT, SDRAM, DDR2, continuous flow


Abstract

This thesis presents a way of performing multi dimensional FFT in a continuous flow environment by calculating the FFT of each dimension separately in a pipeline. The result is a three dimensional pipelined FFT implemented on a Stratix III FPGA. It can calculate the three dimensional FFT of a data set containing samples with a word size of 32 bits. The biggest challenge and the main part of the work are the data permutations in between the one dimensional FFT modules. This part of the design makes use of an external DDR2 SDRAM as well as on-chip BRAM to store and permute data between the modules. The evaluations show that the design is hardware efficient and that the latency is relatively low, determined to be 84.2 ms.

Sammanfattning

Den här uppsatsen presenterar ett sätt att utföra multidimensionell fouriertransform i en omgivning med kontinuerligt flödande sample genom att beräkna transformen av varje dimension för sig i en pipeline. Resultatet är en tredimensionell pipelinad fouriertransform som är implementerad på en Stratix III FPGA. Denna klarar av att beräkna fouriertransformen av en indatastorlek på sampler som är 32 bitar breda. Den största utmaningen och centrala delen av designen är datapermutation, denna del använder sig av DDR2 SDRAM och inbyggda BRAM för att spara och permutera data mellan de endimensionella transformmodulerna. Utvärderingen visar att designen är hårdvarueffektiv och att fördröjningen är relativt låg och ligger på 84.2 ms.


Contents

1 Introduction
  1.1 Motivation
  1.2 Fast Fourier Transform
    1.2.1 FFT Architectures
  1.3 Multi Dimensional Fast Fourier Transform
    1.3.1 Architectures
  1.4 Challenges in MD FFT Design
    1.4.1 Pipelined or Iterative Architecture?
    1.4.2 Permutations in Multi Dimensional Fast Fourier Transforms
  1.5 Contribution
  1.6 Outline

2 Background
  2.1 SDRAM Architecture
  2.2 Bit Dimension Permutation
    2.2.1 Introduction
    2.2.2 Notation and Inverse Permutation
    2.2.3 Periodicity and Address Mapping
  2.3 Previous Works
    2.3.1 2D FFT for Real Time Applications
    2.3.2 Bandwidth Intense FPGA Architecture for Multi-Dimensional DFT
    2.3.3 High Performance 3D-FFT Implementation
  2.4 Comparison of Available MD FFT Architectures

3 Proposed 2D and 3D FFT Architecture
  3.1 Problem Formulation
    3.1.1 Pipelined or Iterative Architecture?
    3.1.2 Permutations on Memories with Access Limitations
  3.2 Proposed Approach
    Step 1 - External Memory?
    Step 2 - Extract Timings and Parameters
    Step 3 - Schedule Design
    Step 4 - Memory Permutation Design
    Step 5 - Auxiliary Permutation Circuit Design

4 Implementation
  FFT
  Transposition and Bit Reversal on BRAM
  Bit Reversal
  Permutations on SDRAM
    Step 1 - Need of External Memory
    Step 2 - External Memory Constraints and Parameters
    Step 3 - Scheduling
    Step 4 - SDRAM Permutation Design
    Step 5 - Auxiliary Permutation Circuit
  Performance Numbers
    Throughput
    Latency
    Hardware Utilization

5 Comparison

6 Conclusions
  Future Works

Bibliography

List of Figures
  SDRAM Memory Array
  Memory Interleaved Mapping
  Memory Access Pattern
  Data Set Slice Decomposition
  MD FFT Architecture
  CGRA PE Architecture
  Architecture Overview
  Schedule Command Order
  Ideal 3D Rotation
  Memory Permutation with Locked Bits
  Memory Permutation with Locked Bits and Bit Reversal
  General Auxiliary Permutation
  General Auxiliary Permutation with Bit Reversal
  System Overview with Permutations
  FFT Overview
  Block Random Access Memory (BRAM) Permutation Overview
  BRAM Bit Order
  BRAM Permutation Architecture
  Bit Reversal Overview
  Bit Reversal Bit Order
  Bit Reversal Architecture
  SDRAM Overview
  SDRAM Bit Order
  Auxiliary Circuit Bit Order
  SDRAM Schedule
  SDRAM Permutation
  SDRAM Permutation Architecture
  Auxiliary Permutation Circuit Architecture
  Comparison Throughput and External Locations per Sample
  Comparison Throughput per kbit
  Comparison Throughput versus FPGA Slices

List of Tables
  Garrido Performance Numbers
  Yu 2D FFT Performance Numbers
  Yu 3D FFT Performance Numbers Virtex 5
  Yu 3D FFT Performance Numbers Virtex 6
  Nidhi Performance Numbers
  Comparison of MD FFTs in Previous Works
  Schedule Constraints
  Synthesis Results
  Comparison of MD FFTs in Previous Works


Acronyms

BDP: Bit Dimension Permutation
BL: Burst Length
BRAM: Block Random Access Memory
CGRA: Coarse Grain Reconfigurable Architecture
DDR: Double Data Rate
DFT: Discrete Fourier Transform
DIF: Decimation In Frequency
DIT: Decimation In Time
FIFO: First In First Out
FFT: Fast Fourier Transform
1D: One Dimensional
2D: Two Dimensional
3D: Three Dimensional
MD: Multi Dimensional
FPGA: Field Programmable Gate Array
MUX: Multiplexer
PHY: PHYsical layer
RAM: Random Access Memory
SAR: Synthetic Aperture Radar
SDR: Single Data Rate
SDRAM: Synchronous Dynamic Random Access Memory
SO-DIMM: Small Outline Dual In-line Memory Module


Chapter 1
Introduction

This thesis proposes a way of performing data permutations in a Multi Dimensional Fast Fourier Transform suited for calculations in a real-time continuous flow environment. This is done by a statically scheduled SDRAM controller together with address mappings and some additional permutation circuits designed with the aid of bit dimension permutation theory. The result is a three dimensional pipelined FFT capable of performing calculations on data sets of samples with a throughput of one sample per clock cycle. The system is designed for an Altera Stratix III FPGA and uses a DDR2 SDRAM memory for data permutation. In the following, a brief introduction to the main topics is presented, followed by the outline. A more detailed walk-through of the topics and previous works can be found in Chapter 2.

1.1 Motivation

Multi Dimensional FFTs are computational kernels which are widely used in real world applications. The 2D FFT is for example used in Synthetic Aperture Radar (SAR) [18], and the 3D FFT can be utilized in motion detection [14], astrophysics, molecular dynamics, cosmology, reverse tomography, turbulence simulations [17] and SAR 3D reconstruction [10].

At the moment of writing, in-place or iterative 2D and 3D FFTs have been implemented in hardware, but only 2D FFTs exist with pipelined architectures. The goal of this thesis was to find a way to design a pipelined 3D FFT with a constant throughput, using memory permutations that can be performed on external memory and hence support larger input data sets. The motivation for this is to enable the use of the 3D FFT in real-time systems demanding a constant throughput of samples.

1.2 Fast Fourier Transform

The Fourier transform is a way of translating signals from the time domain into the frequency domain and vice versa. The Discrete Fourier Transform (DFT) is a variant of the Fourier transform on a discrete rather than continuous data set. The Fast Fourier Transform (FFT) was invented by Cooley and Tukey in 1965 [4]. It is an algorithm which reduces the amount of computer calculations needed to perform the DFT. Since then, the FFT has been widely used in many applications, and is still being used today. Research on efficient implementations of the algorithm is still carried out.

1.2.1 FFT Architectures

FFT implementations can in general be categorized into two groups: in-place/iterative and pipelined architectures. An iterative architecture reuses one or more processing elements iteratively to calculate the result, while a pipelined architecture uses a series of processing elements to continuously calculate the result from a stream of samples. Both have advantages and disadvantages. The iterative approach is more hardware efficient since it reuses the same hardware for several calculations, but it is not suitable for continuous flow. The pipelined approach, however, is more suited for a continuous flow environment where low latency is important, but the cost of this higher performance is more hardware [8].

1.3 Multi Dimensional Fast Fourier Transform

The Multi Dimensional FFT (MD FFT) is used in various real world applications. Essentially it is an FFT applied on a multi dimensional data set.

1.3.1 Architectures

The design of the MD FFT architecture can also be grouped into pipelined and iterative architectures. The pipelined architectures use dedicated hardware like pipelined 1D FFTs to enable a continuous flow of samples to be calculated [8]. On the contrary, iterative architectures such as [17, 20] are typically designed to be more flexible and use less resources by reusing the available processing elements, at the cost of lower performance, just as for 1D FFTs.

1.4 Challenges in MD FFT Design

When designing an MD FFT for contemporary signal processing applications, the constraints for high throughput and low latency can be very stringent. As the resolutions of different sensors tend to increase, so does the amount of data to calculate. In real time applications this puts tough demands on multi dimensional calculations, since the amount of data can be much larger than for one dimension while the required throughput stays the same.
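To put rough numbers on this, the short sketch below computes the data volume and the time available per data set for a cubic 3D data set streamed at one sample per clock cycle. The side length and clock frequency are example values chosen only for illustration, not figures from this thesis.

```python
# Rough back-of-the-envelope sizing for a streamed 3D data set.
# The parameter values below are illustrative assumptions only.
n = 256                 # side of the cubic data set (assumed example)
word_bits = 32          # bits per sample (word size used later in this thesis)
f_clk = 100e6           # clock frequency in Hz (assumed example)

samples = n ** 3                          # total samples per 3D data set
data_mbytes = samples * word_bits / 8 / 2**20
seconds_per_set = samples / f_clk         # at one sample per clock cycle

print(f"{samples} samples = {data_mbytes:.0f} MiB per data set")
print(f"one data set every {seconds_per_set * 1e3:.1f} ms at {f_clk / 1e6:.0f} MHz")
```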

1.4.1 Pipelined or Iterative Architecture?

The choice of architecture depends on the design goals. If the main goal is high performance, a pipelined architecture should be chosen; if it is versatility or low area cost, an iterative architecture is the better fit. In the field of real time continuous flow processing, the overhead of an iterative architecture can be devastating to performance, and in these cases a pipelined alternative is much preferred.

1.4.2 Permutations in Multi Dimensional Fast Fourier Transforms

An MD FFT can be performed in various ways. The most straightforward is to apply a one dimensional FFT (1D FFT) to each dimension, and after calculating the FFT of one dimension, permute the data in such a way that the order fits the subsequent FFT calculation. This permutation is the main challenge when designing such MD FFTs, since the amount of data grows fast with the size of the FFT if two or more dimensions are involved (n², n³, ...). It requires large and fast memories with high flexibility on which to perform the permutations. This is valid for both pipelined and iterative architectures, since both approaches need to access a lot of data in as little time as possible.

But large memories with low access times are hard to find, and most contemporary MD FFT designs such as [8, 15, 17, 18, 20] use SDRAMs for these operations. SDRAMs are big and fast but have strict limitations on how data can be accessed. You must first activate the partition of the memory, called a row, where the data is stored in order to later read or manipulate it. Also, only a limited number of rows can be activated at the same time [11, 12]. This means that you cannot freely permute data on an SDRAM; you have to place the data inside the memory in such a way that you can perform the required permutations despite the limitations.

Several approaches on how to solve this problem have been proposed, not only in the area of the FFT but also in image processing where similar problems exist. The works related to image processing [3, 13] focus on solving the access problem when transposing square blocks of pixels, which is very common and present in for example JPEG coding. Baozhao et al. map squares onto rows of the memory [3]. Kim et al. go in the same direction but also put constraints on where to put contiguous squares in order to fetch the next square while calculating the present one without activating new rows [13].

For the MD FFT there is one permutation between every pair of consecutive dimensions. Between the first and second dimension a matrix transposition needs to be done, much like the one described above for image processing. The difference is the significantly larger amount of data to transpose.
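Before looking at how previous works realize these permutations in memory, the sketch below gives a software reference for the dimension-by-dimension approach itself: a 3D FFT is computed by applying a 1D FFT along one axis at a time, with an explicit data reordering in between, and the result is checked against a library 3D FFT. It only illustrates the data flow that the hardware pipeline mimics; the data set size is an arbitrary example.

```python
import numpy as np

def fft3d_dimension_by_dimension(x):
    """3D FFT computed as three passes of 1D FFTs with a reordering between passes."""
    for _ in range(3):
        x = np.fft.fft(x, axis=-1)       # 1D FFT along the innermost (streaming) dimension
        x = np.moveaxis(x, -1, 0)        # permute the data so the next dimension becomes innermost
    return x                             # after three passes the original axis order is restored

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 32)) + 1j * rng.standard_normal((8, 16, 32))
assert np.allclose(fft3d_dimension_by_dimension(x), np.fft.fftn(x))
```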

In [1, 15] each two dimensional data set is divided into sub blocks which fit inside a memory row, much like the image processing approach. Dou et al. [5] do the opposite and map rows of data onto blocks in memory. Garrido [8] also goes for the block-onto-a-row solution but digs further into how to address the memory in a good way by using Bit Dimension Permutations (BDP).

Between the second and third dimension of a 3D FFT the amount of data has grown much larger, from n² to n³ when the data set is cubic, because we now need to perform a rotation or reordering of three dimensions instead of two. The solutions found in previous works on how to handle this only apply to iterative architectures; hence, to our knowledge no pipelined 3D FFT has yet been published. Nidhi et al. [17] have a number of processing elements equal to the size of one of the three dimensions. Each processing element then has a local memory which can store a two dimensional slice of the data set. The sides of the slices equal the respective sizes of the two other dimensions. The processing element performs a transposition locally, followed by an exchange of data with the other processing elements via an interconnect network, such that the required data permutation is performed. Yu et al. [20] tackle the problem in a different way. After calculating the first trivial dimension, a number of rows are read which are spread out in one of the other two dimensions. The FFT is then calculated on the column data of these rows. This is repeated until all rows have been calculated. Then blocks of subsequent rows of the same dimension are read and calculated. This is repeated for the third and last dimension, and the approach is also generalizable to more dimensions.

1.5 Contribution

This thesis aims at finding a way of performing the permutation between the second and third dimension for pipelined architectures with a throughput of at least one sample per clock cycle, using an SDRAM and bit dimension permutations. This is done by having a dedicated memory performing the permutation and an auxiliary permutation circuit to permute the parts that cannot be permuted on the SDRAM. This is then used to design a pipelined 3D FFT.

1.6 Outline

This first chapter presents an introduction to the difficulties of designing an MD FFT and the solutions proposed to these problems in previous works. In Chapter 2 we dig deeper into the previous works and explain in more detail what the solutions are about, and also present an introduction to the theory behind bit dimension permutations. After that we present the proposed architecture in Chapter 3, followed by an implementation example in Chapter 4. Finally, we discuss the

result and compare it to previous works in Chapter 5, followed by conclusions and suggestions for future works in Chapter 6.


Chapter 2
Background

In this chapter we dig deeper into the previous works. We present the interesting solutions topic by topic in more detail and review their suitability for performing MD FFT in a continuous flow environment. We begin with a presentation of the SDRAM architecture to ease the understanding of the access restrictions that are problematic when performing permutations, after which we present previous works on bit dimension permutation circuits and theory. This is followed by reviews of an excerpt of the most interesting existing MD FFT architectures regarding 2D and 3D FFT and real time calculations, their hardware cost and their performance in such an environment. To round it all up we present a comparison of the available MD FFT architectures.

2.1 SDRAM Architecture

The SDRAM is a dynamic memory architecture, which means that data is stored inside it as charge. This charge leaks away due to leakage currents, and because of this the data has to be refreshed periodically if it is not to be lost. There exist different standards for the SDRAM architecture and we will focus on the DDR2 standard issued by JEDEC [12]. This standard proposes an architecture where the memory cells are divided into banks, rows and columns. One column is simply one data word, one row is the amount of data that is accessible for read and write operations in each bank, and the number of banks determines how many rows can be open at any instant of time. In Figure 2.1 we can identify the banks, and the rows are inside the sense amplifiers.

Memory accesses to an SDRAM are burst oriented. This means that a read or write access starts at a given address and continues for a given burst length in a predetermined order, e.g. natural order 1, 2, 3 and so on. A memory location has to be activated before it can be accessed. This is done by sending an ACTIVATE command together with the row and bank address of the memory

location. Once the correct row is activated one can address the correct column on which to start the operation. Hence data manipulations can only be done inside or between open rows. To close an open row the PRECHARGE command is issued. This can be done automatically by indicating AUTO PRECHARGE when performing a read or write access [12]. The above mentioned characteristics are specific to DDR2 SDRAMs, but similarities can be found in other SDRAM architectures.

Figure 2.1. An example of a DDR2 SDRAM memory array (banks 0-3 with sense amplifiers, row decoder, column decoder and data bus).

In Figure 2.1 we see a simplified sketch of the memory array of a DDR2 SDRAM memory. It consists of four memory banks, Bank 0 - Bank 3. Each bank has its own sense amplifier which acts as a row buffer and enables a total of one open row per bank. The row to be activated is selected by the Row Decoder based on the Row Address. All outer data communication passes through the Data Bus, and the Column Decoder selects which data words are affected by the current operation based on the Column Address.

The latency of a DDR2 SDRAM is usually presented as the CAS latency, tRCD, tRP and tRAS. The CAS latency is the time it takes from issuing a column command until the SDRAM starts processing it. tRCD is the time from activating a row until the first column command can be issued. tRP is the bank precharge time, which indicates how long you have to wait before activating a new row after precharging the previous row in the same bank. tRAS is the time you have to wait after activating a row in a bank until you can precharge that same row. tRC is the row command cycle time, which indicates the minimum time between two row activations in the same bank [12]. Together with the allowed clock frequency these values give an indication of how fast the memory is.
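To make the impact of these parameters concrete, the sketch below models, in clock cycles, the approximate cost of reading a block of data as a function of how many row activations it requires. The timing values are placeholder examples in the spirit of a DDR2 datasheet rather than figures for any specific device, and refresh as well as command overlap are ignored.

```python
# Rough, simplified cost model for SDRAM reads (illustrative assumptions only).
# All times are in memory clock cycles; the values are made-up examples.
T_RCD = 4   # ACTIVATE to first column command
T_RP  = 4   # PRECHARGE to next ACTIVATE in the same bank
CAS   = 4   # column command to first data
BL    = 8   # burst length in data words (DDR transfers BL words in BL/2 cycles)

def read_cost(words, row_activations):
    """Approximate cycles to read 'words' data words spread over 'row_activations' rows."""
    bursts = -(-words // BL)                        # ceiling division: number of read bursts
    row_overhead = row_activations * (T_RP + T_RCD)
    return row_overhead + CAS + bursts * (BL // 2)

# Reading 4096 words from a few rows vs. scattered over many rows:
print(read_cost(4096, row_activations=4))    # mostly burst transfer time
print(read_cost(4096, row_activations=512))  # dominated by row activation overhead
```

The point of the model is only that the burst transfer time is fixed by the amount of data, whereas the row activation overhead depends entirely on how the data is placed, which is what the permutation design in this thesis tries to control.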

2.2 Bit Dimension Permutation

Bit Dimension Permutation (BDP) is a way of performing any data permutation by interchanging or inverting bits in the sample indexes [6, 8, 9, 16]. This is convenient since it is easy to control such permutations by using the bits of a binary counter, which allows for hardware efficient permutation circuits with low complexity. First we will go through the basics and see how it works, and later on we will present some examples whose purpose is to ease the understanding.

2.2.1 Introduction

Let us say that we have a series of samples and give each of them an individual binary index. The number of bits (the number of bit dimensions) needed to represent all the different indexes is then n = log2(N), where N is the number of samples. If we think in bit dimensions (binary), every dimension can take the value 1 or 0. This is the same as one bit in our sample index. You can move groups of data by interchanging index bits with one another or inverting them; by doing so you are reordering the indexed samples. Every possible reordering, also called permutation, can be broken down into these simple manipulations of bits. It may be necessary to do a series of them depending on the complexity of the permutation to perform, but the method remains the same. More theory on how this works in practice and the different kinds of permutations that are achieved by them can be found in [8]. The movements of index bits can, in the case of a real-time system, be seen as different delays applied on samples arriving serially at the input. The method of applying delays to samples is proposed in [9] in order to make optimal bit reversal circuits for serial data.

2.2.2 Notation and Inverse Permutation

We annotate a BDP as a function σ of the different index bits u_x in the following manner:

σ(u_{n-1}, ..., u_1, u_0) = u_0, u_1, ..., u_{n-1}    (2.1)

The example above is a bit reversal of n index bits. This can be seen by noting that the order is completely reversed, i.e. u_{n-1} has switched place with u_0, u_{n-2} with u_1 and so on. This is of course not the only permutation that can be formulated in this way, but it is a very important one in the area of the FFT, since the result is bit reversed if the input data is in natural order and vice versa [8].
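As a small illustration of what such an index manipulation does to a block of data, the sketch below applies a bit dimension permutation, given as a list that states which original index bit ends up in each output bit position, to every sample index of a vector; with the bit reversal σ above it reproduces the familiar bit-reversed ordering. The list convention is an assumption made for this sketch, which is only a software model of the reordering, not of any particular circuit in this thesis.

```python
def apply_bdp(data, sigma):
    """Reorder 'data' by permuting the bits of each sample index.

    sigma[k] gives the original index bit that is placed at bit position k
    of the new index, so sigma = [n-1, ..., 1, 0] is a bit reversal.
    """
    n = len(sigma)
    assert len(data) == 1 << n
    out = [None] * len(data)
    for i, sample in enumerate(data):
        j = 0
        for k, src in enumerate(sigma):
            j |= ((i >> src) & 1) << k     # move bit 'src' of i to bit 'k' of j
        out[j] = sample
    return out

n = 3
bit_reversal = list(reversed(range(n)))          # sigma(u2, u1, u0) = u0, u1, u2
print(apply_bdp(list(range(8)), bit_reversal))   # [0, 4, 2, 6, 1, 5, 3, 7]
```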

In the case of a dedicated BDP circuit, the way of mapping the bits of a counter onto the required control signals is presented in [9]. If we are to perform the permutation on a memory we will instead use different address mappings as our way of permuting the data. These mappings describe how each counter bit is mapped to the memory address bits. The mappings will have to change periodically if we would like to avoid double buffering. In double buffering you use two buffers instead of one and write to one buffer while you read from the other; once you have emptied the reading buffer and filled up the writing buffer you switch, so that you write into the buffer you previously read from and vice versa. This can be avoided by always first reading from and then writing to the same location in memory. For now we permit ourselves to use double buffers and can therefore manage with just designing one mapping for each permutation, assuming that we always write samples in natural order. In this way we can take a closer look at how to map the counter bits in order to perform a given permutation. Later, in the following section on periodicity and address mapping, we will treat the case when we do not allow double buffering.

Once we have determined a permutation to perform and have formulated it as a σ-function as in equation 2.1, the way of mapping the counter bits onto the memory address is identical to the inverse permutation σ⁻¹ as defined in [8] and presented in equation 2.2:

σ(u) = u'  ⟺  σ⁻¹(u') = u    (2.2)

If the inverse permutation is applied to the result of the permutation, the result of them both will be the identity function Id, which is defined as Id(u) = u [8]; hence it would be the same as not permuting at all. This is true for all permutations. When you read and write to the same location, the difference is that you in most cases will have to define several permutations that are applied one after another, instead of just one, in order to make it work.

2.2.3 Periodicity and Address Mapping

The periodicity of a BDP is defined as the number of times you have to apply the permutation until you reach the identity function. This can be formulated as σ^k = Id, where k is the periodicity [8]. In the case of bit reversal the periodicity is equal to two; hence it is also its own inverse permutation. This can be proven by reversing the bits twice and verifying that the bit order is again natural.

Periodicity is important when we want to avoid double buffering when permuting on a memory, because it defines how many address mappings we have to use in order to make it work, and hence it also gives a hint about the complexity of the hardware. The number of address mappings will always be at least the periodicity.

Let us take again the bit reversal as our first example. If the data set is written in natural order, we will have to read it out in bit reversed order to perform the permutation without additional hardware. If our permutation is to handle continuous flow without double buffering, then we will receive the next data set while we output the previous one. Hence we will write the next data set using the bit reversed mapping at the same time as we output the first, and later read it out using the natural ordered mapping again while receiving the third data set. If we had a permutation with higher periodicity, the number of mappings would increase, and hence also the period of circulating through them and applying them to the data sets. Each data set will, though, only be affected by two address mappings, one for writing and one for reading, and the resulting permutation of these two mappings should in all cases be the permutation which we were aiming to perform. This is the more difficult part when performing more complicated permutations, such as the Three Dimensional (3D) rotation performed on an SDRAM memory with access limitations, which reorders all three dimensions of the 3D data set in a 3D FFT.
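The short simulation below illustrates this alternation for the simple bit reversal case: a single memory the size of one data set is read and written at the same address each cycle, and the address sequence alternates between the natural and the bit reversed mapping for every other data set. Each data set still comes out bit reversed, with no double buffering. The code is a behavioural sketch of the scheme described above, not of the actual hardware.

```python
def bit_reverse(i, n):
    """Reverse the n least significant bits of i."""
    return int(format(i, f"0{n}b")[::-1], 2)

def stream_bit_reversal(datasets, n):
    """Continuously bit-reverse a stream of data sets using one shared memory."""
    size = 1 << n
    mem = [None] * size
    mappings = [lambda c: c, lambda c: bit_reverse(c, n)]  # periodicity 2 -> two mappings
    outputs = []
    for s, ds in enumerate(datasets):
        out = []
        addr_map = mappings[s % 2]         # mapping used while this data set is written
        for c in range(size):
            a = addr_map(c)
            out.append(mem[a])             # read the previous data set from this address...
            mem[a] = ds[c]                 # ...then write the new sample to the same address
        outputs.append(out)
    return outputs

n = 3
sets = [[f"A{i}" for i in range(8)], [f"B{i}" for i in range(8)], [f"C{i}" for i in range(8)]]
outs = stream_bit_reversal(sets, n)
print(outs[1])  # data set A read out while B is written: ['A0', 'A4', 'A2', 'A6', ...]
print(outs[2])  # data set B read out while C is written, also in bit reversed order
```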

2.3 Previous Works

This section presents in more detail the existing designs of MD FFTs which are most relevant to the topic. It is explained how they work, together with their performance and hardware cost.

2.3.1 2D FFT for Real Time Applications

In [8] a pipelined 2D FFT design approach and example are presented, tailored for real time continuous flow applications. The design can be used with any ordinary 1D FFT. The idea is to put two 1D FFTs in series, together with a transposition in between them to reorder data, so that if you feed the first FFT with data from the 2D input data set row-wise, then the second FFT will receive it column-wise. It can be shown that this is a valid approach by splitting up the two sums of the equation for the 2D DFT into an outer sum for one dimension and an inner sum for the other. Hence, one of the sums can be calculated on top of the result of the other, and that is why each dimension can have its own 1D FFT and the result can be calculated one dimension at a time.

Functional Description

Since ordinary 1D FFTs are used, the main challenge is to design the permutation circuit that performs the transposition between them. This permutation circuit performs bit dimension permutation and is based on a counter and a memory.

The counter is used to count the incoming samples, which arrive one sample per clock cycle. The design treats four complications together with any combination of them: the system may receive the first sample of the next data set the clock cycle after the last sample of the previous data set, the access to the memory might be limited, there might be a need for using several memories in parallel in order to achieve the desired throughput of one sample per clock cycle and, fourthly, the data set to calculate does not have to be square but may be non-square.

The transposition to be performed can be formulated as the BDP presented in equation 2.3, where n is the total number of bit dimensions in the index and j is the number of column index bits. The special case of a square data set appears if j is replaced by n/2.

σ(u_{n-1}, ..., u_j, u_{j-1}, ..., u_0) = u_{j-1}, ..., u_0, u_{n-1}, ..., u_j    (2.3)

where u_{n-1}, ..., u_j are the row index bits and u_{j-1}, ..., u_0 are the column index bits.

As can be realized by applying the permutation twice to the special case of a square data set, its periodicity is two. Hence, a transposition of a square matrix is its own inverse and the address mapping is the same as the permutation. This is, though, only valid for the cases when a memory without access limitations is used. If the data set is non-square, the periodicity is no longer two, and in that case more than one address mapping is necessary in order to design a functional system.

If an SDRAM is used for performing the permutation, and it is only possible to access one row at a time (the bank address bits can be interpreted as column address bits, since they can be accessed without limitation), the solution is to map maximum sized squares of data onto rows of the memory. Note that the data set does not have to be square, only divisible into smaller squares. The squares' sides are all equal to L, and the number of index bits needed to index one side of a square is λ = log2(L). This requires that L, as well as the number of rows and columns of the data set, are powers of two. If the number of columns in each row of the memory is M_C, it implies that L² ≤ M_C. Hence, the first λ row and column bits are mapped to column bits of the memory to ensure access both row- and column-wise inside the square and to equalize the row changing overhead, so that it is the same for each address mapping. The bigger the squares can be, the lower the overhead; therefore it is advised to maximize the value of L.

If the data set is square, the permutation remains its own inverse and hence the periodicity is still two. If, on the other hand, the data set is non-square and hence the periodicity is not two but a number k, then L is further limited since more address bits have to be mapped to columns in order to realize more address mappings. Instead of the previous limit L² ≤ M_C it is now limited by L^k ≤ M_C due to the periodicity. In practice this means that the SDRAM overhead for calculating non-square data sets should be larger than that of square data sets of comparable sizes.

Furthermore, if the permutation is not possible to perform in its entirety inside the memory, a small auxiliary permutation circuit is used. Together they enable the permutation to be performed on a constant flow of data sets.
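The following sketch checks equation 2.3 in software for a small non-square example: permuting the bits of the linear sample index by swapping the row and column bit groups gives exactly the index order of the transposed matrix. The index bit convention (row bits above column bits, row-major storage) is an assumption made for this illustration.

```python
import numpy as np

def transpose_bdp(i, n, j):
    """Transposition BDP of eq. 2.3 applied to an n-bit linear index i.

    The index is taken as (row bits | column bits) with j column bits; the
    permutation swaps the two groups, giving (column bits | row bits).
    """
    cols = i & ((1 << j) - 1)        # u_{j-1}, ..., u_0 : column index
    rows = i >> j                    # u_{n-1}, ..., u_j : row index
    return (cols << (n - j)) | rows

n, j = 5, 2                          # 8 x 4 data set: 3 row bits, 2 column bits
R, C = 1 << (n - j), 1 << j
x = np.arange(R * C)                 # samples stored row-major; sample value = its index
A = x.reshape(R, C)

out = np.empty_like(x)
for i in range(R * C):
    out[transpose_bdp(i, n, j)] = x[i]   # sample with index i lands at the permuted address

assert np.array_equal(out.reshape(C, R), A.T)   # the BDP realizes the matrix transposition
```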

In the last scenario, when several memories are used, the squares are mapped onto rows in the different memories in an interleaved fashion as depicted in Figure 2.2. Here C is the number of memories, N is the number of samples in each row and column of the data set and L is the side of each square. As can be seen, this data set is square, but it does not necessarily have to be so.

Figure 2.2. An example of how to map squares of data onto different memories in an interleaved fashion. This scheme is made for reading and writing to all, in this example four, memories simultaneously.

Table 2.1. The performance numbers of the 2D FFT for real time applications proposed in [8] (columns: Size, Slices, BRAMs, Clk Freq. (MHz), Latency (ms)).

Hardware Cost and Performance

As this design is a pipelined 2D FFT, it is designed to maintain a constant throughput. In this case it is set to one sample per clock cycle (1 Sample/clk). It is

also parameterized so it can easily be reconfigured to different data set sizes, including rectangular ones. In Table 2.1 we can see examples of hardware usage and achievable clock frequency for some given 2D FFT sizes. The sample word length is 2 × 16 bits. The design is implemented in a Virtex 5 FPGA and also uses 4 Micron MT46V32M16 SDRAM chips with a total memory capacity of 256 MB. The reads and writes to and from these external SDRAMs are only performed once per sample, which means that in total two memory accesses per sample are necessary. The throughput in MSamples/s is equal to the clock frequency column of Table 2.1.

2.3.2 Bandwidth Intense FPGA Architecture for Multi-Dimensional DFT

This design, which is proposed in [19], belongs to the iterative family of FFT architectures and performs 2D and 3D FFT by reading data from SDRAM and loading it onto two local memories that function as ping pong buffers. It performs a row FFT on spread rows, followed by a so called column stride FFT, followed by twiddle multiplications on the columns of those rows. The next step is reading whole or parts of contiguous rows and performing a so called column local FFT and a column-wise permutation on these rows; once this is done for all rows of the data set the 2D FFT has been calculated. To calculate a 3D FFT the 2D FFT procedure is carried out for all 2D slices of the 3D data set, after which stride FFT, twiddle multiplication, local FFT and permutation are performed in the third dimension, much like the column operations of the 2D FFT.

Functional Description

The size of each dimension is N_d, where d is the index of the dimension. The size of each local memory is denoted S. In the first step, focus is on row operations. The number of rows to be read for each iteration of this step is m, and the spacing between each such row is p; the number of rows is then N_2 = m·p. The number of iterations of the step is p. During one step, the row-wise FFTs are calculated for all rows; after this, a column-wise FFT of size m is applied on the columns of the rows, followed by twiddle multiplications. When this is done the result is written back to memory. In the second step, L contiguous rows with B elements per row are read from memory and column-wise FFTs of size p are performed on these rows, followed by a column-wise permutation. Once this step is finished the 2D FFT calculation is complete. The memory accesses are illustrated in Figure 2.3, where each row of the data set is mapped to a row in the memory.
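The stride FFT / twiddle / local FFT sequence follows the general pattern of the classic four-step (Cooley-Tukey) decomposition, in which an FFT of length m·p is split into FFTs of length m, a twiddle factor multiplication, and FFTs of length p. The sketch below shows that decomposition in numpy for a single 1D transform; it is meant only as a reference for the mathematics, not as a description of the memory layout or scheduling used in [19].

```python
import numpy as np

def four_step_fft(x, m, p):
    """FFT of length N = m*p via m-point 'stride' FFTs, twiddles and p-point 'local' FFTs."""
    N = m * p
    A = x.reshape(m, p)                     # A[j1, j2] = x[j1*p + j2]
    B = np.fft.fft(A, axis=0)               # m-point FFTs down the columns (elements p apart)
    k1 = np.arange(m).reshape(m, 1)
    j2 = np.arange(p).reshape(1, p)
    B *= np.exp(-2j * np.pi * k1 * j2 / N)  # twiddle factors W_N^(k1*j2)
    D = np.fft.fft(B, axis=1)               # p-point FFTs along the rows (local FFTs)
    return D.T.reshape(-1)                  # reorder so that X[k1 + m*k2] = D[k1, k2]

x = np.random.default_rng(1).standard_normal(1024) + 0j
assert np.allclose(four_step_fft(x, m=32, p=32), np.fft.fft(x))
```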

Figure 2.3. The access patterns of step one and two in the procedure of calculating the 2D FFT (a) Row DFT, b) Column Stride DFT, c) Column Local DFT). Based on figure in [19].

Figure 2.4. The decomposition of a 3D data set into 2D slices (a) 2D DFT on each slice, b) DFT along d3 on each slice). Based on figure in [19].

If a 3D FFT is to be calculated, this process is repeated for every slice of the 3D data set spanned by the first and second dimensions. The slice decomposition can be seen in Figure 2.4. The architecture used to perform these calculations is depicted in Figure 2.5.

Hardware Cost and Performance

This design has also been implemented on a Virtex 5 FPGA. The number of occupied slices is 8,273 and it uses 68 DSP48E blocks in addition to 87 BRAMs for the example case in which the maximum FFT length is 2048 samples. The clock frequency is constant at 100 MHz. The achieved latency and throughput of the 2D and 3D FFT can be seen in Table 2.2 and Table 2.3, respectively. The sample word length is 2 × 32 bits.

Figure 2.5. The proposed architecture used for calculating the 2D and 3D FFT (SDRAM controller, local memories used as ping pong buffers, switches, 1D FFTs and a twiddle factor ROM inside the FPGA). Based on figure in [19].

Table 2.2. The performance numbers of the 2D FFT proposed in [19] (columns: Size, Latency (ms), Throughput (MSamples/s)).

Table 2.3. The performance numbers of the 3D FFT on Virtex 5 proposed in [19] (columns: Size, Latency (ms), Throughput (MSamples/s)).

This design has also been simulated on a Virtex 6 FPGA in [20], so no hardware usage has been reported, but the acquired performance for the 3D FFT is presented in Table 2.4. For all cases, this architecture uses an external memory of at least twice the size of the data set.

Table 2.4. The performance numbers of the 3D FFT on Virtex 6 proposed in [20] (columns: Size, Latency (ms), Throughput (MSamples/s)).

2.3.3 High Performance 3D-FFT Implementation

A way of performing the 3D FFT on a Coarse Grain Reconfigurable Architecture (CGRA) implemented inside a Field Programmable Gate Array (FPGA) is proposed in [17]. It is an iterative architecture based on a network of processing elements, each consisting of a processor capable of performing butterfly operations, an instruction memory and a data memory. The permutations of data are performed by reconfiguring the interconnection network that connects each processing element to its neighbors.

Figure 2.6. The architecture of a processing element in the CGRA (sequencer, instruction memory, data memory and a DSP48E slice).

Functional Description

The system consists of a set of processing elements, each capable of performing butterfly operations, which consist of complex rotation, addition and subtraction. The architecture of a processing element can be found in Figure 2.6. Its brain is a sequencer which takes instructions from the instruction memory, which it can also manipulate, and initiates operations in the DSP48E slice. This is a hardware macro available in the

Virtex 5 FPGA family, which is used for this design. The data memory provides two read ports and one write port, enabling two operands to be read simultaneously by the DSP slice while allowing write-back of the result. The data memory is also accessible from a neighbor via the reconfigurable interconnection network. The processing element is also capable of accessing another neighbor, such that the processing elements form a semi-systolic array with reconfigurable chains for data movement. When the amount of data is too large to fit inside the data memories of the processing elements, a part of it is stored in external SDRAM. This implies that a part of the 3D FFT is performed on the available data inside the processing elements, then this data is written back to memory and new data is loaded into the data memories. This is iterated until all computations have been performed.

Each processing element has four possible neighbor connection points, whereof only two can be active at a time, one for input and one for output. Because of this, data has to be placed, and the interconnection network configured, in such a way that each processing element can access the required data to perform its operation at any given time. The details of the data movements are a bit unclear, but the results show that a very large number of reconfigurations are needed when the size of the 3D data set approaches 256³, indicating a large overhead when only a small amount of the data set can fit inside the memories of the processing elements.

Hardware Cost and Performance

This design is implemented on a Virtex 5 FPGA and uses 1728 slices for the 32³-sized 3D FFT, in addition to 64 DSP48E blocks and 148 BRAMs. The clock frequency is reported to be 300 MHz for the parts that perform the 3D FFT calculations. The performance numbers for this architecture can be found in Table 2.5. The word length of this design is 48 bits.

Table 2.5. The performance numbers of the 3D FFT in [17] (columns: Size, Latency (ms), Throughput (MSamples/s); sizes from 4x4x4 up to 256x256x256).

It is unclear exactly how many times data has to be written to and read from memory. For the cases where all data can be accommodated in the local memories of the processing elements, we assume that only two accesses, one for read and one for write, are necessary. For the other cases we do not know how many times this has

to be performed, but the decreasing performance with increasing data set size indicates that the number of accesses increases significantly when the data set can no longer fit in the local memory of the processing elements. The size of the external memory is also unclear, but it has to be at least the same size as the data set.

2.4 Comparison of Available MD FFT Architectures

In general it has been difficult to extract all information from the different architectures regarding hardware usage, and more detailed information on how each algorithm works, in order to determine the number of memory accesses needed for a given MD FFT size. Anyhow, the information we have been able to extract is gathered in Table 2.6. Here we can see that Garrido's Virtex 5 2D FFT and Yu's Virtex 6 3D FFT architectures provide significantly higher throughputs than the competition. It should be noted that the hardware usage of Yu's implementation, although an iterative architecture, probably exceeds that of Garrido based on the information we have from the 2D FFT.

Table 2.6. A comparison of the MD FFTs presented in the previous works treated in this chapter (columns: Type (2D/3D FFT), Author, FPGA Family, Size, Memory (internal Kbits / external MBytes), External Locations per Sample, DSP blocks / Slices, Throughput (Mbits/s); the rows cover the Garrido, Yu and Nidhi designs on Virtex 5 and Virtex 6 devices).

Chapter 3
Proposed 2D and 3D FFT Architecture

If we limit ourselves to the solution which calculates the multi dimensional FFT one dimension at a time, we can see that it presents some general difficulties, which will be described in the problem formulation. Once we have gone through the details of the complications, we present the proposed approach to design a general 2D and 3D FFT architecture. Throughout this chapter we make the assumption that the requirements are those of a real-time continuous flow application providing and requiring one sample per clock cycle. The 3D data sets arrive directly one after another, providing a constant stream of input samples.

3.1 Problem Formulation

In a real-time environment, tough constraints on both latency and throughput are common. The typical case is that we are located inside a processing chain, being provided data each clock cycle and required to deliver results at the same rate, within the latency requirement. One example of a hard real-time application could be post processing of a video stream that has to be seen in real-time for crucial decision making. If the latency were too high, the decision maker would perhaps not be able to make the correct decision in time, with potentially catastrophic consequences. Another example could be real-time medical body scanning: in order to provide the doctors with the necessary information in time and to enable fast updates of the image, a low latency is required, and the higher the throughput, the shorter the scan will take, which increases the throughput of patients through the body scanner.

3.1.1 Pipelined or Iterative Architecture?

Given these situations we can see that there is a need for fast computation of both the 2D and 3D FFT in real-time. When placed inside a continuous flow chain, the iterative architectures fall somewhat short. The reason for this is the repeated accesses to a memory for loading and storing intermediate results. Say for example that for the 3D FFT we have to read and write the data set three or four times; then the bandwidth to the memory has to be three or four times that of the processing chain and includes a mandatory clock domain crossing (CDC) or the use of several external memories. From experience, CDCs are known to be tedious to debug if they are not designed in a good manner. Hence, we can conclude that iterative architectures are not well suited but still possible to use in real-time applications, at the cost of very fast or several memories and advanced circuitry.

Pipelined architectures, on the other hand, are inherently adapted to a continuous flow of samples. This enables the 1D FFTs to be calculated without memory accesses, but at the cost of more hardware and mandatory permutations between the dimensions. These permutations require memories, but only read and write data once, so the memory bandwidth can remain that of the processing chain. For small data sets these permutations can be performed in on-chip memories and hence do not complicate things beyond the hardware usage. On the other hand, for large data sets an external memory typically has to be used, and this imposes constraints on the permutations if the memory/memories have access limitations.

3.1.2 Permutations on Memories with Access Limitations

When performing a 3D FFT, a part of it is a 2D FFT. In fact you are calculating the 2D FFT of each slice spanned by two of the three dimensions, followed by FFT calculation of the third dimension. Therefore, the first part of a pipelined 3D FFT is exactly the same as a pipelined 2D FFT as presented in [8], namely a 1D FFT followed by a transposition unit and another 1D FFT. The permutation performing the transposition is therefore present in both 2D and 3D FFTs. The 3D FFT also has another permutation to be performed between the second and third 1D FFT blocks; this permutation has to reorder the whole 3D data set such that data is delivered in the third dimension to the third 1D FFT. If the amount of data is sufficiently large, one or both of these permutations have to be performed on external memory. Typically, the memory architecture used is SDRAM, chosen for its properties of providing both large memory size and high speed. But SDRAMs are dynamic memories which need to be refreshed, and you can only access a limited, activated amount of memory at a time. This presents some difficulties when performing real-time permutations as in the pipelined 2D and 3D FFT.

Firstly, SDRAMs are burst oriented. This means that you have to access a series of samples at a time. In terms of permutations this means that the lowest address


An introduction to SDRAM and memory controllers. 5kk73 An introduction to SDRAM and memory controllers 5kk73 Presentation Outline (part 1) Introduction to SDRAM Basic SDRAM operation Memory efficiency SDRAM controller architecture Conclusions Followed by part

More information

ISSN Vol.05, Issue.12, December-2017, Pages:

ISSN Vol.05, Issue.12, December-2017, Pages: ISSN 2322-0929 Vol.05, Issue.12, December-2017, Pages:1174-1178 www.ijvdcs.org Design of High Speed DDR3 SDRAM Controller NETHAGANI KAMALAKAR 1, G. RAMESH 2 1 PG Scholar, Khammam Institute of Technology

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete Machine Learning for detection of barcodes and OCR Examensarbete utfört i Datorseende vid Tekniska högskolan vid Linköpings

More information

Memory Supplement for Section 3.6 of the textbook

Memory Supplement for Section 3.6 of the textbook The most basic -bit memory is the SR-latch with consists of two cross-coupled NOR gates. R Recall the NOR gate truth table: A S B (A + B) The S stands for Set to remember, and the R for Reset to remember.

More information

A Low Power DDR SDRAM Controller Design P.Anup, R.Ramana Reddy

A Low Power DDR SDRAM Controller Design P.Anup, R.Ramana Reddy A Low Power DDR SDRAM Controller Design P.Anup, R.Ramana Reddy Abstract This paper work leads to a working implementation of a Low Power DDR SDRAM Controller that is meant to be used as a reference for

More information

Programmable Memory Blocks Supporting Content-Addressable Memory

Programmable Memory Blocks Supporting Content-Addressable Memory Programmable Memory Blocks Supporting Content-Addressable Memory Frank Heile, Andrew Leaver, Kerry Veenstra Altera 0 Innovation Dr San Jose, CA 95 USA (408) 544-7000 {frank, aleaver, kerry}@altera.com

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University DRAMs Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Main Memory & Caches Use DRAMs for main memory Fixed width (e.g., 1 word) Connected by fixed-width

More information

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO 2402 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 6, JUNE 2016 A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO Antony Xavier Glittas,

More information

Research Article Design of A Novel 8-point Modified R2MDC with Pipelined Technique for High Speed OFDM Applications

Research Article Design of A Novel 8-point Modified R2MDC with Pipelined Technique for High Speed OFDM Applications Research Journal of Applied Sciences, Engineering and Technology 7(23): 5021-5025, 2014 DOI:10.19026/rjaset.7.895 ISSN: 2040-7459; e-issn: 2040-7467 2014 Maxwell Scientific Publication Corp. Submitted:

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete EtherCAT Communication on FPGA Based Sensor System Examensarbete utfört i Elektroniksystem vid Tekniska Högskolan, Linköpings

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete 3D Position Estimation of a Person of Interest in Multiple Video Sequences: Person of Interest Recognition Examensarbete

More information

Evaluation of instruction prefetch methods for Coresonic DSP processor

Evaluation of instruction prefetch methods for Coresonic DSP processor Master of Science Thesis in Electrical Engineering Department of Electrical Engineering, Linköping University, 2016 Evaluation of instruction prefetch methods for Coresonic DSP processor Tobias Lind Department

More information

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory CS65 Computer Architecture Lecture 9 Memory Hierarchy - Main Memory Andrew Sohn Computer Science Department New Jersey Institute of Technology Lecture 9: Main Memory 9-/ /6/ A. Sohn Memory Cycle Time 5

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete Low Cost Floating-Point Extensions to a Fixed-Point SIMD Datapath Examensarbete utfört i Datorteknik vid Tekniska högskolan

More information

EEM 486: Computer Architecture. Lecture 9. Memory

EEM 486: Computer Architecture. Lecture 9. Memory EEM 486: Computer Architecture Lecture 9 Memory The Big Picture Designing a Multiple Clock Cycle Datapath Processor Control Memory Input Datapath Output The following slides belong to Prof. Onur Mutlu

More information

Keywords: Fast Fourier Transforms (FFT), Multipath Delay Commutator (MDC), Pipelined Architecture, Radix-2 k, VLSI.

Keywords: Fast Fourier Transforms (FFT), Multipath Delay Commutator (MDC), Pipelined Architecture, Radix-2 k, VLSI. ww.semargroup.org www.ijvdcs.org ISSN 2322-0929 Vol.02, Issue.05, August-2014, Pages:0294-0298 Radix-2 k Feed Forward FFT Architectures K.KIRAN KUMAR 1, M.MADHU BABU 2 1 PG Scholar, Dept of VLSI & ES,

More information

TECHNOLOGY BRIEF. Double Data Rate SDRAM: Fast Performance at an Economical Price EXECUTIVE SUMMARY C ONTENTS

TECHNOLOGY BRIEF. Double Data Rate SDRAM: Fast Performance at an Economical Price EXECUTIVE SUMMARY C ONTENTS TECHNOLOGY BRIEF June 2002 Compaq Computer Corporation Prepared by ISS Technology Communications C ONTENTS Executive Summary 1 Notice 2 Introduction 3 SDRAM Operation 3 How CAS Latency Affects System Performance

More information

DESIGN AND IMPLEMENTATION OF SDR SDRAM CONTROLLER IN VHDL. Shruti Hathwalia* 1, Meenakshi Yadav 2

DESIGN AND IMPLEMENTATION OF SDR SDRAM CONTROLLER IN VHDL. Shruti Hathwalia* 1, Meenakshi Yadav 2 ISSN 2277-2685 IJESR/November 2014/ Vol-4/Issue-11/799-807 Shruti Hathwalia et al./ International Journal of Engineering & Science Research DESIGN AND IMPLEMENTATION OF SDR SDRAM CONTROLLER IN VHDL ABSTRACT

More information

Chapter 8 Memory Basics

Chapter 8 Memory Basics Logic and Computer Design Fundamentals Chapter 8 Memory Basics Charles Kime & Thomas Kaminski 2008 Pearson Education, Inc. (Hyperlinks are active in View Show mode) Overview Memory definitions Random Access

More information

A Universal Test Pattern Generator for DDR SDRAM *

A Universal Test Pattern Generator for DDR SDRAM * A Universal Test Pattern Generator for DDR SDRAM * Wei-Lun Wang ( ) Department of Electronic Engineering Cheng Shiu Institute of Technology Kaohsiung, Taiwan, R.O.C. wlwang@cc.csit.edu.tw used to detect

More information

EECS150 - Digital Design Lecture 16 - Memory

EECS150 - Digital Design Lecture 16 - Memory EECS150 - Digital Design Lecture 16 - Memory October 17, 2002 John Wawrzynek Fall 2002 EECS150 - Lec16-mem1 Page 1 Memory Basics Uses: data & program storage general purpose registers buffering table lookups

More information

A Configurable High-Throughput Linear Sorter System

A Configurable High-Throughput Linear Sorter System A Configurable High-Throughput Linear Sorter System Jorge Ortiz Information and Telecommunication Technology Center 2335 Irving Hill Road Lawrence, KS jorgeo@ku.edu David Andrews Computer Science and Computer

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs

An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs HPEC 2004 Abstract Submission Dillon Engineering, Inc. www.dilloneng.com An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs Tom Dillon Dillon Engineering, Inc. This presentation outlines

More information

Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc.

Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc. Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc. CT Image Reconstruction Herman Head Sinogram Herman Head Reconstruction CT Image Reconstruction for all

More information

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Yojana Jadhav 1, A.P. Hatkar 2 PG Student [VLSI & Embedded system], Dept. of ECE, S.V.I.T Engineering College, Chincholi,

More information

24K FFT for 3GPP LTE RACH Detection

24K FFT for 3GPP LTE RACH Detection 24K FFT for GPP LTE RACH Detection ovember 2008, version 1.0 Application ote 515 Introduction In GPP Long Term Evolution (LTE), the user equipment (UE) transmits a random access channel (RACH) on the uplink

More information

Parallel FIR Filters. Chapter 5

Parallel FIR Filters. Chapter 5 Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture

More information

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp. 13 1 CMPE110 Computer Architecture, Winter 2009 Andrea Di Blas 110 Winter 2009 CMPE Cache Direct-mapped cache Reads and writes Cache associativity Cache and performance Textbook Edition: 7.1 to 7.3 Third

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Design and Verification of High Speed SDRAM Controller with Adaptive Bank Management and Command Pipeline

Design and Verification of High Speed SDRAM Controller with Adaptive Bank Management and Command Pipeline Design and Verification of High Speed SDRAM Controller with Adaptive Bank Management and Command Pipeline Ganesh Mottee, P.Shalini Mtech student, Dept of ECE, SIR MVIT Bangalore, VTU university, Karnataka,

More information

ECE 485/585 Microprocessor System Design

ECE 485/585 Microprocessor System Design Microprocessor System Design Lecture 5: Zeshan Chishti DRAM Basics DRAM Evolution SDRAM-based Memory Systems Electrical and Computer Engineering Dept. Maseeh College of Engineering and Computer Science

More information

ECE 485/585 Microprocessor System Design

ECE 485/585 Microprocessor System Design Microprocessor System Design Lecture 4: Memory Hierarchy Memory Taxonomy SRAM Basics Memory Organization DRAM Basics Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering

More information

LOW-POWER SPLIT-RADIX FFT PROCESSORS

LOW-POWER SPLIT-RADIX FFT PROCESSORS LOW-POWER SPLIT-RADIX FFT PROCESSORS Avinash 1, Manjunath Managuli 2, Suresh Babu D 3 ABSTRACT To design a split radix fast Fourier transform is an ideal person for the implementing of a low-power FFT

More information

EECS150 - Digital Design Lecture 17 Memory 2

EECS150 - Digital Design Lecture 17 Memory 2 EECS150 - Digital Design Lecture 17 Memory 2 October 22, 2002 John Wawrzynek Fall 2002 EECS150 Lec17-mem2 Page 1 SDRAM Recap General Characteristics Optimized for high density and therefore low cost/bit

More information

Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units

Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units Abstract: Split-radix fast Fourier transform (SRFFT) is an ideal candidate for the implementation of a lowpower FFT processor, because

More information

Two-level Reconfigurable Architecture for High-Performance Signal Processing

Two-level Reconfigurable Architecture for High-Performance Signal Processing International Conference on Engineering of Reconfigurable Systems and Algorithms, ERSA 04, pp. 177 183, Las Vegas, Nevada, June 2004. Two-level Reconfigurable Architecture for High-Performance Signal Processing

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

FPGA Matrix Multiplier

FPGA Matrix Multiplier FPGA Matrix Multiplier In Hwan Baek Henri Samueli School of Engineering and Applied Science University of California Los Angeles Los Angeles, California Email: chris.inhwan.baek@gmail.com David Boeck Henri

More information

ISSN: [Bilani* et al.,7(2): February, 2018] Impact Factor: 5.164

ISSN: [Bilani* et al.,7(2): February, 2018] Impact Factor: 5.164 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY A REVIEWARTICLE OF SDRAM DESIGN WITH NECESSARY CRITERIA OF DDR CONTROLLER Sushmita Bilani *1 & Mr. Sujeet Mishra 2 *1 M.Tech Student

More information

CS698Y: Modern Memory Systems Lecture-16 (DRAM Timing Constraints) Biswabandan Panda

CS698Y: Modern Memory Systems Lecture-16 (DRAM Timing Constraints) Biswabandan Panda CS698Y: Modern Memory Systems Lecture-16 (DRAM Timing Constraints) Biswabandan Panda biswap@cse.iitk.ac.in https://www.cse.iitk.ac.in/users/biswap/cs698y.html Row decoder Accessing a Row Access Address

More information

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices 3 Digital Systems Implementation Programmable Logic Devices Basic FPGA Architectures Why Programmable Logic Devices (PLDs)? Low cost, low risk way of implementing digital circuits as application specific

More information

Efficient Self-Reconfigurable Implementations Using On-Chip Memory

Efficient Self-Reconfigurable Implementations Using On-Chip Memory 10th International Conference on Field Programmable Logic and Applications, August 2000. Efficient Self-Reconfigurable Implementations Using On-Chip Memory Sameer Wadhwa and Andreas Dandalis University

More information

Performance Evolution of DDR3 SDRAM Controller for Communication Networks

Performance Evolution of DDR3 SDRAM Controller for Communication Networks Performance Evolution of DDR3 SDRAM Controller for Communication Networks U.Venkata Rao 1, G.Siva Suresh Kumar 2, G.Phani Kumar 3 1,2,3 Department of ECE, Sai Ganapathi Engineering College, Visakhaapatnam,

More information

EECS150 - Digital Design Lecture 11 SRAM (II), Caches. Announcements

EECS150 - Digital Design Lecture 11 SRAM (II), Caches. Announcements EECS15 - Digital Design Lecture 11 SRAM (II), Caches September 29, 211 Elad Alon Electrical Engineering and Computer Sciences University of California, Berkeley http//www-inst.eecs.berkeley.edu/~cs15 Fall

More information

8. Migrating Stratix II Device Resources to HardCopy II Devices

8. Migrating Stratix II Device Resources to HardCopy II Devices 8. Migrating Stratix II Device Resources to HardCopy II Devices H51024-1.3 Introduction Altera HardCopy II devices and Stratix II devices are both manufactured on a 1.2-V, 90-nm process technology and

More information

Design and Implementation of an Application Programming Interface for Volume Rendering

Design and Implementation of an Application Programming Interface for Volume Rendering LITH-ITN-MT-EX--02/06--SE Design and Implementation of an Application Programming Interface for Volume Rendering Examensarbete utfört i Medieteknik vid Linköpings Tekniska Högskola, Campus Norrköping Håkan

More information

Multimedia Decoder Using the Nios II Processor

Multimedia Decoder Using the Nios II Processor Multimedia Decoder Using the Nios II Processor Third Prize Multimedia Decoder Using the Nios II Processor Institution: Participants: Instructor: Indian Institute of Science Mythri Alle, Naresh K. V., Svatantra

More information

Resource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs

Resource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs In Proceedings of the International Conference on Distributed Smart Cameras, Como, Italy, August 2009. Resource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs Hojin

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Low Power Complex Multiplier based FFT Processor

Low Power Complex Multiplier based FFT Processor Low Power Complex Multiplier based FFT Processor V.Sarada, Dr.T.Vigneswaran 2 ECE, SRM University, Chennai,India saradasaran@gmail.com 2 ECE, VIT University, Chennai,India vigneshvlsi@gmail.com Abstract-

More information

International Journal of Innovative and Emerging Research in Engineering. e-issn: p-issn:

International Journal of Innovative and Emerging Research in Engineering. e-issn: p-issn: Available online at www.ijiere.com International Journal of Innovative and Emerging Research in Engineering e-issn: 2394-3343 p-issn: 2394-5494 Design and Implementation of FFT Processor using CORDIC Algorithm

More information

The Next Generation 65-nm FPGA. Steve Douglass, Kees Vissers, Peter Alfke Xilinx August 21, 2006

The Next Generation 65-nm FPGA. Steve Douglass, Kees Vissers, Peter Alfke Xilinx August 21, 2006 The Next Generation 65-nm FPGA Steve Douglass, Kees Vissers, Peter Alfke Xilinx August 21, 2006 Hot Chips, 2006 Structure of the talk 65nm technology going towards 32nm Virtex-5 family Improved I/O Benchmarking

More information

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs High Throughput Energy Efficient Parallel FFT Architecture on FPGAs Ren Chen Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA 989 Email: renchen@usc.edu

More information

Internal Memory. Computer Architecture. Outline. Memory Hierarchy. Semiconductor Memory Types. Copyright 2000 N. AYDIN. All rights reserved.

Internal Memory. Computer Architecture. Outline. Memory Hierarchy. Semiconductor Memory Types. Copyright 2000 N. AYDIN. All rights reserved. Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Internal Memory http://www.yildiz.edu.tr/~naydin 1 2 Outline Semiconductor main memory Random Access Memory

More information

Decimation-in-Frequency (DIF) Radix-2 FFT *

Decimation-in-Frequency (DIF) Radix-2 FFT * OpenStax-CX module: m1018 1 Decimation-in-Frequency (DIF) Radix- FFT * Douglas L. Jones This work is produced by OpenStax-CX and licensed under the Creative Commons Attribution License 1.0 The radix- decimation-in-frequency

More information

Memory Management Algorithms on Distributed Systems. Katie Becker and David Rodgers CS425 April 15, 2005

Memory Management Algorithms on Distributed Systems. Katie Becker and David Rodgers CS425 April 15, 2005 Memory Management Algorithms on Distributed Systems Katie Becker and David Rodgers CS425 April 15, 2005 Table of Contents 1. Introduction 2. Coarse Grained Memory 2.1. Bottlenecks 2.2. Simulations 2.3.

More information

Gedae cwcembedded.com. The CHAMP-AV6 VPX-REDI. Digital Signal Processing Card. Maximizing Performance with Minimal Porting Effort

Gedae cwcembedded.com. The CHAMP-AV6 VPX-REDI. Digital Signal Processing Card. Maximizing Performance with Minimal Porting Effort Technology White Paper The CHAMP-AV6 VPX-REDI Digital Signal Processing Card Maximizing Performance with Minimal Porting Effort Introduction The Curtiss-Wright Controls Embedded Computing CHAMP-AV6 is

More information

SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture

SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture Journal of Signal Processing Systems (2018) 90:1583 1592 https://doi.org/10.1007/s11265-018-1370-y SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture Carl Ingemarsson 1 Oscar Gustafsson

More information

White Paper The Need for a High-Bandwidth Memory Architecture in Programmable Logic Devices

White Paper The Need for a High-Bandwidth Memory Architecture in Programmable Logic Devices Introduction White Paper The Need for a High-Bandwidth Memory Architecture in Programmable Logic Devices One of the challenges faced by engineers designing communications equipment is that memory devices

More information

CENG3420 Lecture 08: Memory Organization

CENG3420 Lecture 08: Memory Organization CENG3420 Lecture 08: Memory Organization Bei Yu byu@cse.cuhk.edu.hk (Latest update: February 22, 2018) Spring 2018 1 / 48 Overview Introduction Random Access Memory (RAM) Interleaving Secondary Memory

More information

EN2911X: Reconfigurable Computing Topic 01: Programmable Logic

EN2911X: Reconfigurable Computing Topic 01: Programmable Logic EN2911X: Reconfigurable Computing Topic 01: Programmable Logic Prof. Sherief Reda School of Engineering, Brown University Fall 2012 1 FPGA architecture Programmable interconnect Programmable logic blocks

More information