
Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete (Master's thesis)

Real-Time Multi-Dimensional Fast Fourier Transforms on FPGAs

Thesis carried out in Computer Engineering at the Institute of Technology at Linköping University
by Andreas Öhlin

LiTH-ISY-EX--15/4854--SE
Linköping 2015

Department of Electrical Engineering, Linköpings universitet, Linköping, Sweden


Real-Time Multi-Dimensional Fast Fourier Transforms on FPGAs

Master's thesis carried out in Computer Engineering at the Institute of Technology in Linköping
by Andreas Öhlin

LiTH-ISY-EX--15/4854--SE

Supervisor: Mario Garrido, ISY, Linköpings universitet
Examiner: Oscar Gustafsson, ISY, Linköpings universitet

Linköping, 12 June, 2015


Division, Department: Division of Computer Engineering, Department of Electrical Engineering, Linköpings universitet, Linköping, Sweden
Report category: Examensarbete (Master's thesis)
ISRN: LiTH-ISY-EX--15/4854--SE
Title: Multidimensionell realtids-FFT på FPGA / Real-Time Multi-Dimensional Fast Fourier Transforms on FPGAs
Author: Andreas Öhlin
Keywords: real-time, multi dimensional, 3D, FFT, SDRAM, DDR2, continuous flow


Abstract

This thesis presents a way of performing multi dimensional FFT in a continuous flow environment by calculating the FFT of each dimension separately in a pipeline. The result is a three dimensional pipelined FFT implemented on a Stratix III FPGA. It can calculate the three dimensional FFT of a data set containing samples with a word size of 32 bits. The biggest challenge and the main part of the work are the data permutations in between the one dimensional FFT modules. This part of the design makes use of an external DDR2 SDRAM as well as on-chip BRAM to store and permute data between the modules. The evaluations show that the design is hardware efficient and that the latency is relatively low, determined to be 84.2 ms.

Sammanfattning

Den här uppsatsen presenterar ett sätt att utföra multidimensionell fouriertransform i en omgivning med kontinuerligt flödande sample genom att beräkna transformen av varje dimension för sig i en pipeline. Resultatet är en tredimensionell pipelinad fouriertransform som är implementerad på en Stratix III FPGA. Denna klarar av att beräkna fouriertransformen av en indatastorlek på sampler som är 32 bitar breda. Den största utmaningen och centrala delen av designen är datapermutation, denna del använder sig av DDR2 SDRAM och inbyggda BRAM för att spara och permutera data mellan de endimensionella transformmodulerna. Utvärderingen visar att designen är hårdvarueffektiv och att fördröjningen är relativt låg och ligger på 84.2 ms.


Contents

1 Introduction
  1.1 Motivation
  1.2 Fast Fourier Transform
    1.2.1 FFT Architectures
  1.3 Multi Dimensional Fast Fourier Transform
    1.3.1 Architectures
  1.4 Challenges in MD FFT Design
    1.4.1 Pipelined or Iterative Architecture?
    1.4.2 Permutations in Multi Dimensional Fast Fourier Transforms
  1.5 Contribution
  1.6 Outline

2 Background
  2.1 SDRAM Architecture
  2.2 Bit Dimension Permutation
    2.2.1 Introduction
    2.2.2 Notation and Inverse Permutation
    2.2.3 Periodicity and Address Mapping
  2.3 Previous Works
    2.3.1 2D FFT for Real Time Applications
    2.3.2 Bandwidth Intense FPGA Architecture for Multi-Dimensional DFT
    2.3.3 High Performance 3D-FFT Implementation
  2.4 Comparison of Available MD FFT Architectures

3 Proposed 2D and 3D FFT Architecture
  3.1 Problem Formulation
    3.1.1 Pipelined or Iterative Architecture?
    3.1.2 Permutations on Memories with Access Limitations
  3.2 Proposed Approach
    Step 1 - External Memory?
    Step 2 - Extract Timings and Parameters
    Step 3 - Schedule Design
    Step 4 - Memory Permutation Design
    Step 5 - Auxiliary Permutation Circuit Design

4 Implementation
  FFT
  Transposition and Bit Reversal on BRAM
  Bit Reversal
  Permutations on SDRAM
    Step 1 - Need of External Memory
    Step 2 - External Memory Constraints and Parameters
    Step 3 - Scheduling
    Step 4 - SDRAM Permutation Design
    Step 5 - Auxiliary Permutation Circuit
  Performance Numbers
    Throughput
    Latency
    Hardware Utilization

5 Comparison

6 Conclusions
  Future Works

Bibliography

List of Figures
  SDRAM Memory Array
  Memory Interleaved Mapping
  Memory Access Pattern
  Data Set Slice Decomposition
  MD FFT Architecture
  CGRA PE Architecture
  Architecture Overview
  Schedule Command Order
  Ideal 3D Rotation
  Memory Permutation with Locked Bits
  Memory Permutation with Locked Bits and Bit Reversal
  General Auxiliary Permutation
  General Auxiliary Permutation with Bit Reversal
  System Overview with Permutations
  FFT Overview
  Block Random Access Memory (BRAM) Permutation Overview
  BRAM Bit Order
  BRAM Permutation Architecture
  Bit Reversal Overview
  Bit Reversal Bit Order
  Bit Reversal Architecture
  SDRAM Overview
  SDRAM Bit Order
  Auxiliary Circuit Bit Order
  SDRAM Schedule
  SDRAM Permutation
  SDRAM Permutation Architecture
  Auxiliary Permutation Circuit Architecture
  Comparison Throughput and External Locations per Sample
  Comparison Throughput per kbit
  Comparison Throughput versus FPGA Slices

List of Tables
  Garrido Performance Numbers
  Yu 2D FFT Performance Numbers
  Yu 3D FFT Performance Numbers Virtex 5
  Yu 3D FFT Performance Numbers Virtex 6
  Nidhi Performance Numbers
  Comparison of MD FFTs in Previous Works
  Schedule Constraints
  Synthesis Results
  Comparison of MD FFTs in Previous Works


Acronyms

BDP: Bit Dimension Permutation
BL: Burst Length
BRAM: Block Random Access Memory
CGRA: Coarse Grain Reconfigurable Architecture
DDR: Double Data Rate
DFT: Discrete Fourier Transform
DIF: Decimation In Frequency
DIT: Decimation In Time
FIFO: First In First Out
FFT: Fast Fourier Transform
1D: One Dimensional
2D: Two Dimensional
3D: Three Dimensional
MD: Multi Dimensional
FPGA: Field Programmable Gate Array
MUX: Multiplexer
PHY: PHYsical layer
RAM: Random Access Memory
SAR: Synthetic Aperture Radar
SDR: Single Data Rate
SDRAM: Synchronous Dynamic Random Access Memory
SO-DIMM: Small Outline Dual In-line Memory Module


Chapter 1
Introduction

This thesis proposes a way of performing data permutations in a Multi Dimensional Fast Fourier Transform suited for calculations in a real-time continuous flow environment. This is done by a statically scheduled SDRAM controller together with address mappings and some additional permutation circuits designed with the aid of bit dimension permutation theory. The result is a three dimensional pipelined FFT capable of performing calculations on data sets of samples with a throughput of one sample per clock cycle. The system is designed for an Altera Stratix III FPGA and uses a DDR2 SDRAM memory for data permutation. In the following, a brief introduction to the main topics is presented, followed by the outline. A more detailed walk-through of the topics and previous works can be found in Chapter 2.

1.1 Motivation

Multi Dimensional FFTs are computational kernels which are widely used in real world applications. The 2D FFT is for example used in Synthetic Aperture Radar (SAR) [18], and the 3D FFT can be utilized in motion detection [14], astrophysics, molecular dynamics, cosmology, reverse tomography, turbulence simulations [17] and SAR 3D reconstruction [10].

At the moment of writing, in-place or iterative 2D and 3D FFTs have been implemented in hardware, but only 2D FFTs exist with pipelined architectures. The goal of this thesis was to find a way to design a pipelined 3D FFT with a constant throughput, using memory permutations that can be performed on external memory and hence support larger input data sets. The motivation for this is to enable the use of the 3D FFT in real-time systems demanding a constant throughput of samples.

1.2 Fast Fourier Transform

The Fourier transform is a way of translating signals from the time domain into the frequency domain and vice versa. The Discrete Fourier Transform (DFT) is a variant of the Fourier transform on a discrete rather than continuous data set. The Fast Fourier Transform (FFT) was invented by Cooley and Tukey in 1965 [4]. It is an algorithm which reduces the amount of computer calculations needed to perform the DFT. Since then, the FFT has been widely used in many applications, and is still being used today. Research on efficient implementations of the algorithm is still carried out.

1.2.1 FFT Architectures

FFT implementations can in general be categorized into two groups: in-place/iterative and pipelined architectures. An iterative architecture reuses one or more processing elements iteratively to calculate the result, while a pipelined architecture uses a series of processing elements to continuously calculate the result from a stream of samples. Both have advantages and disadvantages. The iterative approach is more hardware efficient since it reuses the same hardware for several calculations, but it is not suitable for continuous flow. The pipelined approach, however, is more suited for a continuous flow environment where low latency is important, but the cost of this higher performance is more hardware [8].

1.3 Multi Dimensional Fast Fourier Transform

The Multi Dimensional FFT (MD FFT) is used in various real world applications. Essentially it is an FFT applied on a multi dimensional data set.

1.3.1 Architectures

The design of the MD FFT architecture can also be grouped into pipelined and iterative architectures. The pipelined architectures use dedicated hardware like pipelined 1D FFTs to enable a continuous flow of samples to be calculated [8]. On the contrary, iterative architectures such as [17, 20] are typically designed to be more flexible and use less resources by reusing the available processing elements, at the cost of lower performance, just as for 1D FFTs.

1.4 Challenges in MD FFT Design

When designing an MD FFT for contemporary signal processing applications, the constraints for high throughput and low latency can be very stringent. As the resolutions of different sensors tend to increase, so does the amount of data to calculate. In real time applications this puts tough demands on multi dimensional calculations, since the amount of data can be much larger than for one dimension while the required throughput stays the same.
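To put rough numbers on this, the short sketch below computes the data volume and the time available per data set for a cubic 3D data set streamed at one sample per clock cycle. The side length and clock frequency are example values chosen only for illustration, not figures from this thesis.

```python
# Rough back-of-the-envelope sizing for a streamed 3D data set.
# The parameter values below are illustrative assumptions only.
n = 256                 # side of the cubic data set (assumed example)
word_bits = 32          # bits per sample (word size used later in this thesis)
f_clk = 100e6           # clock frequency in Hz (assumed example)

samples = n ** 3                          # total samples per 3D data set
data_mbytes = samples * word_bits / 8 / 2**20
seconds_per_set = samples / f_clk         # at one sample per clock cycle

print(f"{samples} samples = {data_mbytes:.0f} MiB per data set")
print(f"one data set every {seconds_per_set * 1e3:.1f} ms at {f_clk / 1e6:.0f} MHz")
```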

1.4.1 Pipelined or Iterative Architecture?

The choice of architecture depends on the design goals. If the main goal is high performance, a pipelined architecture should be chosen; if it is versatility or low area cost, an iterative architecture is the better fit. In the field of real time continuous flow processing, the overhead of an iterative architecture can be devastating to performance, and in these cases a pipelined alternative is much preferred.

1.4.2 Permutations in Multi Dimensional Fast Fourier Transforms

An MD FFT can be performed in various ways. The most straightforward is to apply a one dimensional FFT (1D FFT) to each dimension, and after calculating the FFT of one dimension, permute the data in such a way that the order fits the subsequent FFT calculation. This permutation is the main challenge when designing such MD FFTs, since the amount of data grows fast with the size of the FFT if two or more dimensions are involved (n², n³, ...). It requires large and fast memories with high flexibility on which to perform the permutations. This is valid for both pipelined and iterative architectures, since both approaches need to access a lot of data in as little time as possible.

But large memories with low access times are hard to find, and most contemporary MD FFT designs such as [8, 15, 17, 18, 20] use SDRAMs for these operations. SDRAMs are big and fast but have strict limitations on how data can be accessed. You must first activate the partition of the memory, called a row, where the data is stored in order to later read or manipulate it. Also, only a limited number of rows can be activated at the same time [11, 12]. This means that you cannot freely permute data on an SDRAM; you have to place the data inside the memory in such a way that you can perform the required permutations despite the limitations.

Several approaches on how to solve this problem have been proposed, not only in the area of the FFT but also in image processing where similar problems exist. The works related to image processing [3, 13] focus on solving the access problem when transposing square blocks of pixels, which is very common and present in for example JPEG coding. Baozhao et al. map squares onto rows of the memory [3]. Kim et al. go in the same direction but also put constraints on where to put contiguous squares in order to fetch the next square while calculating the present one without activating new rows [13].

For the MD FFT there is one permutation between every pair of consecutive dimensions. Between the first and second dimension a matrix transposition needs to be done, much like the one described above for image processing. The difference is the significantly larger amount of data to transpose.
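Before looking at how previous works realize these permutations in memory, the sketch below gives a software reference for the dimension-by-dimension approach itself: a 3D FFT is computed by applying a 1D FFT along one axis at a time, with an explicit data reordering in between, and the result is checked against a library 3D FFT. It only illustrates the data flow that the hardware pipeline mimics; the data set size is an arbitrary example.

```python
import numpy as np

def fft3d_dimension_by_dimension(x):
    """3D FFT computed as three passes of 1D FFTs with a reordering between passes."""
    for _ in range(3):
        x = np.fft.fft(x, axis=-1)       # 1D FFT along the innermost (streaming) dimension
        x = np.moveaxis(x, -1, 0)        # permute the data so the next dimension becomes innermost
    return x                             # after three passes the original axis order is restored

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 32)) + 1j * rng.standard_normal((8, 16, 32))
assert np.allclose(fft3d_dimension_by_dimension(x), np.fft.fftn(x))
```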

In [1, 15] each two dimensional data set is divided into sub blocks which fit inside a memory row, much like the image processing approach. Dou et al. [5] do the opposite and map rows of data onto blocks in memory. Garrido [8] also goes for the block-onto-a-row solution but digs further into how to address the memory in a good way by using Bit Dimension Permutations (BDP).

Between the second and third dimension of a 3D FFT the amount of data has grown much larger, from n² to n³ when the data set is cubic, because we now need to perform a rotation or reordering of three dimensions instead of two. The solutions found in previous works on how to handle this only apply to iterative architectures; hence, to our knowledge no pipelined 3D FFT has yet been published. Nidhi et al. [17] have a number of processing elements equal to the size of one of the three dimensions. Each processing element then has a local memory which can store a two dimensional slice of the data set. The sides of the slices equal the respective sizes of the two other dimensions. The processing element performs a transposition locally, followed by an exchange of data with the other processing elements via an interconnect network, such that the required data permutation is performed. Yu et al. [20] tackle the problem in a different way. After calculating the first trivial dimension, a number of rows are read which are spread out in one of the other two dimensions. The FFT is then calculated on the column data of these rows. This is repeated until all rows have been calculated. Then blocks of subsequent rows of the same dimension are read and calculated. This is repeated for the third and last dimension, and the approach is also generalizable to more dimensions.

1.5 Contribution

This thesis aims at finding a way of performing the permutation between the second and third dimension for pipelined architectures with a throughput of at least one sample per clock cycle, using an SDRAM and bit dimension permutations. This is done by having a dedicated memory performing the permutation and an auxiliary permutation circuit to permute the parts that cannot be permuted on the SDRAM. This is then used to design a pipelined 3D FFT.

1.6 Outline

This first chapter presents an introduction to the difficulties of designing an MD FFT and the solutions proposed to these problems in previous works. In Chapter 2 we dig deeper into the previous works and explain in more detail what the solutions are about, and also present an introduction to the theory behind bit dimension permutations. After that we present the proposed architecture in Chapter 3, followed by an implementation example in Chapter 4. Finally, we discuss the

result and compare it to previous works in Chapter 5, followed by conclusions and suggestions for future works in Chapter 6.


Chapter 2
Background

In this chapter we dig deeper into the previous works. We present the interesting solutions topic by topic in more detail and review their suitability for performing MD FFT in a continuous flow environment. We begin with a presentation of the SDRAM architecture to ease the understanding of the access restrictions that are problematic when performing permutations, after which we present previous works on bit dimension permutation circuits and theory. This is followed by reviews of an excerpt of the most interesting existing MD FFT architectures regarding 2D and 3D FFT and real time calculations, their hardware cost and their performance in such an environment. To round it all up we present a comparison of the available MD FFT architectures.

2.1 SDRAM Architecture

The SDRAM is a dynamic memory architecture, which means that data is stored inside it as charge. This charge leaks away due to leakage currents, and because of this the data has to be refreshed periodically if it is not to be lost. There exist different standards for the SDRAM architecture and we will focus on the DDR2 standard issued by JEDEC [12]. This standard proposes an architecture where the memory cells are divided into banks, rows and columns. One column is simply one data word, one row is the amount of data that is accessible for read and write operations in each bank, and the number of banks determines how many rows can be open at any instant of time. In Figure 2.1 we can identify the banks, and the rows are inside the sense amplifiers.

Memory accesses to an SDRAM are burst oriented. This means that a read or write access starts at a given address and continues for a given burst length in a predetermined order, e.g. natural order 1, 2, 3 and so on. A memory location has to be activated before it can be accessed. This is done by sending an ACTIVATE command together with the row and bank address of the memory

location. Once the correct row is activated one can address the correct column on which to start the operation. Hence data manipulations can only be done inside or between open rows. To close an open row the PRECHARGE command is issued. This can be done automatically by indicating AUTO PRECHARGE when performing a read or write access [12]. The above mentioned characteristics are specific to DDR2 SDRAMs, but similarities can be found in other SDRAM architectures.

Figure 2.1. An example of a DDR2 SDRAM memory array (banks 0-3 with sense amplifiers, row decoder, column decoder and data bus).

In Figure 2.1 we see a simplified sketch of the memory array of a DDR2 SDRAM memory. It consists of four memory banks, Bank 0 - Bank 3. Each bank has its own sense amplifier which acts as a row buffer and enables a total of one open row per bank. The row to be activated is selected by the Row Decoder based on the Row Address. All outer data communication passes through the Data Bus, and the Column Decoder selects which data words are affected by the current operation based on the Column Address.

The latency of a DDR2 SDRAM is usually presented as the CAS latency, tRCD, tRP and tRAS. The CAS latency is the time it takes from issuing a column command until the SDRAM starts processing it. tRCD is the time from activating a row until the first column command can be issued. tRP is the bank precharge time, which indicates how long you have to wait before activating a new row after precharging the previous row in the same bank. tRAS is the time you have to wait after activating a row in a bank until you can precharge that same row. tRC is the row command cycle time, which indicates the minimum time between two row activations in the same bank [12]. Together with the allowed clock frequency these values give an indication of how fast the memory is.
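To make the impact of these parameters concrete, the sketch below models, in clock cycles, the approximate cost of reading a block of data as a function of how many row activations it requires. The timing values are placeholder examples in the spirit of a DDR2 datasheet rather than figures for any specific device, and refresh as well as command overlap are ignored.

```python
# Rough, simplified cost model for SDRAM reads (illustrative assumptions only).
# All times are in memory clock cycles; the values are made-up examples.
T_RCD = 4   # ACTIVATE to first column command
T_RP  = 4   # PRECHARGE to next ACTIVATE in the same bank
CAS   = 4   # column command to first data
BL    = 8   # burst length in data words (DDR transfers BL words in BL/2 cycles)

def read_cost(words, row_activations):
    """Approximate cycles to read 'words' data words spread over 'row_activations' rows."""
    bursts = -(-words // BL)                        # ceiling division: number of read bursts
    row_overhead = row_activations * (T_RP + T_RCD)
    return row_overhead + CAS + bursts * (BL // 2)

# Reading 4096 words from a few rows vs. scattered over many rows:
print(read_cost(4096, row_activations=4))    # mostly burst transfer time
print(read_cost(4096, row_activations=512))  # dominated by row activation overhead
```

The point of the model is only that the burst transfer time is fixed by the amount of data, whereas the row activation overhead depends entirely on how the data is placed, which is what the permutation design in this thesis tries to control.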

2.2 Bit Dimension Permutation

Bit Dimension Permutation (BDP) is a way of performing any data permutation by interchanging or inverting bits in the sample indexes [6, 8, 9, 16]. This is convenient since it is easy to control such permutations by using the bits of a binary counter, which allows for hardware efficient permutation circuits with low complexity. First we will go through the basics and see how it works, and later on we will present some examples whose purpose is to ease the understanding.

2.2.1 Introduction

Let us say that we have a series of samples and give each of them an individual binary index. The number of bits (the number of bit dimensions) needed to represent all the different indexes is then n = log2(N), where N is the number of samples. If we think in bit dimensions (binary), every dimension can take the value 1 or 0. This is the same as one bit in our sample index. You can move groups of data by interchanging index bits with one another or inverting them; by doing so you are reordering the indexed samples. Every possible reordering, also called permutation, can be broken down into these simple manipulations of bits. It may be necessary to do a series of them depending on the complexity of the permutation to perform, but the method remains the same. More theory on how this works in practice and the different kinds of permutations that are achieved by them can be found in [8]. The movements of index bits can, in the case of a real-time system, be seen as different delays applied on samples arriving serially at the input. The method of applying delays to samples is proposed in [9] in order to make optimal bit reversal circuits for serial data.

2.2.2 Notation and Inverse Permutation

We annotate a BDP as a function σ of the different index bits u_x in the following manner:

σ(u_{n-1}, ..., u_1, u_0) = u_0, u_1, ..., u_{n-1}    (2.1)

The example above is a bit reversal of n index bits. This can be seen by noting that the order is completely reversed, i.e. u_{n-1} has switched place with u_0, u_{n-2} with u_1 and so on. This is of course not the only permutation that can be formulated in this way, but it is a very important one in the area of the FFT, since the result is bit reversed if the input data is in natural order and vice versa [8].
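As a small illustration of what such an index manipulation does to a block of data, the sketch below applies a bit dimension permutation, given as a list that states which original index bit ends up in each output bit position, to every sample index of a vector; with the bit reversal σ above it reproduces the familiar bit-reversed ordering. The list convention is an assumption made for this sketch, which is only a software model of the reordering, not of any particular circuit in this thesis.

```python
def apply_bdp(data, sigma):
    """Reorder 'data' by permuting the bits of each sample index.

    sigma[k] gives the original index bit that is placed at bit position k
    of the new index, so sigma = [n-1, ..., 1, 0] is a bit reversal.
    """
    n = len(sigma)
    assert len(data) == 1 << n
    out = [None] * len(data)
    for i, sample in enumerate(data):
        j = 0
        for k, src in enumerate(sigma):
            j |= ((i >> src) & 1) << k     # move bit 'src' of i to bit 'k' of j
        out[j] = sample
    return out

n = 3
bit_reversal = list(reversed(range(n)))          # sigma(u2, u1, u0) = u0, u1, u2
print(apply_bdp(list(range(8)), bit_reversal))   # [0, 4, 2, 6, 1, 5, 3, 7]
```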

In the case of a dedicated BDP circuit, the way of mapping the bits of a counter onto the required control signals is presented in [9]. If we are to perform the permutation on a memory we will instead use different address mappings as our way of permuting the data. These mappings describe how each counter bit is mapped to the memory address bits. The mappings will have to change periodically if we would like to avoid double buffering. In double buffering you use two buffers instead of one and write to one buffer while you read from the other; once you have emptied the reading buffer and filled up the writing buffer you switch, so that you write into the buffer you previously read from and vice versa. This can be avoided by always first reading from and then writing to the same location in memory. For now we permit ourselves to use double buffers and can therefore manage with just designing one mapping for each permutation, assuming that we always write samples in natural order. In this way we can take a closer look at how to map the counter bits in order to perform a given permutation. Later, in the following section on periodicity and address mapping, we will treat the case when we do not allow double buffering.

Once we have determined a permutation to perform and have formulated it as a σ-function as in equation 2.1, the way of mapping the counter bits onto the memory address is identical to the inverse permutation σ⁻¹ as defined in [8] and presented in equation 2.2:

σ(u) = u'  ⟺  σ⁻¹(u') = u    (2.2)

If the inverse permutation is applied to the result of the permutation, the result of them both will be the identity function Id, which is defined as Id(u) = u [8]; hence it would be the same as not permuting at all. This is true for all permutations. When you read and write to the same location, the difference is that you in most cases will have to define several permutations that are applied one after another, instead of just one, in order to make it work.

2.2.3 Periodicity and Address Mapping

The periodicity of a BDP is defined as the number of times you have to apply the permutation until you reach the identity function. This can be formulated as σ^k = Id, where k is the periodicity [8]. In the case of bit reversal the periodicity is equal to two; hence it is also its own inverse permutation. This can be proven by reversing the bits twice and verifying that the bit order is again natural.

Periodicity is important when we want to avoid double buffering when permuting on a memory, because it defines how many address mappings we have to use in order to make it work, and hence it also gives a hint about the complexity of the hardware. The number of address mappings will always be at least the periodicity.

Let us take again the bit reversal as our first example. If the data set is written in natural order, we will have to read it out in bit reversed order to perform the permutation without additional hardware. If our permutation is to handle continuous flow without double buffering, then we will receive the next data set while we output the previous one. Hence we will write the next data set using the bit reversed mapping at the same time as we output the first, and later read it out using the natural ordered mapping again while receiving the third data set. If we had a permutation with higher periodicity, the number of mappings would increase, and hence also the period of circulating through them and applying them to the data sets. Each data set will, though, only be affected by two address mappings, one for writing and one for reading, and the resulting permutation of these two mappings should in all cases be the permutation which we were aiming to perform. This is the more difficult part when performing more complicated permutations, such as the Three Dimensional (3D) rotation performed on an SDRAM memory with access limitations, which reorders all three dimensions of the 3D data set in a 3D FFT.
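The short simulation below illustrates this alternation for the simple bit reversal case: a single memory the size of one data set is read and written at the same address each cycle, and the address sequence alternates between the natural and the bit reversed mapping for every other data set. Each data set still comes out bit reversed, with no double buffering. The code is a behavioural sketch of the scheme described above, not of the actual hardware.

```python
def bit_reverse(i, n):
    """Reverse the n least significant bits of i."""
    return int(format(i, f"0{n}b")[::-1], 2)

def stream_bit_reversal(datasets, n):
    """Continuously bit-reverse a stream of data sets using one shared memory."""
    size = 1 << n
    mem = [None] * size
    mappings = [lambda c: c, lambda c: bit_reverse(c, n)]  # periodicity 2 -> two mappings
    outputs = []
    for s, ds in enumerate(datasets):
        out = []
        addr_map = mappings[s % 2]         # mapping used while this data set is written
        for c in range(size):
            a = addr_map(c)
            out.append(mem[a])             # read the previous data set from this address...
            mem[a] = ds[c]                 # ...then write the new sample to the same address
        outputs.append(out)
    return outputs

n = 3
sets = [[f"A{i}" for i in range(8)], [f"B{i}" for i in range(8)], [f"C{i}" for i in range(8)]]
outs = stream_bit_reversal(sets, n)
print(outs[1])  # data set A read out while B is written: ['A0', 'A4', 'A2', 'A6', ...]
print(outs[2])  # data set B read out while C is written, also in bit reversed order
```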

2.3 Previous Works

This section presents in more detail the existing designs of MD FFTs which are most relevant to the topic. It is explained how they work, together with their performance and hardware cost.

2.3.1 2D FFT for Real Time Applications

In [8] a pipelined 2D FFT design approach and example are presented, tailored for real time continuous flow applications. The design can be used with any ordinary 1D FFT. The idea is to put two 1D FFTs in series, together with a transposition in between them to reorder data, so that if you feed the first FFT with data from the 2D input data set row-wise, then the second FFT will receive it column-wise. It can be shown that this is a valid approach by splitting up the two sums of the equation for the 2D DFT into an outer sum for one dimension and an inner sum for the other. Hence, one of the sums can be calculated on top of the result of the other, and that is why each dimension can have its own 1D FFT and the result can be calculated one dimension at a time.

Functional Description

Since ordinary 1D FFTs are used, the main challenge is to design the permutation circuit that performs the transposition between them. This permutation circuit performs bit dimension permutation and is based on a counter and a memory.

The counter is used to count the incoming samples, which arrive one sample per clock cycle. The design treats four complications together with any combination of them: the system may receive the first sample of the next data set the clock cycle after the last sample of the previous data set, the access to the memory might be limited, there might be a need for using several memories in parallel in order to achieve the desired throughput of one sample per clock cycle and, fourthly, the data set to calculate does not have to be square but may be non-square.

The transposition to be performed can be formulated as the BDP presented in equation 2.3, where n is the total number of bit dimensions in the index and j is the number of column index bits. The special case of a square data set appears if j is replaced by n/2.

σ(u_{n-1}, ..., u_j, u_{j-1}, ..., u_0) = u_{j-1}, ..., u_0, u_{n-1}, ..., u_j    (2.3)

where u_{n-1}, ..., u_j are the row index bits and u_{j-1}, ..., u_0 are the column index bits.

As can be realized by applying the permutation twice to the special case of a square data set, its periodicity is two. Hence, a transposition of a square matrix is its own inverse and the address mapping is the same as the permutation. This is, though, only valid for the cases when a memory without access limitations is used. If the data set is non-square, the periodicity is no longer two, and in that case more than one address mapping is necessary in order to design a functional system.

If an SDRAM is used for performing the permutation, and it is only possible to access one row at a time (the bank address bits can be interpreted as column address bits, since they can be accessed without limitation), the solution is to map maximum sized squares of data onto rows of the memory. Note that the data set does not have to be square, only divisible into smaller squares. The squares' sides are all equal to L, and the number of index bits needed to index one side of a square is λ = log2(L). This requires that L, as well as the number of rows and columns of the data set, are powers of two. If the number of columns in each row of the memory is M_C, it implies that L² ≤ M_C. Hence, the first λ row and column bits are mapped to column bits of the memory to ensure access both row- and column-wise inside the square and to equalize the row changing overhead, so that it is the same for each address mapping. The bigger the squares can be, the lower the overhead; therefore it is advised to maximize the value of L.

If the data set is square, the permutation remains its own inverse and hence the periodicity is still two. If, on the other hand, the data set is non-square and hence the periodicity is not two but a number k, then L is further limited since more address bits have to be mapped to columns in order to realize more address mappings. Instead of the previous limit L² ≤ M_C it is now limited by L^k ≤ M_C due to the periodicity. In practice this means that the SDRAM overhead for calculating non-square data sets should be larger than that of square data sets of comparable sizes.

Furthermore, if the permutation is not possible to perform in its entirety inside the memory, a small auxiliary permutation circuit is used. Together they enable the permutation to be performed on a constant flow of data sets.
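The following sketch checks equation 2.3 in software for a small non-square example: permuting the bits of the linear sample index by swapping the row and column bit groups gives exactly the index order of the transposed matrix. The index bit convention (row bits above column bits, row-major storage) is an assumption made for this illustration.

```python
import numpy as np

def transpose_bdp(i, n, j):
    """Transposition BDP of eq. 2.3 applied to an n-bit linear index i.

    The index is taken as (row bits | column bits) with j column bits; the
    permutation swaps the two groups, giving (column bits | row bits).
    """
    cols = i & ((1 << j) - 1)        # u_{j-1}, ..., u_0 : column index
    rows = i >> j                    # u_{n-1}, ..., u_j : row index
    return (cols << (n - j)) | rows

n, j = 5, 2                          # 8 x 4 data set: 3 row bits, 2 column bits
R, C = 1 << (n - j), 1 << j
x = np.arange(R * C)                 # samples stored row-major; sample value = its index
A = x.reshape(R, C)

out = np.empty_like(x)
for i in range(R * C):
    out[transpose_bdp(i, n, j)] = x[i]   # sample with index i lands at the permuted address

assert np.array_equal(out.reshape(C, R), A.T)   # the BDP realizes the matrix transposition
```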

In the last scenario, when several memories are used, the squares are mapped onto rows in the different memories in an interleaved fashion as depicted in Figure 2.2. Here C is the number of memories, N is the number of samples in each row and column of the data set and L is the side of each square. As can be seen, this data set is square, but it does not necessarily have to be so.

Figure 2.2. An example of how to map squares of data onto different memories in an interleaved fashion. This scheme is made for reading and writing to all, in this example four, memories simultaneously.

Table 2.1. The performance numbers of the 2D FFT for real time applications proposed in [8] (columns: Size, Slices, BRAMs, Clk Freq. (MHz), Latency (ms)).

Hardware Cost and Performance

As this design is a pipelined 2D FFT, it is designed to maintain a constant throughput. In this case it is set to one sample per clock cycle (1 Sample/clk). It is

also parameterized so it can easily be reconfigured to different data set sizes, including rectangular ones. In Table 2.1 we can see examples of hardware usage and achievable clock frequency for some given 2D FFT sizes. The sample word length is 2 × 16 bits. The design is implemented in a Virtex 5 FPGA and also uses 4 Micron MT46V32M16 SDRAM chips with a total memory capacity of 256 MB. The reads and writes to and from these external SDRAMs are only performed once per sample, which means that in total two memory accesses per sample are necessary. The throughput in MSamples/s is equal to the clock frequency column of Table 2.1.

2.3.2 Bandwidth Intense FPGA Architecture for Multi-Dimensional DFT

This design, which is proposed in [19], belongs to the iterative family of FFT architectures and performs 2D and 3D FFT by reading data from SDRAM and loading it onto two local memories that function as ping pong buffers. It performs a row FFT on spread rows, followed by a so called column stride FFT, followed by twiddle multiplications on the columns of those rows. The next step is reading whole or parts of contiguous rows and performing a so called column local FFT and a column-wise permutation on these rows; once this is done for all rows of the data set the 2D FFT has been calculated. To calculate a 3D FFT the 2D FFT procedure is carried out for all 2D slices of the 3D data set, after which stride FFT, twiddle multiplication, local FFT and permutation are performed in the third dimension, much like the column operations of the 2D FFT.

Functional Description

The size of each dimension is N_d, where d is the index of the dimension. The size of each local memory is denoted S. In the first step, focus is on row operations. The number of rows to be read for each iteration of this step is m, and the spacing between each such row is p; the number of rows is then N_2 = m·p. The number of iterations of the step is p. During one step, the row-wise FFTs are calculated for all rows; after this, a column-wise FFT of size m is applied on the columns of the rows, followed by twiddle multiplications. When this is done the result is written back to memory. In the second step, L contiguous rows with B elements per row are read from memory and column-wise FFTs of size p are performed on these rows, followed by a column-wise permutation. Once this step is finished the 2D FFT calculation is complete. The memory accesses are illustrated in Figure 2.3, where each row of the data set is mapped to a row in the memory.
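The stride FFT / twiddle / local FFT sequence follows the general pattern of the classic four-step (Cooley-Tukey) decomposition, in which an FFT of length m·p is split into FFTs of length m, a twiddle factor multiplication, and FFTs of length p. The sketch below shows that decomposition in numpy for a single 1D transform; it is meant only as a reference for the mathematics, not as a description of the memory layout or scheduling used in [19].

```python
import numpy as np

def four_step_fft(x, m, p):
    """FFT of length N = m*p via m-point 'stride' FFTs, twiddles and p-point 'local' FFTs."""
    N = m * p
    A = x.reshape(m, p)                     # A[j1, j2] = x[j1*p + j2]
    B = np.fft.fft(A, axis=0)               # m-point FFTs down the columns (elements p apart)
    k1 = np.arange(m).reshape(m, 1)
    j2 = np.arange(p).reshape(1, p)
    B *= np.exp(-2j * np.pi * k1 * j2 / N)  # twiddle factors W_N^(k1*j2)
    D = np.fft.fft(B, axis=1)               # p-point FFTs along the rows (local FFTs)
    return D.T.reshape(-1)                  # reorder so that X[k1 + m*k2] = D[k1, k2]

x = np.random.default_rng(1).standard_normal(1024) + 0j
assert np.allclose(four_step_fft(x, m=32, p=32), np.fft.fft(x))
```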

Figure 2.3. The access patterns of step one and two in the procedure of calculating the 2D FFT (a) Row DFT, b) Column Stride DFT, c) Column Local DFT). Based on figure in [19].

Figure 2.4. The decomposition of a 3D data set into 2D slices (a) 2D DFT on each slice, b) DFT along d3 on each slice). Based on figure in [19].

If a 3D FFT is to be calculated, this process is repeated for every slice of the 3D data set spanned by the first and second dimensions. The slice decomposition can be seen in Figure 2.4. The architecture used to perform these calculations is depicted in Figure 2.5.

Hardware Cost and Performance

This design has also been implemented on a Virtex 5 FPGA. The number of occupied slices is 8,273 and it uses 68 DSP48E blocks in addition to 87 BRAMs for the example case in which the maximum FFT length is 2048 samples. The clock frequency is constant at 100 MHz. The achieved latency and throughput of the 2D and 3D FFT can be seen in Table 2.2 and Table 2.3, respectively. The sample word length is 2 × 32 bits.

Figure 2.5. The proposed architecture used for calculating the 2D and 3D FFT (SDRAM controller, local memories used as ping pong buffers, switches, 1D FFTs and a twiddle factor ROM inside the FPGA). Based on figure in [19].

Table 2.2. The performance numbers of the 2D FFT proposed in [19] (columns: Size, Latency (ms), Throughput (MSamples/s)).

Table 2.3. The performance numbers of the 3D FFT on Virtex 5 proposed in [19] (columns: Size, Latency (ms), Throughput (MSamples/s)).

This design has also been simulated on a Virtex 6 FPGA in [20], so no hardware usage has been reported, but the acquired performance for the 3D FFT is presented in Table 2.4. For all cases, this architecture uses an external memory of at least twice the size of the data set.

Table 2.4. The performance numbers of the 3D FFT on Virtex 6 proposed in [20] (columns: Size, Latency (ms), Throughput (MSamples/s)).

2.3.3 High Performance 3D-FFT Implementation

A way of performing the 3D FFT on a Coarse Grain Reconfigurable Architecture (CGRA) implemented inside a Field Programmable Gate Array (FPGA) is proposed in [17]. It is an iterative architecture based on a network of processing elements, each consisting of a processor capable of performing butterfly operations, an instruction memory and a data memory. The permutations of data are performed by reconfiguring the interconnection network that connects each processing element to its neighbors.

Figure 2.6. The architecture of a processing element in the CGRA (sequencer, instruction memory, data memory and a DSP48E slice).

Functional Description

The system consists of a set of processing elements, each capable of performing butterfly operations, which consist of complex rotation, addition and subtraction. The architecture of a processing element can be found in Figure 2.6. Its brain is a sequencer which takes instructions from the instruction memory, which it can also manipulate, and initiates operations in the DSP48E slice. This is a hardware macro available in the

Virtex 5 FPGA family, which is used for this design. The data memory provides two read ports and one write port, enabling two operands to be read simultaneously by the DSP slice while allowing write-back of the result. The data memory is also accessible from a neighbor via the reconfigurable interconnection network. The processing element is also capable of accessing another neighbor, such that the processing elements form a semi-systolic array with reconfigurable chains for data movement. When the amount of data is too large to fit inside the data memories of the processing elements, a part of it is stored in external SDRAM. This implies that a part of the 3D FFT is performed on the available data inside the processing elements, then this data is written back to memory and new data is loaded into the data memories. This is iterated until all computations have been performed.

Each processing element has four possible neighbor connection points, whereof only two can be active at a time, one for input and one for output. Because of this, data has to be placed, and the interconnection network configured, in such a way that each processing element can access the required data to perform its operation at any given time. The details of the data movements are a bit unclear, but the results show that a very large number of reconfigurations are needed when the size of the 3D data set approaches 256³, indicating a large overhead when only a small amount of the data set can fit inside the memories of the processing elements.

Hardware Cost and Performance

This design is implemented on a Virtex 5 FPGA and uses 1728 slices for the 32³-sized 3D FFT, in addition to 64 DSP48E blocks and 148 BRAMs. The clock frequency is reported to be 300 MHz for the parts that perform the 3D FFT calculations. The performance numbers for this architecture can be found in Table 2.5. The word length of this design is 48 bits.

Table 2.5. The performance numbers of the 3D FFT in [17] (columns: Size, Latency (ms), Throughput (MSamples/s); sizes from 4x4x4 up to 256x256x256).

It is unclear exactly how many times data has to be written to and read from memory. For the cases where all data can be accommodated in the local memories of the processing elements, we assume that only two accesses, one for read and one for write, are necessary. For the other cases we do not know how many times this has

to be performed, but the decreasing performance with increasing data set size indicates that the number of accesses increases significantly when the data set can no longer fit in the local memory of the processing elements. The size of the external memory is also unclear, but it has to be at least the same size as the data set.

2.4 Comparison of Available MD FFT Architectures

In general it has been difficult to extract all information from the different architectures regarding hardware usage, and more detailed information on how each algorithm works, in order to determine the number of memory accesses needed for a given MD FFT size. Anyhow, the information we have been able to extract is gathered in Table 2.6. Here we can see that Garrido's Virtex 5 2D FFT and Yu's Virtex 6 3D FFT architectures provide significantly higher throughputs than the competition. It should be noted that the hardware usage of Yu's implementation, although an iterative architecture, probably exceeds that of Garrido based on the information we have from the 2D FFT.

Table 2.6. A comparison of the MD FFTs presented in the previous works treated in this chapter (columns: Type (2D/3D FFT), Author, FPGA Family, Size, Memory (internal Kbits / external MBytes), External Locations per Sample, DSP blocks / Slices, Throughput (Mbits/s); the rows cover the Garrido, Yu and Nidhi designs on Virtex 5 and Virtex 6 devices).

Chapter 3
Proposed 2D and 3D FFT Architecture

If we limit ourselves to the solution which calculates the multi dimensional FFT one dimension at a time, we can see that it presents some general difficulties, which will be described in the problem formulation. Once we have gone through the details of the complications, we present the proposed approach to design a general 2D and 3D FFT architecture. Throughout this chapter we make the assumption that the requirements are those of a real-time continuous flow application providing and requiring one sample per clock cycle. The 3D data sets arrive directly one after another, providing a constant stream of input samples.

3.1 Problem Formulation

In a real-time environment, tough constraints on both latency and throughput are common. The typical case is that we are located inside a processing chain, being provided data each clock cycle and required to deliver results at the same rate, within the latency requirement. One example of a hard real-time application could be post processing of a video stream that has to be seen in real-time for crucial decision making. If the latency were too high, the decision maker would perhaps not be able to make the correct decision in time, with potentially catastrophic consequences. Another example could be real-time medical body scanning: in order to provide the doctors with the necessary information in time and to enable fast updates of the image, a low latency is required, and the higher the throughput, the shorter the scan will take, which increases the throughput of patients through the body scanner.

3.1.1 Pipelined or Iterative Architecture?

Given these situations we can see that there is a need for fast computation of both the 2D and 3D FFT in real-time. When placed inside a continuous flow chain, the iterative architectures fall somewhat short. The reason for this is the repeated accesses to a memory for loading and storing intermediate results. Say for example that for the 3D FFT we have to read and write the data set three or four times; then the bandwidth to the memory has to be three or four times that of the processing chain and includes a mandatory clock domain crossing (CDC) or the use of several external memories. From experience, CDCs are known to be tedious to debug if they are not designed in a good manner. Hence, we can conclude that iterative architectures are not well suited but still possible to use in real-time applications, at the cost of very fast or several memories and advanced circuitry.

Pipelined architectures, on the other hand, are inherently adapted to a continuous flow of samples. This enables the 1D FFTs to be calculated without memory accesses, but at the cost of more hardware and mandatory permutations between the dimensions. These permutations require memories, but only read and write data once, so the memory bandwidth can remain that of the processing chain. For small data sets these permutations can be performed in on-chip memories and hence do not complicate things beyond the hardware usage. On the other hand, for large data sets an external memory typically has to be used, and this imposes constraints on the permutations if the memory/memories have access limitations.

3.1.2 Permutations on Memories with Access Limitations

When performing a 3D FFT, a part of it is a 2D FFT. In fact you are calculating the 2D FFT of each slice spanned by two of the three dimensions, followed by FFT calculation of the third dimension. Therefore, the first part of a pipelined 3D FFT is exactly the same as a pipelined 2D FFT as presented in [8], namely a 1D FFT followed by a transposition unit and another 1D FFT. The permutation performing the transposition is therefore present in both 2D and 3D FFTs. The 3D FFT also has another permutation to be performed between the second and third 1D FFT blocks; this permutation has to reorder the whole 3D data set such that data is delivered in the third dimension to the third 1D FFT. If the amount of data is sufficiently large, one or both of these permutations have to be performed on external memory. Typically, the memory architecture used is SDRAM, chosen for its properties of providing both large memory size and high speed. But SDRAMs are dynamic memories which need to be refreshed, and you can only access a limited, activated amount of memory at a time. This presents some difficulties when performing real-time permutations as in the pipelined 2D and 3D FFT.

Firstly, SDRAMs are burst oriented. This means that you have to access a series of samples at a time. In terms of permutations this means that the lowest address


An introduction to SDRAM and memory controllers. 5kk73 An introduction to SDRAM and memory controllers 5kk73 Presentation Outline (part 1) Introduction to SDRAM Basic SDRAM operation Memory efficiency SDRAM controller architecture Conclusions Followed by part

More information

ISSN Vol.05, Issue.12, December-2017, Pages:

ISSN Vol.05, Issue.12, December-2017, Pages: ISSN 2322-0929 Vol.05, Issue.12, December-2017, Pages:1174-1178 www.ijvdcs.org Design of High Speed DDR3 SDRAM Controller NETHAGANI KAMALAKAR 1, G. RAMESH 2 1 PG Scholar, Khammam Institute of Technology

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete Machine Learning for detection of barcodes and OCR Examensarbete utfört i Datorseende vid Tekniska högskolan vid Linköpings

More information

Memory Supplement for Section 3.6 of the textbook

Memory Supplement for Section 3.6 of the textbook The most basic -bit memory is the SR-latch with consists of two cross-coupled NOR gates. R Recall the NOR gate truth table: A S B (A + B) The S stands for Set to remember, and the R for Reset to remember.

More information

A Low Power DDR SDRAM Controller Design P.Anup, R.Ramana Reddy

A Low Power DDR SDRAM Controller Design P.Anup, R.Ramana Reddy A Low Power DDR SDRAM Controller Design P.Anup, R.Ramana Reddy Abstract This paper work leads to a working implementation of a Low Power DDR SDRAM Controller that is meant to be used as a reference for

More information

Programmable Memory Blocks Supporting Content-Addressable Memory

Programmable Memory Blocks Supporting Content-Addressable Memory Programmable Memory Blocks Supporting Content-Addressable Memory Frank Heile, Andrew Leaver, Kerry Veenstra Altera 0 Innovation Dr San Jose, CA 95 USA (408) 544-7000 {frank, aleaver, kerry}@altera.com

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University DRAMs Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Main Memory & Caches Use DRAMs for main memory Fixed width (e.g., 1 word) Connected by fixed-width

More information

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO 2402 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 6, JUNE 2016 A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO Antony Xavier Glittas,

More information

Research Article Design of A Novel 8-point Modified R2MDC with Pipelined Technique for High Speed OFDM Applications

Research Article Design of A Novel 8-point Modified R2MDC with Pipelined Technique for High Speed OFDM Applications Research Journal of Applied Sciences, Engineering and Technology 7(23): 5021-5025, 2014 DOI:10.19026/rjaset.7.895 ISSN: 2040-7459; e-issn: 2040-7467 2014 Maxwell Scientific Publication Corp. Submitted:

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete EtherCAT Communication on FPGA Based Sensor System Examensarbete utfört i Elektroniksystem vid Tekniska Högskolan, Linköpings

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete 3D Position Estimation of a Person of Interest in Multiple Video Sequences: Person of Interest Recognition Examensarbete

More information

Evaluation of instruction prefetch methods for Coresonic DSP processor

Evaluation of instruction prefetch methods for Coresonic DSP processor Master of Science Thesis in Electrical Engineering Department of Electrical Engineering, Linköping University, 2016 Evaluation of instruction prefetch methods for Coresonic DSP processor Tobias Lind Department

More information

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory CS65 Computer Architecture Lecture 9 Memory Hierarchy - Main Memory Andrew Sohn Computer Science Department New Jersey Institute of Technology Lecture 9: Main Memory 9-/ /6/ A. Sohn Memory Cycle Time 5

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete Low Cost Floating-Point Extensions to a Fixed-Point SIMD Datapath Examensarbete utfört i Datorteknik vid Tekniska högskolan

More information

EEM 486: Computer Architecture. Lecture 9. Memory

EEM 486: Computer Architecture. Lecture 9. Memory EEM 486: Computer Architecture Lecture 9 Memory The Big Picture Designing a Multiple Clock Cycle Datapath Processor Control Memory Input Datapath Output The following slides belong to Prof. Onur Mutlu

More information

Keywords: Fast Fourier Transforms (FFT), Multipath Delay Commutator (MDC), Pipelined Architecture, Radix-2 k, VLSI.

Keywords: Fast Fourier Transforms (FFT), Multipath Delay Commutator (MDC), Pipelined Architecture, Radix-2 k, VLSI. ww.semargroup.org www.ijvdcs.org ISSN 2322-0929 Vol.02, Issue.05, August-2014, Pages:0294-0298 Radix-2 k Feed Forward FFT Architectures K.KIRAN KUMAR 1, M.MADHU BABU 2 1 PG Scholar, Dept of VLSI & ES,

More information

TECHNOLOGY BRIEF. Double Data Rate SDRAM: Fast Performance at an Economical Price EXECUTIVE SUMMARY C ONTENTS

TECHNOLOGY BRIEF. Double Data Rate SDRAM: Fast Performance at an Economical Price EXECUTIVE SUMMARY C ONTENTS TECHNOLOGY BRIEF June 2002 Compaq Computer Corporation Prepared by ISS Technology Communications C ONTENTS Executive Summary 1 Notice 2 Introduction 3 SDRAM Operation 3 How CAS Latency Affects System Performance

More information

DESIGN AND IMPLEMENTATION OF SDR SDRAM CONTROLLER IN VHDL. Shruti Hathwalia* 1, Meenakshi Yadav 2

DESIGN AND IMPLEMENTATION OF SDR SDRAM CONTROLLER IN VHDL. Shruti Hathwalia* 1, Meenakshi Yadav 2 ISSN 2277-2685 IJESR/November 2014/ Vol-4/Issue-11/799-807 Shruti Hathwalia et al./ International Journal of Engineering & Science Research DESIGN AND IMPLEMENTATION OF SDR SDRAM CONTROLLER IN VHDL ABSTRACT

More information

Chapter 8 Memory Basics

Chapter 8 Memory Basics Logic and Computer Design Fundamentals Chapter 8 Memory Basics Charles Kime & Thomas Kaminski 2008 Pearson Education, Inc. (Hyperlinks are active in View Show mode) Overview Memory definitions Random Access

More information

A Universal Test Pattern Generator for DDR SDRAM *

A Universal Test Pattern Generator for DDR SDRAM * A Universal Test Pattern Generator for DDR SDRAM * Wei-Lun Wang ( ) Department of Electronic Engineering Cheng Shiu Institute of Technology Kaohsiung, Taiwan, R.O.C. wlwang@cc.csit.edu.tw used to detect

More information

EECS150 - Digital Design Lecture 16 - Memory

EECS150 - Digital Design Lecture 16 - Memory EECS150 - Digital Design Lecture 16 - Memory October 17, 2002 John Wawrzynek Fall 2002 EECS150 - Lec16-mem1 Page 1 Memory Basics Uses: data & program storage general purpose registers buffering table lookups

More information

A Configurable High-Throughput Linear Sorter System

A Configurable High-Throughput Linear Sorter System A Configurable High-Throughput Linear Sorter System Jorge Ortiz Information and Telecommunication Technology Center 2335 Irving Hill Road Lawrence, KS jorgeo@ku.edu David Andrews Computer Science and Computer

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs

An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs HPEC 2004 Abstract Submission Dillon Engineering, Inc. www.dilloneng.com An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs Tom Dillon Dillon Engineering, Inc. This presentation outlines

More information

Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc.

Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc. Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc. CT Image Reconstruction Herman Head Sinogram Herman Head Reconstruction CT Image Reconstruction for all

More information

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Yojana Jadhav 1, A.P. Hatkar 2 PG Student [VLSI & Embedded system], Dept. of ECE, S.V.I.T Engineering College, Chincholi,

More information

24K FFT for 3GPP LTE RACH Detection

24K FFT for 3GPP LTE RACH Detection 24K FFT for GPP LTE RACH Detection ovember 2008, version 1.0 Application ote 515 Introduction In GPP Long Term Evolution (LTE), the user equipment (UE) transmits a random access channel (RACH) on the uplink

More information

Parallel FIR Filters. Chapter 5

Parallel FIR Filters. Chapter 5 Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture

More information

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp. 13 1 CMPE110 Computer Architecture, Winter 2009 Andrea Di Blas 110 Winter 2009 CMPE Cache Direct-mapped cache Reads and writes Cache associativity Cache and performance Textbook Edition: 7.1 to 7.3 Third

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Design and Verification of High Speed SDRAM Controller with Adaptive Bank Management and Command Pipeline

Design and Verification of High Speed SDRAM Controller with Adaptive Bank Management and Command Pipeline Design and Verification of High Speed SDRAM Controller with Adaptive Bank Management and Command Pipeline Ganesh Mottee, P.Shalini Mtech student, Dept of ECE, SIR MVIT Bangalore, VTU university, Karnataka,

More information

ECE 485/585 Microprocessor System Design

ECE 485/585 Microprocessor System Design Microprocessor System Design Lecture 5: Zeshan Chishti DRAM Basics DRAM Evolution SDRAM-based Memory Systems Electrical and Computer Engineering Dept. Maseeh College of Engineering and Computer Science

More information

ECE 485/585 Microprocessor System Design

ECE 485/585 Microprocessor System Design Microprocessor System Design Lecture 4: Memory Hierarchy Memory Taxonomy SRAM Basics Memory Organization DRAM Basics Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering

More information

LOW-POWER SPLIT-RADIX FFT PROCESSORS

LOW-POWER SPLIT-RADIX FFT PROCESSORS LOW-POWER SPLIT-RADIX FFT PROCESSORS Avinash 1, Manjunath Managuli 2, Suresh Babu D 3 ABSTRACT To design a split radix fast Fourier transform is an ideal person for the implementing of a low-power FFT

More information

EECS150 - Digital Design Lecture 17 Memory 2

EECS150 - Digital Design Lecture 17 Memory 2 EECS150 - Digital Design Lecture 17 Memory 2 October 22, 2002 John Wawrzynek Fall 2002 EECS150 Lec17-mem2 Page 1 SDRAM Recap General Characteristics Optimized for high density and therefore low cost/bit

More information

Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units

Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units Abstract: Split-radix fast Fourier transform (SRFFT) is an ideal candidate for the implementation of a lowpower FFT processor, because

More information

Two-level Reconfigurable Architecture for High-Performance Signal Processing

Two-level Reconfigurable Architecture for High-Performance Signal Processing International Conference on Engineering of Reconfigurable Systems and Algorithms, ERSA 04, pp. 177 183, Las Vegas, Nevada, June 2004. Two-level Reconfigurable Architecture for High-Performance Signal Processing

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

FPGA Matrix Multiplier

FPGA Matrix Multiplier FPGA Matrix Multiplier In Hwan Baek Henri Samueli School of Engineering and Applied Science University of California Los Angeles Los Angeles, California Email: chris.inhwan.baek@gmail.com David Boeck Henri

More information

ISSN: [Bilani* et al.,7(2): February, 2018] Impact Factor: 5.164

ISSN: [Bilani* et al.,7(2): February, 2018] Impact Factor: 5.164 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY A REVIEWARTICLE OF SDRAM DESIGN WITH NECESSARY CRITERIA OF DDR CONTROLLER Sushmita Bilani *1 & Mr. Sujeet Mishra 2 *1 M.Tech Student

More information

CS698Y: Modern Memory Systems Lecture-16 (DRAM Timing Constraints) Biswabandan Panda

CS698Y: Modern Memory Systems Lecture-16 (DRAM Timing Constraints) Biswabandan Panda CS698Y: Modern Memory Systems Lecture-16 (DRAM Timing Constraints) Biswabandan Panda biswap@cse.iitk.ac.in https://www.cse.iitk.ac.in/users/biswap/cs698y.html Row decoder Accessing a Row Access Address

More information

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices 3 Digital Systems Implementation Programmable Logic Devices Basic FPGA Architectures Why Programmable Logic Devices (PLDs)? Low cost, low risk way of implementing digital circuits as application specific

More information

Efficient Self-Reconfigurable Implementations Using On-Chip Memory

Efficient Self-Reconfigurable Implementations Using On-Chip Memory 10th International Conference on Field Programmable Logic and Applications, August 2000. Efficient Self-Reconfigurable Implementations Using On-Chip Memory Sameer Wadhwa and Andreas Dandalis University

More information

Performance Evolution of DDR3 SDRAM Controller for Communication Networks

Performance Evolution of DDR3 SDRAM Controller for Communication Networks Performance Evolution of DDR3 SDRAM Controller for Communication Networks U.Venkata Rao 1, G.Siva Suresh Kumar 2, G.Phani Kumar 3 1,2,3 Department of ECE, Sai Ganapathi Engineering College, Visakhaapatnam,

More information

EECS150 - Digital Design Lecture 11 SRAM (II), Caches. Announcements

EECS150 - Digital Design Lecture 11 SRAM (II), Caches. Announcements EECS15 - Digital Design Lecture 11 SRAM (II), Caches September 29, 211 Elad Alon Electrical Engineering and Computer Sciences University of California, Berkeley http//www-inst.eecs.berkeley.edu/~cs15 Fall

More information

8. Migrating Stratix II Device Resources to HardCopy II Devices

8. Migrating Stratix II Device Resources to HardCopy II Devices 8. Migrating Stratix II Device Resources to HardCopy II Devices H51024-1.3 Introduction Altera HardCopy II devices and Stratix II devices are both manufactured on a 1.2-V, 90-nm process technology and

More information

Design and Implementation of an Application Programming Interface for Volume Rendering

Design and Implementation of an Application Programming Interface for Volume Rendering LITH-ITN-MT-EX--02/06--SE Design and Implementation of an Application Programming Interface for Volume Rendering Examensarbete utfört i Medieteknik vid Linköpings Tekniska Högskola, Campus Norrköping Håkan

More information

Multimedia Decoder Using the Nios II Processor

Multimedia Decoder Using the Nios II Processor Multimedia Decoder Using the Nios II Processor Third Prize Multimedia Decoder Using the Nios II Processor Institution: Participants: Instructor: Indian Institute of Science Mythri Alle, Naresh K. V., Svatantra

More information

Resource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs

Resource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs In Proceedings of the International Conference on Distributed Smart Cameras, Como, Italy, August 2009. Resource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs Hojin

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Low Power Complex Multiplier based FFT Processor

Low Power Complex Multiplier based FFT Processor Low Power Complex Multiplier based FFT Processor V.Sarada, Dr.T.Vigneswaran 2 ECE, SRM University, Chennai,India saradasaran@gmail.com 2 ECE, VIT University, Chennai,India vigneshvlsi@gmail.com Abstract-

More information

International Journal of Innovative and Emerging Research in Engineering. e-issn: p-issn:

International Journal of Innovative and Emerging Research in Engineering. e-issn: p-issn: Available online at www.ijiere.com International Journal of Innovative and Emerging Research in Engineering e-issn: 2394-3343 p-issn: 2394-5494 Design and Implementation of FFT Processor using CORDIC Algorithm

More information

The Next Generation 65-nm FPGA. Steve Douglass, Kees Vissers, Peter Alfke Xilinx August 21, 2006

The Next Generation 65-nm FPGA. Steve Douglass, Kees Vissers, Peter Alfke Xilinx August 21, 2006 The Next Generation 65-nm FPGA Steve Douglass, Kees Vissers, Peter Alfke Xilinx August 21, 2006 Hot Chips, 2006 Structure of the talk 65nm technology going towards 32nm Virtex-5 family Improved I/O Benchmarking

More information

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs High Throughput Energy Efficient Parallel FFT Architecture on FPGAs Ren Chen Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA 989 Email: renchen@usc.edu

More information

Internal Memory. Computer Architecture. Outline. Memory Hierarchy. Semiconductor Memory Types. Copyright 2000 N. AYDIN. All rights reserved.

Internal Memory. Computer Architecture. Outline. Memory Hierarchy. Semiconductor Memory Types. Copyright 2000 N. AYDIN. All rights reserved. Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Internal Memory http://www.yildiz.edu.tr/~naydin 1 2 Outline Semiconductor main memory Random Access Memory

More information

Decimation-in-Frequency (DIF) Radix-2 FFT *

Decimation-in-Frequency (DIF) Radix-2 FFT * OpenStax-CX module: m1018 1 Decimation-in-Frequency (DIF) Radix- FFT * Douglas L. Jones This work is produced by OpenStax-CX and licensed under the Creative Commons Attribution License 1.0 The radix- decimation-in-frequency

More information

Memory Management Algorithms on Distributed Systems. Katie Becker and David Rodgers CS425 April 15, 2005

Memory Management Algorithms on Distributed Systems. Katie Becker and David Rodgers CS425 April 15, 2005 Memory Management Algorithms on Distributed Systems Katie Becker and David Rodgers CS425 April 15, 2005 Table of Contents 1. Introduction 2. Coarse Grained Memory 2.1. Bottlenecks 2.2. Simulations 2.3.

More information

Gedae cwcembedded.com. The CHAMP-AV6 VPX-REDI. Digital Signal Processing Card. Maximizing Performance with Minimal Porting Effort

Gedae cwcembedded.com. The CHAMP-AV6 VPX-REDI. Digital Signal Processing Card. Maximizing Performance with Minimal Porting Effort Technology White Paper The CHAMP-AV6 VPX-REDI Digital Signal Processing Card Maximizing Performance with Minimal Porting Effort Introduction The Curtiss-Wright Controls Embedded Computing CHAMP-AV6 is

More information

SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture

SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture Journal of Signal Processing Systems (2018) 90:1583 1592 https://doi.org/10.1007/s11265-018-1370-y SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture Carl Ingemarsson 1 Oscar Gustafsson

More information

White Paper The Need for a High-Bandwidth Memory Architecture in Programmable Logic Devices

White Paper The Need for a High-Bandwidth Memory Architecture in Programmable Logic Devices Introduction White Paper The Need for a High-Bandwidth Memory Architecture in Programmable Logic Devices One of the challenges faced by engineers designing communications equipment is that memory devices

More information

CENG3420 Lecture 08: Memory Organization

CENG3420 Lecture 08: Memory Organization CENG3420 Lecture 08: Memory Organization Bei Yu byu@cse.cuhk.edu.hk (Latest update: February 22, 2018) Spring 2018 1 / 48 Overview Introduction Random Access Memory (RAM) Interleaving Secondary Memory

More information

EN2911X: Reconfigurable Computing Topic 01: Programmable Logic

EN2911X: Reconfigurable Computing Topic 01: Programmable Logic EN2911X: Reconfigurable Computing Topic 01: Programmable Logic Prof. Sherief Reda School of Engineering, Brown University Fall 2012 1 FPGA architecture Programmable interconnect Programmable logic blocks

More information