REMOTE SENSING AND IMAGING IN A RECONFIGURABLE COMPUTING ENVIRONMENT


By VIKAS AGGARWAL

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2005

This document is dedicated to my parents and my sister.

ACKNOWLEDGMENTS

I would first like to thank the Almighty for giving me an opportunity to come this far in life. I also wish to thank the Department of ECE at UF; all the professors for their words of wisdom; Dr. Alan George and Dr. Kenneth Slatton for their infinite support, guidance and encouraging words; and all the members of the High-performance Computing and Simulation Lab and the Adaptive Signal Processing Lab for their technical support and friendship. I also take this opportunity to thank my parents for their nurture and support, and my sister for always encouraging me whenever I was down. I hope I can fulfill all their expectations in life.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 BACKGROUND AND RELATED RESEARCH
   Reconfigurable Computing
      The Era of Programmable Hardware Devices
      The Enabling Technology for RC: FPGAs
   Remote-Sensing Test Application: Data Fusion
      Data Acquisition Process
      Data Fusion: Multiscale Kalman Filter and Smoother
   Related Research

3 FEASIBILITY ANALYSIS AND SYSTEM ARCHITECTURE
   Issues and Trade-offs
   Fixed-point vs. Floating-point Arithmetic
   One-dimensional Time-tracking Kalman Filter Design
   Full System Architecture

4 MULTISCALE FILTER DESIGN AND RESULTS
   Experimental Setup
   Design Architecture #
   Design Architecture #
   Design Architecture #
   Performance Projection on Other Systems
      Nallatech BenNUEY Motherboard (with BenBLUE-II daughter card)
      Honeywell Reconfigurable Space Computer (HRSC)

5 CONCLUSIONS AND FUTURE WORK

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3.1 Post place and route results showing the resource utilization for a 32x32 bit divider and the maximum frequency of operation
Quantization error introduced by fixed-point arithmetic
Performance results tabulated over 600 data points (slices occupied: 2%)
Performance comparison of a processor with an FPGA configured with design #
Comparison of the execution time on the FPGA with the Xeon processor and resource requirements of the hardware configuration for design #
Components of the execution time on an FPGA for processing eight rows of the input image using design #
Components of the total execution time on the FPGA for processing a single scale of input data with different levels of concurrent processing

LIST OF FIGURES

2.1 Various computation platforms on the flexibility and performance spectrum
2.2 A simplistic view of an FPGA and a logic cell inside that forms the basic building block
2.3 Block diagram description of the RC1000 board. Courtesy: RC1000 reference manual
2.4 The steps involved in data acquisition and data processing
2.5 Quad-tree data structure
Data generation and validation
Set of sequential processing steps involved in the 1-D Kalman filter
Difference between Matlab computed estimates and Handel-C fixed-point estimates using eight bits of precision after the binary point for 100 data points
Processing path for (a) calculation of parameters and (b) calculation of the estimates
High-level system block diagram
Simulated observations corresponding to the 256x256 resolution scale
Block diagram of the architecture of design #
Block diagram for the architecture of both the Kalman filter and smoother
Error statistics of the output obtained from the fixed-point implementation
Performance improvement and change in resource requirements with increase in concurrent computation
Block diagram of the architecture of design #
Performance improvement and change in resource requirements with increase in the number of concurrently processed pixels
4.8 Block diagram of the architecture of design #
Improvement in the performance with increase in concurrent computation
Error statistics for the output obtained after a single scale of filtering
Block diagram for the hardware architecture of the Nallatech BenNUEY board with a BenBLUE-II extension card
Block diagram of the hardware architecture of the HRSC board

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

REMOTE SENSING AND IMAGING IN A RECONFIGURABLE COMPUTING ENVIRONMENT

By Vikas Aggarwal

December 2005

Chair: Kenneth C. Slatton
Cochair: Alan D. George
Major Department: Electrical and Computer Engineering

In recent years, there has been a significant improvement in the sensors employed for data collection. This has further pushed the envelope of the amount of data involved, rendering the conventional techniques of data collection, dissemination and ground-based processing impractical in several situations. The possibility of on-board processing has opened new doors for real-time applications and reduced the demands on the bandwidth of the downlink. Reconfigurable computing (RC), a new star in the field of high-performance computing, could be employed as the enabling technology for such systems, where conventional computing resources are constrained by many factors as described later. This work explores the possibility of deploying reconfigurable systems in remote sensing applications. As a case study, a data fusion application, which combines the information obtained from multiple sensors of different resolutions, is used to perform feasibility analysis. The conclusions drawn from different design architectures for the test application are used to identify the limitations of current systems and propose future systems enabled with RC resources.

CHAPTER 1
INTRODUCTION

Recent advances in sensor technology, such as increased resolution, frame rate, and number of channels, have resulted in a tremendous increase in the amount of data available for imaging applications, such as airborne and space-based remote sensing of the Earth, biomedical imaging, and computer vision. Raw data collected from the sensor must usually undergo significant processing before it can be properly interpreted. The need to maximize processing throughput, especially on embedded platforms, has in turn driven the need for new processing modalities. Today's data acquisition and dissemination systems need to perform more processing than previous systems to support real-time applications and reduce the bandwidth demands on the downlink. Though cluster-based computing resources are the most widely used platform on ground stations, several factors, such as space, cost and power, make them impractical for on-board processing. FPGA-based reconfigurable systems are emerging as low-cost solutions which offer enormous computation potential in both the cluster-based systems and embedded systems arenas.

Remote sensing systems are employed in many different forms, covering the gamut from compact, power- and weight-limited satellite and airborne systems to much larger ground-based systems. The enormous number of sensors involved in the data collection process places heavy demands on the I/O capabilities of the system. The solution to the problem could be approached from two directions: first, performing data compression on-board before transmitting data; or second, performing some on-board computation and transmitting the processed data. The target computation system must be capable of processing multiple data streams in parallel at a high rate to support real-time applications, which further increases the complexity of the problem. The nature of the problem demands that the processing system should not only be capable of high performance but also be able to deliver excellent performance per unit cost (where the cost includes several factors such as power, space and system price).

There has been a plethora of publications which have demonstrated success in porting several remote sensing and image processing applications to FPGA-based platforms [1-5]. Some researchers have also created high-level algorithm development environments that expedite the porting and streamlining of such application codes [6-8]. However, understanding the special needs of this class of applications and analyzing existing platforms to determine their viability as future computation engines for remote sensing systems warrant further research and examination. Identifying some missing components in current platforms that are essential for such systems is the focus of this work. A remote sensing application is used to illustrate the process.

In this work, a data fusion application has been chosen as representative of the class of remote sensing applications, because it incorporates a wide variety of features that stress different aspects of the target computation system. The recent interest in sensor research has led to a multitude of sensors in the market, which differ drastically in their phenomenology, accuracy, resolution and quantity of data. Traditionally, Interferometric Synthetic Aperture Radar (InSAR) has been employed for mapping extended areas of terrain with moderate resolution. Airborne Laser Swath Mapping (ALSM) has been increasingly employed to map local elevations at high resolution over smaller regions. A multiscale estimation framework can then be employed to fuse the data obtained from such different sensors having different resolutions to produce improved estimates over large coverage areas while maintaining high resolution locally. The nature of the processing involved imposes an enormous computation burden on the system. Hence, the target system should be equipped with enormous computation potential.

Since their inception, processing systems have fallen into two separate camps. The first camp saw a need to accommodate wide varieties of applications with multiple processes running concurrently on the same system, and therefore chose General-Purpose Processors (GPPs) to serve their needs. The other camp preferred to improve the speed of the application and chose to leverage the performance advantages of Application-Specific Integrated Circuits (ASICs). Over time, these two camps strayed further apart in terms of processing abilities, flexibility and costs involved. Meanwhile, due to technological advancements over the past decade, Reconfigurable Computing (RC) has garnered a great deal of attention from both the academic community and industry. Reconfigurable systems have tried to fuse the merits of both camps and have proven to be a promising alternative. RC has demonstrated speedups on the order of 10 to 100 in comparison to GPPs for certain application domains such as image and signal processing [2, 4, 5, 9]. An even more remarkable aspect of this relatively new programming paradigm is that the performance improvements are obtained at less than two thirds of the cost of conventional processors. FPGA-enabled reconfigurable processing platforms have even outperformed ASICs in market domains including signal processing and cryptography, where ASICs and DSPs have been the dominant modalities for decades.

The past decade has seen tremendous growth in RC technology, but it is still in its infancy. The development tools, target system architectures and even the processes for porting applications need to mature before they can make meaningful accomplishments. However, RC-based designs have already shown performance speedups in application domains such as image processing, which requires processing patterns similar to those of many remote-sensing applications. Conventional processor-based resources cannot be employed in such applications because of their inherent limitations of size, power and weight, which RC-based systems can overcome.

The structure of imaging algorithms lends itself to a high degree of parallelism that can be exploited in hardware by FPGAs. The computations are often data-parallel, require little control, operate on large data sets (effectively infinite streams), and use raw sensor data elements that do not have large bit widths, making them amenable to RC. However, this class of applications has three characteristics which make them challenging. First, they involve many arithmetic operations (e.g., multiply-accumulate and trigonometric functions) on real and/or complex data. Second, they require significant memory support, not just in the capacity of memory but also in the bandwidth that can be supported. Third, the scale of computation is large, requiring (possibly) hundreds of parallel operations per second and high-bandwidth interconnections to meet real-time constraints. These challenges must be addressed if RC systems are to significantly impact future remote-sensing systems.

This work explores the possibility of deploying reconfigurable systems in remote-sensing applications using the chosen test case for feasibility analysis. The conclusions drawn from different design architectures for the test application are used to identify the limitations of the current systems and propose solutions to enable future systems with RC resources.

The structure of the remaining document is as follows. Chapter 2 presents a brief background on reconfigurable computing using FPGAs as the enabling technology and on data fusion using a multiscale Kalman filter/smoother. It also presents a discussion of related research in the field of reconfigurable computing as applied to remote sensing and similar application domains. Chapter 3 presents some tests performed for initial study and feasibility analysis. The experiments are based on designs of the 1-D Kalman filter, which forms the heart of the calculations involved in the data fusion application. Chapter 4 presents a sequence of revisions to 2-D filter designs developed to solve the problem, along with the associated methodologies. Each of these designs builds on the limitations identified in the previous design and proposes a better solution under the given system constraints. Their performance is compared with baseline C code running on a Xeon processor. Several graphs and tables that are derived from the results are also presented. The final architecture in the chapter emulates the performance of an ideal system, and as a result it outperforms the processor-based solution, achieving over an order of magnitude speedup. Chapter 5 summarizes the research and the findings of this work. It draws conclusions based on the results and observations presented in the previous chapters. It also gives some future directions for work beyond this thesis.

CHAPTER 2
BACKGROUND AND RELATED RESEARCH

This work involves an interdisciplinary study of remote-sensing applications and a new paradigm in the field of high-performance computing: Reconfigurable Computing. The fast development time of reconfigurable devices, their density, and advanced features such as optimized compact hardwired cores, programmable interconnections, memory arrays, and communication interfaces have made them a very attractive option for both terrestrial and space-/air-borne applications. There are multiple advantages of equipping future systems with reconfigurable computation engines. First, they help in overcoming the limited bandwidth problem on the downlink. Second, they create the possibility of providing several real-time applications on-board. Third, they can also be used in feedback mechanisms to change the data collection strategy in response to the quality of the received data or to change the instrumentation planning policy.

This thesis aims at designing different architectures for a test application to analyze various features of the existing platforms and suggest necessary improvements. The hardware designs for the FPGA are implemented using the Handel-C language (with DK-3 as its integrated development environment) to expedite the process as compared to the conventional HDL design flow. This chapter provides a comprehensive discussion of different aspects of reconfigurable computing along with a brief description of the application. To summarize the existing research, a brief review of the relevant prior work in this field and other related fields is also presented in this chapter.

2.1 Reconfigurable Computing

This section presents some history on the germination, progress and recent explosion of this relatively new computing paradigm. It also describes the rise over time in the usage of programmable logic devices in general and concludes with a detailed discussion of the enabling technology for RC: FPGAs.

2.1.1 The Era of Programmable Hardware Devices

From the infancy of Application-Specific Integrated Circuits (ASICs), designers could foresee a need for chips with specialized hardware that would provide enormous computational potential. The state of the art of IC fabrication technology limited the amount of logic that could be packed into a single chip. During the 1980s and 1990s, fabrication technology matured and saw drastic improvements in many fabrication processes, and the era of VLSI began. The high development and fabrication costs started dropping as the 1990s saw an explosion in the demand for such products. The 1980s and 1990s also saw the birth of the killer microprocessor [10], which started capturing a big portion of the market. The faster time to market motivated many to forgo ASICs and adopt general-purpose processors (GPPs) or special-purpose processors such as digital signal processors (DSPs). While this approach provided a great deal of success in several application domains with relative simplicity, it was never able to match the performance of specialized hardware in the high-performance computing (HPC) community. Real-time systems and other HPC systems with heavy computational demands still had to revert to ASIC implementations.

To overcome the limitations of high non-recurring engineering (NRE) costs and long development times, an alternative methodology was developed: Programmable Logic Devices (PLDs). PLDs started playing a major role in the early 1990s. Since they provided faster design cycles and mitigated the initial costs, they were soon adopted as inexpensive prototyping tools to perform design exploration for ASIC-based systems. As the technology matured, the applications of PLDs expanded beyond their role as placeholders into essential components of the final system [11]. Because they can be programmed in the field, developers could foresee PLDs playing a major role in HPC, where they could offer many advantages over conventional GPPs and ASICs.

GPPs and ASICs have existed at two extremes on the spectrum of computational resources. The key trade-off has been between flexibility, where GPPs have held the lead in the market, and performance, where ASICs have overshadowed the former. PLDs (also known as RC engines because of their ability to be reprogrammed) have made a strong impact in the market by providing the best of both worlds.

Figure 2.1. Various computation platforms on the flexibility and performance spectrum.

2.1.2 The Enabling Technology for RC: FPGAs

As gate density further improved, a particular family of PLDs, namely FPGAs, became a particularly attractive option for researchers. An FPGA consists of an array of configurable logic elements along with a fully programmable interconnect fabric capable of providing complex routing between these logic elements [12]. Figure 2.2 presents an oversimplified structure of an FPGA. The routing resources, represented in the diagram by the horizontal and vertical wires that run between the Configurable Logic Blocks (CLBs), consume over 75% of the area on the chip. The flexible nature of an FPGA's architecture provides a platform upon which applications can be mapped efficiently. The ability to reconfigure the logic cells and the connections allows the behavior of the system to be modified even after deployment. This feature has important implications for supporting multiple applications, as an FPGA-based system could eliminate the need to create a new system each time a new application is to be ported.

Figure 2.2. A simplistic view of an FPGA and a logic cell inside that forms the basic building block [13]. The diagram shows an array of Configurable Logic Blocks (CLBs), each containing a LUT and a flip-flop, together with programmable connections and I/O blocks that provide access to external pins.
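To make the "LUT + flip-flop" logic cell of Figure 2.2 concrete, the following Python sketch models a single cell in software. It is only an illustration: the class name `LogicCell` and its methods are hypothetical, and the model ignores routing, timing and every vendor-specific detail of a real CLB. It shows how loading a different truth table changes the function of the same cell, which is the essence of reconfiguration.

```python
from itertools import product


class LogicCell:
    """Toy model of one logic cell: a k-input look-up table plus a flip-flop."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.table = [0] * (2 ** num_inputs)  # configuration bits of the LUT
        self.flip_flop = 0                    # registered output

    def configure(self, func):
        """Load the LUT with the truth table of func (a reconfiguration)."""
        for bits in product((0, 1), repeat=self.num_inputs):
            self.table[self._address(bits)] = int(bool(func(*bits)))

    @staticmethod
    def _address(bits):
        # Treat the input bits as a binary address into the LUT.
        address = 0
        for bit in bits:
            address = (address << 1) | bit
        return address

    def evaluate(self, *bits):
        """Combinational output: a single table look-up."""
        return self.table[self._address(bits)]

    def clock(self, *bits):
        """On a clock edge, latch the LUT output into the flip-flop."""
        self.flip_flop = self.evaluate(*bits)
        return self.flip_flop


cell = LogicCell(num_inputs=2)
cell.configure(lambda a, b: a ^ b)   # the cell now behaves as an XOR gate
xor_outputs = [cell.evaluate(a, b) for a, b in product((0, 1), repeat=2)]
cell.configure(lambda a, b: a & b)   # same cell, new function: an AND gate
and_outputs = [cell.evaluate(a, b) for a, b in product((0, 1), repeat=2)]
```

In this model any Boolean function of the inputs costs the same single look-up, so only the stored table distinguishes an XOR cell from an AND cell.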

Modern FPGAs are embedded with a host of advanced processing blocks, such as hardware multipliers and processor cores, to make them more amenable to complex processing applications. One of the several advantages that FPGAs offer over conventional processors is that they are massively parallel computing machines that lend themselves well to applications with inherent fine-grain parallelism. Because a farm of CLBs can operate completely independently, a large number of operations can take place on-chip simultaneously, unlike in most other computing devices. The capability for concurrent computation and the high memory bandwidth offered by the internal RAMs in FPGAs give them an edge over DSPs for several signal processing applications. Highly pipelined designs help further in overlapping and hiding latencies at various processing steps. The execution time of control system software is difficult to predict on modern processors because of caches, virtual memory, pipelining and several other issues which make the worst-case performance significantly different from the average case. In contrast, the execution time on an FPGA is deterministic in nature, which is an important factor for time-critical applications.

Although FPGAs provide a powerful platform for efficiently creating highly optimized hardware configurations of applications, the process of configuration generation can be quite labor-intensive. Hence, researchers have been looking for alternative ways of porting applications with relative ease. Several graphical tools and higher-level programming tools are being developed by vendors to speed up the design cycle of porting an application to an FPGA. This thesis makes use of one such high-level programming tool called Handel-C (started as a project at Oxford University and later developed into a commercial tool by Celoxica Inc.) [8] to enable fast prototyping and architectural analysis. Handel-C is an extension, and to some extent a superset, of standard ANSI C that adds constructs for communication channels and parallelization pragmas while removing support for many ANSI C standard libraries and other functionality such as recursion and compound assignment. DK, the development environment that supports Handel-C, provides floating-point and fixed-point libraries. The compiler can produce synthesizable VHDL or an EDIF netlist and supports functional simulation. Handel-C and its corresponding development environment have been used previously in numerous other projects, including image processing algorithms [14] and other HPC benchmarks [2].

The most common way in which RC-based systems exist today is as extensions to conventional processors. The FPGAs are integrated with memory and other essential components on a single board, which then attaches to the host processor through some interconnect such as a Peripheral Component Interconnect (PCI) bus. This work makes use of an RC1000 board, a PCI-based board developed by Celoxica Inc. and equipped with a VirtexE 2000 FPGA. Hardware configurations for the board can be generated using the DK development environment or by following the more conventional VHDL design path. Figure 2.3 shows a block diagram of the RC1000 board. The card carries 8 MB of memory organized into four independently accessible banks of 2 MB each. The memory is accessible both to the FPGA and to any other device on the PCI bus. The FPGA has two clock inputs on the global clock buffer. The first pin derives its clock from a programmable clock source or an external clock, whereas the second pin derives its clock from the second programmable clock source or the PCI bus clock. The board supports a variety of data transfers over the PCI bus, ranging from bit- and byte-wide register transfers to DMA transfers into one of the memory banks.

Recently, researchers have felt the need to develop stand-alone FPGA systems that can be accessed over traditional network interfaces. Such an autonomous system with an embedded processor and an FPGA [15] (a novel concept developed by researchers at the HCS Lab, known as the Network-Attached Reconfigurable Computer or NARC) offers a very cost-effective solution for the embedded computing world, especially where power and space are at a premium.

Figure 2.3. Block diagram description of the RC1000 board. Courtesy: RC1000 reference manual [16]. Blocks labeled in the diagram: host, primary and secondary PCI buses, PCI-PCI bridge, PMC #1 and PMC #2, PLX interface, clock and control, four 512k x 32 SRAM banks, Xilinx FPGA (BG560 package), isolation, +3V3/+2V5 linear regulator, and auxiliary I/O.

2.2 Remote-Sensing Test Application: Data Fusion

In the past couple of decades, there has been a tremendous amount of research in sensor technology. This research has resulted in rapid advancement of related technologies and a plethora of sensors in the market that differ significantly in the quality of the data they collect. One of the most important applications that has attracted overwhelming attention in remote sensing is mapping topography and building digital elevation maps of regions of the Earth using different kinds of sensors. These maps are then employed by researchers in different disciplines for various scientific applications (e.g., in oceanography for estimation of ocean surface height and the behavior of currents, in geodesy for estimation of the Earth's gravitational equipotential, etc.). Traditionally, satellite-based systems equipped with sensors like InSAR and Topographic Synthetic Aperture Radar (TOPSAR) had been employed to map extended areas of topography. However, these sensors lacked high accuracy and produced images of moderate resolution over the region of interest. Recently, ALSM has emerged as an important technology for remotely sensing topography. ALSM sensors provide extremely accurate, high-resolution maps of local elevation, but operate through a very exhaustive process which limits the coverage area to smaller regions. Because of the varying nature of the data produced by these sensors, researchers have been developing algorithms that fuse information obtained at different resolutions into a single elevation map. A multiscale estimation framework developed by Fieguth [17] has been employed extensively over the past decade for performing efficient statistical analysis, interpolation, and smoothing. This framework has also been adopted to fuse ALSM and InSAR data having different resolutions to produce improved estimates over large coverage areas while maintaining high resolution locally.


2.2.1 Data Acquisition Process

Before delving into the mathematics of the estimation algorithm, a brief description of the current process of data acquisition is presented to aid in the understanding of this work and the motivation behind it. The process of collecting ALSM data involves flying the sensor in an aircraft over the region of interest. As the aircraft travels along each flight line, the sensor scans the region of interest below in a raster scan fashion in a direction orthogonal to the movement of the aircraft.

Figure 2.4. The steps involved in data acquisition and data processing.

With the help of several other position sensors and movement measurement instruments, the XYZ coordinates of the mapped topography are generated and stored on disk in ASCII format. These XYZ coordinates are then used to generate a dense 3-D point cloud of irregularly spaced points. This 3-D data set, when gridded in the X and Y directions at varying grid spacings, results in 2-D maps of the corresponding resolutions.
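The gridding step can be illustrated with a short Python sketch. The text does not say how points that fall in the same grid cell are combined, so the averaging used here is an assumption, as are the function name `grid_points` and the sample coordinates. The sketch bins irregularly spaced (x, y, z) points into square cells and shows how a larger grid spacing yields a coarser map from the same point cloud.

```python
from collections import defaultdict


def grid_points(points, cell_size):
    """Average the z values of (x, y, z) points that fall in each grid cell.

    Returns a dict mapping (row, col) to the mean elevation in that cell.
    Cells that received no points are simply absent from the result.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, z in points:
        key = (int(y // cell_size), int(x // cell_size))
        sums[key] += z
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical, irregularly spaced samples: (x, y, elevation).
cloud = [(0.2, 0.3, 10.0), (0.8, 0.6, 12.0), (1.5, 0.4, 20.0),
         (0.1, 1.7, 30.0), (1.9, 1.2, 40.0), (1.1, 1.8, 44.0)]

fine = grid_points(cloud, cell_size=1.0)    # 2 x 2 map
coarse = grid_points(cloud, cell_size=2.0)  # 1 x 1 map
```

Cells with no returns are left out of the result. This differs from the no-dropout case assumed later in this chapter, where observations are present at all pixels.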

These images are then employed in the multiscale estimation framework with the SAR images to fuse the data sets and produce improved maps of topography. Because of the lack of processing capabilities on the aircraft, these operations cannot be performed in real time, and hence the data is stored on disk and processed offline at ground stations. Several applications could be made possible if on-board processing facilities were made available on the aircraft. Such a real-time system would offer several advantages over the conventional systems. For example, it could be used to change the data collection strategy or repeat the process over certain selected regions in response to the quality of the data obtained. RC-based platforms, as described in the previous section, are a good fit for deployment in such systems. Although this work deals closely with an aircraft-based target system, the issues involved are generic and apply to most other remote sensing systems, with some exceptions. As a result, some issues are not addressed in this work. For example, radiation effects have important implications for satellite-based systems, which require some form of redundancy to overcome single event upsets (SEUs), but they do not affect an aircraft-based system. While it is desirable to design a complete system that could be deployed on-board an aircraft, doing so would entail a plethora of implementation issues that divert the focus from the more interesting aspects of this research. Hence, instead of building an end-to-end system, this work focuses on the data fusion application employed in the entire process (Figure 2.4) and uses it as a test case to analyze the feasibility of deploying RC-based systems in the remote sensing arena. The following subsection provides a description of this data fusion algorithm.

2.2.2 Data Fusion: Multiscale Kalman Filter and Smoother

The multiscale models which are the focus of this thesis were proposed by Fieguth et al. [17] and provide a scale-recursive framework for estimating topography at multiple resolutions. This multiresolution estimation framework offers highly efficient statistical analysis, interpolation, and smoothing of extremely large data sets. The framework also enjoys a number of other advantages not shared by other statistical methods. In particular, the algorithm has complexity that grows only linearly with the number of leaf nodes. Additionally, the algorithm provides interpolated estimates at multiple resolutions along with the corresponding error variances, which are useful in assessing the accuracy of the estimates. For these reasons, and many more, researchers have adopted this algorithm for various remote-sensing applications.

Multiscale Kalman smoothers modeled on fractional Brownian motion are defined on index sets organized as multi-level quad-trees, as shown in Figure 2.5. The multiscale estimation is initiated with a fine-to-coarse sweep up the quad-tree that is analogous to Kalman filtering with an added merge step. This fine-to-coarse sweep is followed by a coarse-to-fine sweep down the quad-tree that corresponds to Kalman smoothing.

Figure 2.5. Quad-tree data structure, where m represents the scale (scales m = 0, m = 1 and m = 2 are shown).
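The quad-tree of Figure 2.5 can be indexed with simple integer arithmetic. The Python sketch below assumes that the node at scale m is addressed by a (row, col) pair on a 2^m by 2^m grid; this addressing and the function names are illustrative choices, not taken from the thesis. It shows the parent and child relations and the order in which the two sweeps visit the scales.

```python
def nodes_at_scale(m):
    """Number of nodes at scale m of a quad-tree (the root is scale 0)."""
    return 4 ** m


def parent(m, row, col):
    """Index of the parent of node (m, row, col); the root has none."""
    if m == 0:
        return None
    return (m - 1, row // 2, col // 2)


def children(m, row, col):
    """Indices of the four children of node (m, row, col)."""
    return [(m + 1, 2 * row + dr, 2 * col + dc)
            for dr in (0, 1) for dc in (0, 1)]


def upward_order(finest):
    """Scales in the order visited by the fine-to-coarse sweep."""
    return list(range(finest, -1, -1))


def downward_order(finest):
    """Scales in the order visited by the coarse-to-fine sweep."""
    return list(range(0, finest + 1))
```

With the finest scale at M, the tree holds (4^(M+1) - 1)/3 nodes in total, fewer than 4/3 times the number of leaves. Visiting each node once per sweep is therefore consistent with a cost that grows linearly with the number of leaf nodes.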

The statistical process defined on the tree is related to the observation process and has the coarse-to-fine scale mapping defined as follows:

$$x(s) = A(s)\,x(s\bar{\gamma}) + B(s)\,w(s) \tag{1}$$
$$y(s) = C(s)\,x(s) + v(s) \tag{2}$$

where
$s$ represents an abstract index for a node on the tree
$\bar{\gamma}$ represents the lifting operator; $s\bar{\gamma}$ represents the parent of $s$
$x(s)$ represents the state variable
$y(s)$ represents the observation (LIDAR or InSAR)
$A(s)$ represents the state transition operator
$B(s)$ represents the stochastic detail scaling function
$C(s)$ represents the measurement-state relation
$w(s)$ represents the white process noise
$v(s)$ represents the white measurement noise
$\alpha$ represents the lowering operator; $s\alpha_n$ represents the $n$th child of $s$
$q$ represents the order of the tree, i.e., the number of descendants a parent has

The process noise $w(s)$ is Gaussian with zero mean and the variance given by the following relations:

$$E[w(s)\,w^T(t)] = I\,\delta_{s,t} \tag{3}$$
$$w(s) \sim N(0, I) \tag{4}$$

The prior covariance at the root node is given by

$$x_0 = x(0) \sim N(0, P_0) \tag{5}$$
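As an illustration of Equations 1 and 2, the following Python sketch draws one realization of the process on a small quad-tree and a noisy observation of its finest scale. It assumes a scalar state with A(s) = 1 and C(s) = 1, and a detail scaling that halves at each finer scale. That scaling schedule, the parameter values and the function names are hypothetical and chosen only to make the sketch run.

```python
import random


def simulate_tree(finest, p0, b_of_scale, rng):
    """Draw one sample of x(s) on a quad-tree using Equation 1 with A(s) = 1.

    finest      -- index of the finest scale (the root is scale 0)
    p0          -- prior variance of the root state, Equation 5
    b_of_scale  -- hypothetical function giving the scalar B at scale m
    rng         -- random.Random instance supplying the white noise w(s)
    Returns a list of 2-D grids, one per scale, of size 2**m by 2**m.
    """
    grids = [[[rng.gauss(0.0, p0 ** 0.5)]]]  # x(0) ~ N(0, P0)
    for m in range(1, finest + 1):
        size = 2 ** m
        coarse = grids[-1]
        b = b_of_scale(m)
        # Each child copies its parent (A = 1) and adds B * w with w ~ N(0, 1).
        fine = [[coarse[r // 2][c // 2] + b * rng.gauss(0.0, 1.0)
                 for c in range(size)] for r in range(size)]
        grids.append(fine)
    return grids


def observe(grid, r, rng):
    """Equation 2 with C(s) = 1: add white measurement noise of variance r."""
    return [[x + rng.gauss(0.0, r ** 0.5) for x in row] for row in grid]


rng = random.Random(0)
# Assumed detail scaling that halves at every finer scale (illustrative only).
truth = simulate_tree(finest=3, p0=4.0, b_of_scale=lambda m: 2.0 ** -m, rng=rng)
noisy = observe(truth[3], r=0.25, rng=rng)
```

Setting the detail scaling to zero makes every node equal to the root, which is a quick check that the parent-to-child copy in Equation 1 is wired correctly.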

28 18 The parameter A (), B (), C () that define the model need to be choen appropriately to match the proce being modeled. The tate tranition operator wa choen to be 1 to create a model where the child node are the true value of the parent node offet a mall value dependent on the proce noie. The parameter B() i obtained uing power pectral matching or fractal dimenion claification method. The meaurement tate relation matrix C() wa aigned a 1 for all pixel to repreent the cae where obervation are preent at all pixel without any data dropout. Correponding to any choice of the downward model, an upward model on the tree can be defined a in Fieguth et al. [17] x ( γ ) = F( ) x( ) + w( ) (6) y ( ) = C( ) x( ) + v( ) (7) T 1 = P A ( ) P F( ) γ (8) where, Ε[ w( ) w T ( )] = P γ = Q( ) (1 A T ( ) P 1 A( ) P t P i the covariance of the tate defined a Ε [ x ( ) x( ) ]. γ ) (9) Now, the algorithm can proceed with the two tep outlined above (upward and downward weep) after initializing each leaf node with prior value, x ˆ ( + ) = 0 (10) P ( +) = (11) P A) Upward weep The operation involved in the upward weep are very imilar to a 1-D Kalman filter which form the heart of the computation and can be perceived a running along the cale

on each pixel with an additional merge step after every iteration. The calculations performed at each node are as follows:

$$V(s) = C(s)\,P(s \mid s+)\,C^{T}(s) + R(s) \tag{12}$$

$$K(s) = P(s \mid s+)\,C^{T}(s)\,V^{-1}(s) \tag{13}$$

$$P(s \mid s) = [\,I - K(s)\,C(s)\,]\,P(s \mid s+) \tag{14}$$

$$\hat{x}(s \mid s) = \hat{x}(s \mid s+) + K(s)\,\big(y(s) - C(s)\,\hat{x}(s \mid s+)\big) \tag{15}$$

The Kalman filter prediction step is then applied at all nodes except the leaf nodes, which were initialized as mentioned above:

$$\hat{x}(s \mid s\alpha_i) = F(s\alpha_i)\,\hat{x}(s\alpha_i \mid s\alpha_i) \tag{16}$$

$$P(s \mid s\alpha_i) = F(s\alpha_i)\,P(s\alpha_i \mid s\alpha_i)\,F^{T}(s\alpha_i) + Q(s\alpha_i) \tag{17}$$

This leads to a prediction estimate of the parent node from each descendant ($i = 1, \ldots, q$). These estimates are then merged into a single estimate value to be used in the measurement update step:

$$\hat{x}(s \mid s+) = P(s \mid s+) \sum_{i=1}^{q} P^{-1}(s \mid s\alpha_i)\,\hat{x}(s \mid s\alpha_i) \tag{18}$$

$$P(s \mid s+) = \Big[\,(1 - q)\,P_{s}^{-1} + \sum_{i=1}^{q} P^{-1}(s \mid s\alpha_i)\,\Big]^{-1} \tag{19}$$

This process is iterated over all the scales (m) until the root node is reached. It generates estimates at multiple scales based on the statistical information acquired from the descendant layers. The completion of the upward sweep leads to a smoothed estimate at the root node, $\hat{x}^{s}(0) = \hat{x}(0 \mid 0)$.

B) Downward Sweep

The smoothed estimates for the remaining nodes are computed by propagating the information back down the tree in a smoothing step:

$$\hat{x}^{s}(s) = \hat{x}(s \mid s) + J(s)\,[\,\hat{x}^{s}(s\gamma) - \hat{x}(s\gamma \mid s)\,] \tag{20}$$

$$P^{s}(s) = P(s \mid s) + J(s)\,[\,P^{s}(s\gamma) - P(s\gamma \mid s)\,]\,J^{T}(s) \tag{21}$$

$$J(s) = P(s \mid s)\,F^{T}(s)\,P^{-1}(s\gamma \mid s) \tag{22}$$

where
$\hat{x}^{s}(s)$ represents the smoothed estimate
$P^{s}(s)$ represents the corresponding error variance

It is worth mentioning here that the set of computations outlined above, despite being closely coupled, has two independent processing paths. This fact is further exploited in Chapters 3 and 4, where the designs targeting the hardware are explored.

2.3 Related Research

There has been an abundance of publications over the past decade by researchers who have tried accelerating various image processing algorithms on FPGAs. There has also been a good deal of academic and industry effort in deploying such dual-paradigm systems (which make use of both conventional processing resources and RC-based resources) in space for improving on-board performance. However, only a limited amount of work has been done in understanding the nature of processing involved in such applications to identify their specialized demands from the target systems.

One of the earliest works on designing a Kalman filter for an FPGA was by Lee and Salcic in 1997 [18], where they attempted to accelerate a Kalman tracking filter for multi-target tracking radar systems. They achieved an order of magnitude improvement with their design that spread over six 8000-series Altera chips in comparison to previous

attempts in [19-22] that targeted transputers, digital signal processors, and linear arrays for obtaining improved performance over software implementations. In an application note from Celoxica Inc. [23], Chappel, Macarthur et al. present a system implementation for boresighting of sensor apertures using a Kalman filter for sensor fusion. The system utilizes a COTS-based FPGA which embeds a 32-bit softcore processor to perform the filtering operation. Their work serves as a classical example of developing a low-cost solution for embedded systems using FPGAs. There have also been other works [24-25] that perform Kalman filtering using an FPGA for solving similar problems, such as implementation of a state space controller and real-time video filtering. In [25], Turney, Reza, and Delva have astutely performed pre-computation of certain parameters to reduce the resource requirements of the algorithm.

The floating-point calculations involved in signal processing algorithms are not amenable to FPGA or hardware implementations in general, so researchers have resorted to fixed-point implementations and have been exploring ways to mitigate the errors hence induced. In [18], Lee and Salcic employ normalization of certain coefficients by the process variance to maximize data accuracy with a fixed number of bits or minimize the resource requirements for a certain level of accuracy. There has also been plenty of work done on the algorithmic side to overcome such effects. In [26], Scharf and Sigurdsson present a study of scaling rules and round-off noise variances in a fixed-point implementation of a Kalman filter.

The 1-D Kalman filter involves heavy sequential processing steps and hence cannot fully exploit the fine-grain parallelism available in FPGAs. The multiscale Kalman filter by contrast involves independent operations on multiple pixels in an image and offers a

high degree of parallelism (DoP) that is representative of the class of image processing algorithms. The possibility of operating on multiple pixels in parallel has motivated many researchers to target different imaging algorithms on FPGAs. Researchers [2, 4, 5, 9] have presented several examples illustrating the performance improvements obtained by porting imaging algorithms like 2-D Fast Fourier Transform, image classification, filtering, 2-D convolution and edge detection to FPGA-based platforms. Dawood, Williams and Visser have developed a complete system [27] for performing image compression using FPGAs on-board a satellite to reduce bandwidth demand on the downlink. Several researchers have even developed a high-level environment [6-7] to provide the application programmer a much easier interface for targeting FPGAs for imaging applications. They achieve this goal by developing a parameterized library of commonly employed kernels in signal/image processing; these cores can then be instantiated from a high-level environment as needed by the application.

Employing RC technology in the remote sensing arena is not a new concept, and several attempts have been made previously to take advantage of this technology. Buren, Murray and Langley in their work [28] have developed a reconfigurable computing board for high-performance computing in space using SRAM-based FPGAs. They have addressed the special needs of such systems and identified some key components that are essential to their success, among them the use of high-speed dedicated memories for the FPGA, high I/O bandwidth, and support for periodic reloading to mitigate radiation effects. In [3], Arribas and Macia have developed an FPGA board for a real-time vision development system that has been tailored for the embedded environment. Besides the academic research community, industries have also shown

keen interest in the field. Honeywell has developed a Honeywell Reconfigurable Space Computer (HRSC) [29] board as a prototype of the RC adaptive processing cell concept for satellite-based processing. The HRSC incorporates system-level SEU mitigation techniques and facilitates the implementation of design-level techniques. In addition to hardware development research, an abundance of work has been done on developing hardware configurations for various remote-sensing applications. In [1], a SAR/GMTI range compression algorithm has been developed for an FPGA-based system. Sivilotti, Cho et al., in their work in [30], have developed an automatic target detection application for SAR images to meet the high bandwidth and performance requirements of the application. Other works in [27, 31-33] discuss the issues involved in porting similar applications, such as geometric global positioning and sonar processing, to FPGAs. This thesis aims to further the existing research in this field by developing a multiscale estimation application for an FPGA-enabled system and exploring different architectures to meet the system requirements.

CHAPTER 3
FEASIBILITY ANALYSIS AND SYSTEM ARCHITECTURE

This chapter presents a discussion of some of the issues involved and some initial experiments performed for feasibility analysis. The results of these tests influenced the choice of design parameters and architectural decisions made for the hardware designs of the algorithm presented in the next chapter.

3.1 Issues and Trade-offs

Most signal processing algorithms executed on conventional platforms employ double-precision, floating-point arithmetic. As pointed out earlier, such floating-point arithmetic is not amenable to hardware implementation on FPGAs because of its large logic area requirements (a more detailed comparison is presented in the next subsection). Carrying out these processing steps in fixed-point arithmetic is desirable but introduces quantization error which, if not controlled, can lead to errors large enough to defeat the purpose of hardware acceleration. Hence, there exists an important trade-off between the number of bits used and the amount of logic area required. There are techniques that mitigate these effects by modifying certain parts of the algorithm itself. Examples of such techniques include normalizing different parameters to reduce the dynamic range of variables, and using variable bit precision for different parameters, with more bits for more significant variables.

This work makes use of the fixed- and floating-point libraries available in the Celoxica DK package. To perform experimental analysis, the simulated data is generated using Matlab (the values for the simulated data were chosen to closely represent the true values of data acquired from the sensors). Hence, a procedure is

required to verify the models in the FPGA with the data generated in Matlab. Text files are used in this work as a vehicle to carry out the task. Since Matlab produces double-precision, floating-point data while the FPGA requires fixed-point values, extra processing is needed to perform the conversion.

[Figure 3.1. Data generation and validation. Blocks: Matlab simulation and .txt files containing data; Handel-C/VHDL; hardware.]

Another important issue that can become a significant hurdle arises from the nature of the processing involved in the Kalman filtering algorithm. The Kalman filter equations are recursive in nature and require the current iteration's estimate value to begin the next state calculation. This behavior is clearly visible in Figure 3.2, which shows the processing steps in a time-tracking filter.

Initial prior values: $\hat{x}_k^{-}$, $P_k^{-}$

Measurement update:

$$K_k = P_k^{-} H_k^{T} \left(H_k P_k^{-} H_k^{T} + R_k\right)^{-1}$$

$$\hat{x}_k = \hat{x}_k^{-} + K_k \left(z_k - H_k \hat{x}_k^{-}\right)$$

$$P_k = \left(I - K_k H_k\right) P_k^{-}$$

Time update:

$$\hat{x}_{k+1}^{-} = \Phi_k\,\hat{x}_k$$

$$P_{k+1}^{-} = \Phi_k P_k \Phi_k^{T} + Q_k$$

where
$x_k$ represents the state variable
$z_k$ represents the observation input
$\Phi_k$ represents the state transition operator
$P_k$ represents the associated error covariance
$Q_k$ represents the process noise variance
$R_k$ represents the measurement noise variance

Figure 3.2. Set of sequential processing steps involved in the 1-D Kalman filter
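The loop-carried dependency shown in Figure 3.2 can be illustrated with a minimal scalar sketch in Python. This is not the Handel-C design from this work; the function name `kalman_1d` and all default parameter values are hypothetical and chosen only for illustration.

```python
def kalman_1d(z, phi=1.0, h=1.0, q=0.01, r=0.25, x_prior=0.0, p_prior=1.0):
    """Scalar time-tracking Kalman filter following the steps of Figure 3.2.

    All default values are illustrative assumptions, not values from the thesis.
    """
    estimates = []
    x_minus, p_minus = x_prior, p_prior
    for zk in z:
        # Measurement update: gain, estimate, error covariance.
        k = p_minus * h / (h * p_minus * h + r)
        x = x_minus + k * (zk - h * x_minus)
        p = (1.0 - k * h) * p_minus
        estimates.append(x)
        # Time update: the next prior depends on this iteration's x and p,
        # so iteration k+1 cannot start before iteration k has finished.
        x_minus = phi * x
        p_minus = phi * p * phi + q
    return estimates


if __name__ == "__main__":
    print(kalman_1d([1.0, 1.0, 1.0, 1.0]))
```

Because `x_minus` and `p_minus` are written at the end of one iteration and read at the start of the next, successive iterations cannot be overlapped.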

This problem cannot be mitigated by pipelining the different stages, because of the data dependency that exists from the last stage of the pipeline to the first stage. Although this can be a major roadblock for 1-D tracking algorithms, the situation can be made much better in the multiscale filtering algorithm because of the presence of an additional dimension where the parallelism can be exploited along the scale. Multiple pixels in the same scale can be processed in parallel as they are completely independent of each other. The number of such parallel paths would ultimately depend on the amount of resources required by the processing path of each pixel pipeline.

Another interesting but subtle trade-off exists between the memory requirements and the logic area requirements. A hardware architecture of the algorithm could be made to reuse some resources on the chip (e.g., the arithmetic operations that are instantiated could be reused, especially modules such as multipliers and dividers, which consume excessive area) by saving the intermediate results in the on-board SRAM. This approach decreases the logic area demands of the algorithm, but at the cost of increased memory requirements and extra clock cycles for computing each estimate.

3.2 Fixed-point vs. Floating-point Arithmetic

A study was performed to compare the resource demands posed by floating-point operations as opposed to fixed-point and integer operations of equal bit widths. Since division is a very expensive operation in hardware, it yields more meaningful differences and was hence chosen as the test case. The IEEE single-precision format was employed for floating-point division. A Xilinx VirtexII chip was chosen as the target platform for the experiment to exploit the multiplier components present in the device (and also partly due to the fact that the current version of the tool does not support

floating-point division on any other platform). Table 3.1 compares the resource requirements for the different cases.

Table 3.1. Post place-and-route results showing the resource utilization for a 32-bit divider and the maximum frequency of operation. Target chip: VirtexII 6000; package: ff1152; speed grade: -4.

Arithmetic                                                  Slices (total 33792)   18-bit × 18-bit multipliers (total 144)   Max. frequency
Integer (32 bits)                                           19 (1%)                3 (2%)                                    63.2 MHz
Fixed-point (16 bits before and after the binary point)     84 (1%)                6 (4%)                                    50.5 MHz
IEEE single-precision floating-point (32 bits)              487 (1%)               4 (2%)                                    97.5 MHz

The high cost involved in floating-point operations is clearly visible from the table. The high frequency obtained for the floating-point unit, which appears as an anomaly, merely reflects the efficiency of the cores used by the library in the tool.

Once the cost savings obtained by resorting to fixed-point operations have been identified, we need to understand the errors introduced through this process. To analyze these errors, multiple designs of the 1-D filter were developed in Handel-C with different fixed-point bit widths. Although simulation was adopted to generate the outputs, the designs were made to closely represent hardware models such that minimal changes could translate them into hardware configurations. The simulation outputs were compared with Matlab double-precision floating-point results. The mean square errors (MSE) between the filter estimates and the actual expected outputs were used as a metric for comparison, as shown in Table 3.2.

Table 3.2. Quantization error introduced by fixed-point arithmetic (averages are over 100 data points). The first column defines the number of bits before and after the binary point.
[Columns: fixed-point precision; MSE in Matlab (double precision); MSE in Handel-C (and % error from Matlab); mean square error from Matlab; max. abs. error from Matlab. The recoverable entries of the percentage-error column are 3%, 300% and 460%.]

Although the maximum and average tolerable levels of error are largely dependent on the nature of the application, it is clearly evident from Table 3.2 that eight bits of precision after the binary point yield reasonable performance with less than 0.5% of maximum absolute error. As is also visible from the table, the accuracy decreases rapidly with reduction in bit width. The number of bits required before the binary point was kept constant as it is dictated by the dynamic range of the data set and not by the nature of processing. These observations influenced the selection of eight bits of precision after the binary point for the hardware architecture of the multiscale filter.

Figure 3.3 depicts the same information by plotting the difference between the results obtained from the Matlab floating-point and Handel-C fixed-point implementations (with 8 bits of precision) for each data point of the time series. It is worth mentioning here that the quantization error values presented in the graphs and tables are specific to the Kalman filter and depend on the type of processing involved (multiplication and division having the worst effect on quantization errors). The recursive nature of the processing involved over the scales in a multiscale filter may tend to accumulate the errors at each scale and lead to larger error values than those presented in this section.
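The source of the quantization error discussed above can be sketched in a few lines of Python. This is only an illustration of a fixed-point format with eight fractional bits, not the Celoxica library used in this work. The helper names, the signed 16-bit container, the saturation behavior and the truncating multiply are all assumptions.

```python
def to_fixed(value, frac_bits=8, int_bits=8):
    """Quantize a real value to a signed fixed-point integer (assumed layout).

    Values outside the representable range are saturated.
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    raw = int(round(value * scale))
    return max(lo, min(hi, raw))


def to_float(raw, frac_bits=8):
    """Convert a fixed-point integer back to a real value."""
    return raw / (1 << frac_bits)


def fixed_mul(a, b, frac_bits=8):
    """Multiply two fixed-point values, dropping the extra fractional bits."""
    return (a * b) >> frac_bits


def mse(xs, ys):
    """Mean square error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    reals = [0.1, 0.37, 1.5, 2.71828]
    quantized = [to_float(to_fixed(v)) for v in reals]
    print(quantized, mse(reals, quantized))
```

With eight fractional bits, rounding a single in-range value introduces an error of at most 2^-9. Each multiplication then discards further low-order bits, which is one way the error can grow across a recursive computation.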

[Figure 3.3. Difference between Matlab computed estimates and Handel-C fixed-point estimates using eight bits of precision after the binary point for 100 data points. Axes: absolute error vs. iteration.]

3.3 One-dimensional Time-tracking Kalman Filter Design

A 1-D time-tracking filter (represented by the equations in Figure 3.2) was designed for the RC1000 FPGA board with a VirtexE 2000 chip (more details in Chapter 2). The design could not be pipelined because of the existence of a worst-case data dependence from the output of the current iteration to the input of the next iteration. Performing all of the computations in hardware proved exorbitantly bulky and occupied about 12% of the slices for one instantiation of the filter. In addition, the long combinational delay introduced by the division operator led to a considerably low design frequency of about 3.9 MHz. To make these designs of any practical importance, we need to overcome these problems or find a way to mitigate their effects. Revisiting the algorithm and taking a closer look at the equations involved yields some interesting information.

[Figure 3.4. Processing path for (a) calculation of parameters and (b) calculation of the estimates. Signals shown: R, H, Pp, F, Q, K, Ppnew, Xp, Z, Xpnew.]

The block diagram of the computation steps reveals the existence of two independent processing chains. This fact has important implications for future hardware designs, as it allows a reduction in the amount of computation that needs to be performed in hardware. The calculation of the estimate error covariance (Ppnew) and the filter gain (K) is completely independent of the observation input and the generated estimates, and hence can be done offline even prior to data collection. The pre-computation of these filter parameters has multiple advantages. It reduces the logic area requirements from about 12% to under 3% and eliminates the division operator from the critical path, increasing the maximum design frequency to about 90 MHz. In addition, it also allows the filter parameters to be changed by merely replacing the set of pre-computed parameters, introducing a sort of virtual reconfiguration capability where the behavior of the filter changes without reconfiguring the FPGA. But these benefits come at the cost of extra

memory requirements for storage of the filter parameters, a trade-off that was mentioned earlier. The reduction in area is an important consideration for the 2-D filter design, as it allows a larger number of pixels to be processed in parallel.

With these modifications, a 1-D filter was developed for the RC1000 board and performance experiments were conducted with data sets containing 600 data points. The latency incurred for transferring all the values, one byte at a time, over the PCI bus hampers the performance. To overcome this limitation, DMA is used to transfer all the data values onto the on-board SRAM. The performance results for both the DMA and non-DMA cases are shown in Table 3.3 below and compared against Matlab results. The timing results in Matlab varied across multiple trials (which is attributed to the way Matlab works internally and handles memory). For this reason, further experiments were performed using a C-based solution for obtaining software execution times.

Table 3.3. Performance results tabulated over 600 data points (slices occupied: 2%)

Code version                  Execution time   MSE
Matlab (running on 3 GHz)     0-16 ms
FPGA (non-DMA)                49.5 ms
FPGA (DMA)                    1 ms

Components of FPGA execution time (DMA case):
DMA write of 600 data values: 64 µs
DMA read of 600 values: 44 µs
Computation time for 600 values: 285 µs

3.4 Full System Architecture

Figure 3.5 depicts a high-level block diagram of the system architecture. To provide more functionality and enhance usability, a display mechanism may be included in the final system depending on the needs of the application. Since this work focuses on just a part of the entire system (data fusion/estimation), the inputs are not

directly obtained from the sensors. Instead, they go through several stages of processing before they are converted into a form compatible with the system shown in Figure 3.5. Most of these processing stages are currently performed in ground-based stations, but some of the processing is also performed on-board the aircraft. Similarly, the outputs could be used directly for providing visual feedback or may need to pass through some further post-processing stages before being in a directly usable format.

[Figure 3.5. High-level system block diagram. Labels: pre-processing, raw image, memory, filter parameters, input, filter logic, off-board, FPGA, output, fixed-point to real number conversion, filtered image data to display unit, video controller.]

As shown in the diagram, the system depends heavily on on-board memory, which preferably should be organized in multiple independently addressable banks. The filter parameters corresponding to a chosen filter model will be stored in one or more memory banks. These filter parameters define the behavior of the algorithm, and hence multiple sets of such parameters could be stored in off-board memory and transferred into on-board memory as needed. The input image coming from one of the pre-processing blocks is distributed by a memory controller into one of the multiple memory blocks reserved for input. The provision of multiple input banks allows the input data transfer time to overlap with the computation time of the previous set of data. Spreading the filter parameters across multiple banks also aids in reducing the computation time by allowing multiple parameters to be read in parallel.

The test system on which all the experiments are performed consists of a PCI-based card residing in a conventional Linux server. Hence it may not exactly mirror the system just outlined, and could involve some additional issues such as limited memory, input/output transfer latencies over the PCI bus, etc. The goals of this work include identifying limitations in the current systems that hamper the performance, and speculating on additional features that can enhance the efficiency of the system, on the basis of the results obtained from experiments.

CHAPTER 4
MULTISCALE FILTER DESIGN AND RESULTS

This chapter presents different hardware designs developed for the multiscale Kalman filter. Each subsequent design explores the opportunity to further improve performance and builds on the shortcomings discovered in the previous design. Results of the timing experiments are presented and analyzed to assess performance bottlenecks.

4.1 Experimental Setup

Before presenting the hardware designs developed for the RC1000 target platform, a brief discussion of the input data set that was used is presented in order to set the experiments up for results and analysis. The test data was generated by simulation in Matlab to emulate the Digital Elevation Map (DEM) obtained over a topographical region.

[Figure 4.1. Simulated observations corresponding to resolution scale.]

The highest resolution observation was chosen to have a support of 256 × 256 pixels and represents the data set corresponding to the one generated by an ALSM sensor. This image resolution hence gives rise to nine scales in the quad tree structure. Another set of observations was generated for a coarser scale, representing the data generated from INSAR. Figure 4.1 depicts the (finer scale) data set, which can be seen to have four varying levels of roughness for different regions of the image. Such a structure was chosen to incorporate the data corresponding to different kinds of terrain, such as plain grassland (smooth) and sparse forest (rough), in a single data set. This structure of the simulated observations was created by following the fractional Brownian model and populating the nodes in the quad tree structure starting from the root node, using the coarse-to-fine model equations of Chapter 2.

As with the case of 1-D filtering, pre-computation is employed in this case as well to reduce the resource requirements. Hence, the filter parameters needed for the online computation of estimates are also generated using the equations of Chapter 2 (namely $K(s)$, $C(s)$, $F(s)$, $P(s \mid s+)$, $P(s \mid s\alpha)$). The parameters along with the observation set required approximately 870 KB of memory when represented in the 8.8 fixed-point format. The small footprint of the data fed as input to the FPGA provides several opportunities for exploiting the on-board memory in different ways, as demonstrated further in the chapter. Some designs might require additional storage because of details of the architecture. It is also worth noting that the structure of computations in the 2-D filter, though similar to the 1-D case, has extra operations due to the merge step involved in combining the child nodes into a parent pixel.
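As a rough illustration of how a quad tree can be populated from the root downward with the coarse-to-fine model of equation (1) and A(s) = 1, consider the following Python sketch. It is not the Matlab generator used in this work. The function name, the geometric decay of B with scale (a simple stand-in for a fractional Brownian model) and all default values are assumptions.

```python
import random


def populate_quadtree(num_scales, b_root=1.0, decay=0.5, x_root=0.0, seed=0):
    """Return one square grid per scale, coarsest (1 x 1 root) first.

    Each child is x(s) = x(parent) + B * w with w ~ N(0, 1), i.e. equation (1)
    with A(s) = 1. The geometric decay of B is an illustrative assumption.
    """
    rng = random.Random(seed)
    scales = [[[x_root]]]
    b = b_root
    for m in range(1, num_scales):
        parent = scales[-1]
        size = 2 ** m
        # Pixel (r, c) at this scale has its parent at (r // 2, c // 2).
        child = [
            [parent[r // 2][c // 2] + b * rng.gauss(0.0, 1.0) for c in range(size)]
            for r in range(size)
        ]
        scales.append(child)
        b *= decay
    return scales


if __name__ == "__main__":
    tree = populate_quadtree(9)
    print([len(grid) for grid in tree])
```

With nine scales the finest grid is 256 pixels on a side, and every parent has q = 4 children.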

Design Architecture #1

The block diagram representing the hardware configuration (as for the RC1000 card) of this design is shown in Figure 4.2. Pre-processed filter parameters are transferred via DMA into one of the SRAM banks. Although the transfer has to go over the PCI bus and incurs a substantial latency in the experimental system, this overhead has been separated in the results because the actual target system might have faster means of such transfer. Besides, the latency is just a one-time effect and can be avoided by preloading the parameters of the desired filter model. Hence, the input latency will only be visible in the virtual reconfiguration case, where a new set of parameters needs to be transferred into the on-board memory.

[Figure 4.2. Block diagram of the architecture of design #1.]

This design basically exploits the data parallelism available in the algorithm by pipelining the processing over multiple pixels. The pipeline performs the computation for four pixels simultaneously, as this lends itself well to the quad tree structure of the application. The major performance hurdle is experienced at the memory port. Since each stage of the pipeline processing requires some parameters to be read from memory, a resource

conflict exists. Hence, a stall is required after every stage of computation, which effectively breaks down the pipeline and its associated advantages to a large extent. Table 4.1 compares the processing time for the algorithm on a 2.4 GHz Xeon processor with the time on the RC1000 board. The FPGA on the board is clocked at 30 MHz, and better performance can be obtained by increasing the design clock, which can be achieved by using faster, more advanced chips and by further optimizing the design.

Table 4.1. Performance comparison of a processor with an FPGA configured with design #1.

Execution time on          Single scale    Multiple scales (down to 4 × 4)
RC1000                     ms              20.5 ms
2.4 GHz Xeon processor     9.85 ms         13.49 ms

Resource utilization:
Slices: 3286 (17%)
Memory: 850 KB approx. (filter parameters); 170 KB approx. (output)
DMA latency for sending the data over the PCI bus:
1 scale (650 KB): 3.1 ms
All scales (870 KB approx.): 3.9 ms

The times are compared for both single and multiple scales of processing. The computation is terminated when the image support reduces to just 4 × 4 because the overheads dominate the actual computation time. The amount of resources occupied by this configuration is also listed beside the table. With just 17% of logic utilization for the processing of four pixels, enough area is left to increase the amount of concurrent computation by adding more pixels to the processing chain. The values in the table show that the FPGA-based filter performs about 1.5 times faster than a conventional processor. In the embedded systems arena, the absolute performance of a system is less relevant than the performance per cost, which is considered a better metric for comparison. Similarly, raw performance improvements might not result in an order of magnitude speedup, but they do come at about one hundredth of the running cost of a

competing system. The lessons learned from this design point to the fact that memory bandwidth is a crucial factor for obtaining better performance for the application. The resource hazards need to be eliminated to take complete advantage of the pipelined structure.

The previous discussion related to the Kalman filtering involved in the application, which populates the nodes on the tree going upward. This application also involves a smoothing step to generate the estimates while traversing the tree from top to bottom. Recursive application of the filtering pipeline generates multiple sets of estimates at different scales, which are then used by the computations in the second step. The structure of the calculations involved is similar to the filtering step, but requires some intermediate data values to be saved in addition to the outputs as in the previous case. This further increases the memory bandwidth demands of the system, but since these calculations only begin after the completion of the upward step, they are not in conflict. Figure 4.3 shows the modifications required to incorporate these effects. The additional data has been stored in the empty memory banks, which allows it to be read in parallel for the smoothing pipeline without any stall cycles.

[Figure 4.3. Block diagram for the architecture of both the Kalman filter and smoother.]


Another set of parameters (also described in equations from Section 2.2.1) is required for the smoothing operation and is also stored in the same memory bank with the other parameters. The design shown was spread across two chips by having both of the operational pipelines as independent designs. This result was achieved by creating two separate FPGA configuration files and reconfiguring the FPGA on an RC1000 board with the second file after the completion of the upward sweep step, in effect emulating a multi-FPGA system. This technique allows for higher computational potential and also provides the possibility of pipelining the upward and downward sweep operations for multiple data sets on a higher conceptual level.

For this part of the experiment, the observations were limited to the finest scale (representing the LIDAR data), which implies that no additional statistical information is incorporated in the filtering step except at the finest scale. Hence, the smoothed estimates could be obtained by performing the smoothing from one scale coarser to the observation scale. The existence of observations at multiple scales may have more than a single advantage in several cases; it not only increases the amount of available computation to be exploited but also helps in mitigating the precision effects that tend to accumulate over the scales in the application.

[Figure 4.4. Error statistics of the output obtained from the fixed-point implementation. Statistics reported: maximum absolute error from Matlab; MSE; maximum error percentage (<2% in most cases).]

The entire application, consisting of both the filtering and smoothing steps, requires a total of about 23% (17% for filtering + 6% for smoothing) of the slices, processing four pixels simultaneously. Figure 4.4 compares the output obtained from the hardware version with the Matlab double-precision, floating-point results. Because of the similarity in the structure of the two processing steps and the extra memory demands posed on the system by including both of them, the follow-on designs focus only on the filtering part of the application.

A simplistic approach to improving the performance that follows from the previous filter design is to extend the architecture to process more pixels in parallel, in effect filling up the unused area on the chip. The problem that hinders the performance gain is the set of input parameters that are required to process more pixels. Increasing the number of concurrent pixels further increases the memory I/O demands. As a result, extra stall cycles are needed to read the input values. These stall cycles, which are a major overhead, become a dominant part of the computation time and quickly saturate the performance of the entire system. This issue can be clearly understood by taking a closer look at the operational pipeline from the Handel-C code.

The main loop of the application takes 17 cycles of execution for 4 pixels, of which just 7 cycles perform actual computation. Therefore, the clock cycles (CC) for execution are:

4-pixel pipeline: (7 + 10) × (256/2)
8-pixel pipeline: (7 + 20) × (256/4)
16-pixel pipeline: (7 + 40) × (256/8)

Hence, for 4n replications of the pipeline:

$$f(n) = (7 + 10n)\,\frac{256}{2n}$$

Slope of the curve:

$$f'(n) = -\frac{14}{4n^{2}} \times 256$$
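The saturation implied by these expressions can be checked numerically. The Python sketch below simply evaluates the cycle-count expressions as they appear above; the function names are hypothetical, and any scale factor common to all cases is ignored since it does not change the shape of the curve.

```python
def design1_cycles(n, compute=7, stall_per_group=10, side=256):
    """Cycle count from the text for a pipeline handling 4*n pixels."""
    return (compute + stall_per_group * n) * side / (2 * n)


def design1_floor(stall_per_group=10, side=256):
    """Limit of design1_cycles as n grows: only the stall cycles remain."""
    return stall_per_group * side / 2


if __name__ == "__main__":
    base = design1_cycles(1)
    for n in (1, 2, 4, 8, 16):
        cc = design1_cycles(n)
        print(f"{4 * n:3d} pixels: {cc:7.1f} cycles, gain over 4 pixels {base / cc:.2f}x")
    print("floor:", design1_floor())
```

Under these expressions, going from a 4-pixel to a 16-pixel pipeline lowers the count from 2176 to 1504, a gain of roughly 1.45 for four times the logic. The count can never fall below 1280, because the stall term grows with n as fast as the work per pixel shrinks.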

The same information is also conveyed in graphical form in Figure 4.5. As expected, the resource requirements increase linearly with the pixel count but the performance does not. These limitations need to be circumvented in the next design to further improve the performance through a better pipeline that has fewer stall cycles, in effect exposing more parallelism and hiding the input latency. There are two memory banks that were not utilized in the current design, which could be employed to increase the memory bandwidth.

[Figure 4.5. Performance improvement and change in resource requirements with increase in concurrent computation. Axes: execution time (ms) and speedup vs. processor against the number of pixels processed in parallel. FPGA resource utilization (slices occupied): 4 pixels 17%; 8 pixels 31%; 16 pixels 64%.]

4.2 Design Architecture #2

This design provides the FPGA-based engine with a higher memory bandwidth by using all four on-board memory banks (i.e., 4 × 32 = 128 bits per CC) in an attempt to eliminate the resource hazard present in the previous design. The filter parameters are now evenly spread across all the banks, and hence a simultaneous read of all memory ports provides all the needed inputs for processing a single pixel. The available data

parallelism is again exploited by pipelining the processing of independent pixels. Without the existence of stalls, the pipeline produces a single estimate every clock cycle. The constraint in the architecture comes from the fact that even all the memory banks together can only support the input for one pixel calculation. Simultaneous operations on multiple pixels require some stall cycles to be introduced in the design again.

Figure 4.6. Block diagram of the architecture of design #2.

Timing experiments were performed with the same set of data and compared against the processing time for C code running on a Xeon processor. These results are presented in Table 4.2 along with resource consumption information.

Table 4.2. Comparison of the execution time on the FPGA with Xeon processor and resource requirements of the hardware configuration for design #2.

Execution time on           Single scale ( )    Multiple scales (till 4 × 4)
RC                          ms                  20.5 ms
2.4 GHz Xeon processor      7.39 ms             9.86 ms
Speedup

Resource utilization:
Slices: 963 (5%)
Memory: 1 MB approx. (filter parameters)

DMA latency for sending the data over the PCI bus:
1 scale (650 KB): 10.2 ms
All scales (870 KB approx.): 11.4 ms

* The amount of data that needs to be transferred is slightly larger because of the architecture details.

The values show a minor improvement over the previous design with a speedup of two. This improvement is attained with just one pixel being processed by the pipeline, as a result of which the logic area required by the design comes down to just 5% of the chip. The amount of memory used by the design increased slightly because some zero padding is required in the higher order bits of the third and fourth banks that are left unused.

An attempt is made to further improve the performance by introducing more pixels in the same pipeline. However, this requires the introduction of one stall cycle for every extra pixel data read in the input stage of the pipeline. Again, a resource hazard exists due to the memory access, which eventually saturates the system performance with concurrent computation of a certain number of pixels. A mathematical analysis similar to the previous design could be employed to prove this fact as well. A closer look at the operational pipeline reveals that each estimate actually takes 3 cycles for computation (extra cycles are required because of a conflict at the fourth memory bank that serves as both input and output for the filtered estimate) and the analysis follows:

The main loop of the application takes 3 cycles for execution. Therefore, the CC for execution of:

1-pixel pipeline: = 3 × 256
2-pixel pipeline: = 4 × (256/2)

Hence, an (n+1)-pixel pipeline will require:

$$f(n) = (3 + n) \times \frac{256}{1 + n}$$

The slope of the graph can be found by the derivative:

$$f'(n) = -\frac{2 \times 256}{(1 + n)^{2}}$$

Since the slope is not constant, the speedup is sub-linear and flattens off quadratically.
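The saturation point can be made explicit by taking the limit of the (n+1)-pixel cycle count f(n) = 256(3 + n)/(1 + n). The worked steps below are a sketch that assumes this cycle model holds for every n, which is an idealization beyond the designs actually measured:

```latex
\begin{aligned}
f(n) &= \frac{256\,(3+n)}{1+n} = 256\left(1 + \frac{2}{1+n}\right), \\
\lim_{n\to\infty} f(n) &= 256, \\
\frac{f(0)}{f(n)} &= \frac{3\,(1+n)}{3+n} \;<\; 3 .
\end{aligned}
```

Under that assumption, adding pixels to this pipeline can never yield more than a threefold gain over the single-pixel version; the two-pixel case (n = 1) already gives 768/512 = 1.5x.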

Figure 4.7 depicts this same information by means of a graph drawn from the experimental data. These stalls impede a linear speedup with the number of pixels. These two designs also reiterate the subtle trade-off between the memory requirements and the logic area requirements. All of these stall cycles can be avoided by reducing the memory demands and computing all of the parameters online, but doing so will require more resources. This design also emphasized the criticality of the memory bandwidth. The next design will attempt to emulate a higher bandwidth system in order to overcome this limitation.

[Figure 4.7 plot: "Performance Analysis"; execution time on RC1000 (ms) and speedup over Xeon against the number of pipelines.]

FPGA resource utilization:
Pixels processed concurrently    Slices occupied
1                                5%
2                                11%

Figure 4.7. Performance improvement and change in resource requirements with increase in number of concurrently processed pixels.

4.3 Design Architecture #3

This design uses the on-chip block RAM (BRAM) resources to buffer the inputs and then works with these buffers for providing inputs to the pipeline. Treating BRAMs as buffers allows emulation of a more advanced RC system which is rich in memory bandwidth. A number of such buffers could be employed to have multiple non-stalling pipelines and hence increase the concurrent computation. The architecture for this design is depicted in Figure 4.8. The existence of input buffers means the design incurs

some additional latency, because of the cycles required to fill the buffers. Since the limited on-chip storage is much smaller than the total amount of input data, multiple iterations are required to process the entire data set. The goal of this hardware design is to estimate the performance of a much more advanced RC system, which is done by separating the buffering time from the actual computation time. These individual components of the total time have been tabulated in Tables 4.3 and 4.4.

Figure 4.8. Block diagram of the architecture of the design #3.

The designs were developed for a varying number of pipelines (one, two and four) and the timing experiments were performed for all these designs. The components of the total time were observed by maintaining a timer on the host processor and sending signals from the FPGA for reading the timer values at the completion of different processing stages. Since each design operated on a different number of pixels, the times for all the designs were extrapolated (wherever needed) for processing of eight rows of input image to have a common comparison basis. Each input buffer has a capacity of 1 KB and could therefore hold 512 values in the 8.8 fixed-point format.
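The buffer arithmetic above follows from each 8.8 value occupying 16 bits. The sketch below illustrates the format and the capacity calculation; it is not taken from the thesis hardware. The helper names are hypothetical, and the signed two's-complement range and round-to-nearest behaviour are assumptions, since the text does not state the signedness or rounding mode.

```python
import math

FRAC_BITS = 8             # 8 fractional bits in the 8.8 format
SCALE = 1 << FRAC_BITS    # 256 codes per unit
BYTES_PER_VALUE = 2       # 8 integer + 8 fractional bits = 16 bits
BUFFER_BYTES = 1024       # 1 KB input buffer (from the text)

# Assumption: signed two's-complement codes; the text does not state signedness.
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1


def to_q8_8(x: float) -> int:
    """Quantize x to a 16-bit 8.8 code (round to nearest, saturate on overflow)."""
    code = math.floor(x * SCALE + 0.5)
    return max(Q_MIN, min(Q_MAX, code))


def from_q8_8(code: int) -> float:
    """Convert an 8.8 code back to a real value."""
    return code / SCALE


def buffer_capacity(buffer_bytes: int = BUFFER_BYTES) -> int:
    """Number of 8.8 values that fit in a buffer of the given size."""
    return buffer_bytes // BYTES_PER_VALUE


if __name__ == "__main__":
    print(buffer_capacity())            # values per 1 KB buffer
    print(to_q8_8(1.5), from_q8_8(to_q8_8(1.5)))
```

With 8 fractional bits the quantization step is 1/256, so the round-trip error of any in-range value is at most 1/512.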

Table 4.3. Components of the execution time on an FPGA for processing eight rows of input image using the design #3.

Time (µs)                                        With 1 pipeline    With 2 pipelines    With 4 pipelines
Input DMA (for entire image)
Transfer into BRAM
Computation
Transfer output from BRAM
Output DMA (for entire output image 128x128)

This structure allowed each buffer to store data for processing of exactly two input rows of the high-resolution input image ( ). The times listed in Table 4.3 as the transfer times in and out of the BRAMs represent the overhead incurred because of the buffering and are almost the same for all the designs for a common basis of comparison (8 input image rows for our experiments). The designs differ in the amount of computation performed concurrently, which is visible in the row marked as computation time.

Table 4.4. Components of the total execution time on FPGA for processing a single scale of input data with different levels of concurrent processing, and the resource utilization.

Time (µs)                                        With 1 pipeline    With 2 pipelines    With 4 pipelines
Input DMA (for entire image)
Output DMA (for entire output image )
Computation
Transfer time to and from BRAMs
Total time (including transfers to/from BRAM)
Execution on 2.4 GHz Xeon processor

FPGA resources    Slices occupied    BRAMs used
1 pipeline        5%                 8%
2 pipelines       11%                17%
4 pipelines       22%                35%

With a non-stalling pipeline, a linear speedup is achieved by adding more pixels in the processing pipeline. Hence the performance of a system with sufficient memory bandwidth can be estimated by discarding the buffering overhead. These values are noted in Table 4.4, which compares the total execution time for processing of a single scale ( ) on the RC1000 versus the host processor. The values listed in Table 4.4 provide a good appraisal of the performance improvement attainable on a high memory bandwidth system. The resource consumption by the different designs is also listed. In addition to similar logic area requirements, this design also uses block RAMs heavily. It is evident that the design can be extended to satisfy the resource requirements for concurrent processing of about 10 pixels. The speedup hence obtained over the Xeon processor (neglecting the buffering time) is shown in the graph in Figure 4.9, where the values have been extrapolated for the eight-pixel design. The values are presented by both including and excluding the buffering time.

[Figure 4.9 plot: "Performance Comparison"; speedup against the number of processing pipelines, with and without buffering time.]

Figure 4.9. Improvement in the performance with increase in concurrent computation.

The results including the buffering time degenerate to the case of design #2 and as a result the performance improvement obtained by increasing the amount of concurrent computation starts tapering off quickly. But when the buffering times are discarded, a linear speedup is obtained by adding more computation. The last bar (with 8 pixels) in the graph shows that more than an order of magnitude improvement is attainable for good designs and systems with an appropriate amount of resources. Figure 4.10 presents the error statistics by comparing the outputs generated by the FPGA for this design with MATLAB-computed estimates. The mean square error has a satisfactorily low value of 0.01, with the percentage error being less than 2% in most cases.

[Figure 4.10 panel: error statistics, listing the maximum absolute error from MATLAB and the MSE.]

Figure 4.10. Error statistics for the output obtained after single scale of filtering.

4.4 Performance Projection on Other Systems

This subsection gauges the performance attainable on some of the other existing RC platforms. These projections are very simplistic in nature and are derived based on the system demands posed by the application, as identified from the previous experiments, and the resources available on these RC platforms. This projection is useful in appraising

the true performance advantages that can be obtained from an advanced RC-enabled remote sensing system.

4.4.1 Nallatech BenNUEY Motherboard (with BenBLUE-II daughter card)

Figure 4.11 shows a high-level architecture diagram of the board. It consists of three Xilinx FPGAs, each one of them being a VirtexII 6000 (speed grade -4). These advanced chips have larger logic resources, more block RAMs and additional specialized hardware modules such as dedicated multipliers. The VirtexII series also provides much faster interconnects than the VirtexE series. All these factors account for about a 2x to 4x improvement in the design clock frequency in moving from VirtexE FPGAs to a VirtexII for most designs. The system also supports a higher memory capacity, providing 12 MB of storage with a bandwidth of 192 bits every clock cycle. Almost a twofold increase in the memory bandwidth should yield a corresponding linear speedup of about 2x.

[Figure 4.11 diagram: four ZBT SSRAM banks (2 MB and 2 MB with 32-bit data, 4 MB and 4 MB with 64-bit data); 64-bit/66 MHz PCI connector; PCI FPGA (Xilinx Spartan 2); PCI COMMS bus (32-bit data, 40 MHz); BenNUEY user FPGA (Xilinx Virtex2 6000, -4); local bus (64-bit data, 66 MHz); inter-FPGA communication bus (159 I/O, user-defined clk); BenBLUE-II primary and secondary FPGAs (Xilinx Virtex2 6000, -4).]

Figure 4.11. Block diagram for the hardware architecture of the Nallatech BenNUEY board with a BenBLUE-II extension card. Courtesy: BenNUEY Reference Guide [34].

Hence, this target system could provide a performance which is approximately 4x to 8x better when compared to the RC1000 board.

4.4.2 Honeywell Reconfigurable Space Computer (HRSC)

Figure 4.12 shows a high-level block diagram of the HRSC board. It comprises four Xilinx FPGAs, two of them being Virtex-1000 and the other two from the VirtexII series (VirtexII 2000). The presence of a larger amount of resources and faster chips should lead to higher frequency designs, leading to an improvement of approximately 2x over the RC1000 design. The system is also very rich and flexible in the amount of memory resources and the way the memory is interconnected to all the FPGAs. All the memories are dual-ported and provide more flexibility in input buffering as well as providing higher bandwidth.

[Figure 4.12 diagram: PMC slots A and B with front-panel user I/O; cPCI, PMCA and PMCB interfaces (PLX 9054); two VirtexII 2000 FPGAs on the PMC local buses; processing elements PE1 and PE2 (Virtex 1000); dual-ported memories (DPM, 256x32) on the memory buses; configuration manager with configuration cache, user I/O and interrupts.]

Figure 4.12. Block diagram of the hardware architecture of HRSC board [29].
