GPU Acceleration of SAR/ISAR Imaging Algorithms


Gary Rubin
Earl V. Sager, Ph.D.
David H. Berger, Ph.D.

ABSTRACT

General Purpose Graphical Processor Units (GPGPUs) provide increased processing capability for applications with a high degree of data parallelism. In the past few years, GPGPUs have become readily available in the commercial market, and off-the-shelf programming tools (e.g. CUDA from the NVIDIA Corporation and Jacket from Accelereyes, LLC) have made them more accessible to the technical community. SAR and ISAR imaging algorithms are inherently computationally intensive. To overcome the performance limitations of CPUs and traditional DSPs, simplified, computationally efficient algorithms are often used, but at the expense of the phase information available within the raw data. We have demonstrated that GPGPU acceleration greatly improves the processing times of a less efficient (but more flexible) SAR/ISAR algorithm, making its use more practical. We have shown that GPGPUs can provide performance improvement in excess of 30x for a backprojection-based SAR/ISAR imaging technique.

Keywords: Algorithms, Computations, Data Processing, Imaging, Inverse SAR, Radar, Signal Processing, Synthetic Aperture Radar

1.0 Introduction

For decades, Synthetic Aperture Radar (SAR) and Inverse SAR (ISAR) imaging techniques have been used to represent radar data in a way that is meaningful to a human analyst. The calculations necessary for these imaging techniques are computationally complex, and since the advent of SAR and ISAR imaging, computational throughput has been a critical factor in the development, implementation, and use of processing algorithms. With the steady advance of computational power provided by CPUs, vector processors, and DSPs, imaging routines have become steadily faster over the years.
Chip makers, however, have reached a clock-speed plateau, as processors with increasingly high transistor densities are no longer able to dissipate the heat associated with increasingly high clock speeds. This heat dissipation challenge has led to the recent trend away from ever-increasing clock speeds and instead has resulted in an explosion of multi-core, commercial-grade processors. Multi-core processors offer the potential for extremely high throughput, with the ability to achieve such results depending greatly on the nature of the computations being performed. This paper describes recent efforts to use massively-parallel commercial GPUs to accelerate a Backprojection Algorithm (BPA)-based imaging routine. The paper begins with a discussion of the imaging algorithms, followed by a description of the acceleration process and results.

2.0 Imaging Algorithms

Over the years, radar scientists and engineers have developed a wide variety of imaging algorithms and implementations. These include Range Doppler (RDA), Polar Format (PFA), Chirp Scaling (CSA), Range Migration (RMA), and Backprojection or Time-Domain Correlation (BPA or TDC). Each algorithm has its own strengths and weaknesses, and choosing an optimal imaging algorithm typically depends on radar parameters and mission requirements [1,2].


For example, RDA provides a good balance between accuracy and efficiency, but at the expense of bandwidth and aperture length [2]. CSA is computationally efficient, but can limit scene size and image resolution. It also operates only on radar data that has not been de-chirped [1]. PFA can operate on de-chirped data, but may introduce geometric distortion [1]. This paper focuses primarily on an implementation of BPA, as described by [3]. BPA can image to an arbitrary surface, can provide phase and amplitude history for an image pixel, and is easily applicable to both SAR and ISAR imaging. BPA can be quite slow, but it exhibits a high degree of data parallelism, defined as simultaneous operations across large sets of data, rather than from multiple threads of control [4].

2.1 Backprojection Algorithm Implementation

We have tested our imaging routines using an ISAR data set collected by an SPC MkV instrumentation radar. The MkV used a 256-step chirp from 8-12 GHz to measure a Saab 9000 hatchback as the Saab was rotated on a turntable. A 360-degree image of the Saab 9000 is shown in Figure 1.

Figure 1. ISAR image of Saab 9000 hatchback. Data represents 256-step 8-12 GHz chirp and 360 degrees of rotation. Pixel resolution is 1 cm. The image was created using our BPA implementation.

Our implementation of BPA is similar to that described by [3] and comprises the following steps:

1. Perform a downrange DFT on the radar data to obtain a range (fast time) vs. position (slow time) data array, phase corrected to a reference range (Figure 2).
2. Form an image grid that defines the spatial location of each pixel relative to the radar. This image grid can be either planar or non-planar (Figure 3).
3. For each burst, calculate the slant range between each pixel and the radar location. The white traces in Figure 2 represent two examples of pixel slant range vs. burst number.
4. For each radar burst, use the pixel ranges calculated in step 3 to assign radar range cells to pixels.
5. For each pixel, coherently sum each pulse's signal contribution.

Figure 2. High-range-resolution (HRR) vs. burst number for car ISAR data. This figure represents zero-padded step-chirp data for 180 degrees of rotation. Solid ("x" in Figure 3) and dashed ("+" in Figure 3) traces represent slant-range profiles for the pixels identified in Figure 3.

Figure 3. Image Grid. Two arbitrary pixels are highlighted by the red + and magenta x. The black dot represents the ISAR center of rotation.

Cross-range motion is provided either by the radar motion (SAR) or the target motion (ISAR). In our implementation, the image grid is held fixed, while the radar position is imagined to rotate around the center of the image grid.
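As a rough illustration of the five steps listed above, the following MATLAB sketch images a single synthetic point scatterer. It is an illustrative sketch rather than the implementation benchmarked in this paper: the scatterer position, burst angles, grid spacing, FFT length, and all variable names are assumptions, the data are simulated rather than measured, and linear interpolation is used as one simple way of assigning range cells to pixels.

```matlab
% Backprojection sketch on synthetic ISAR data. All names and values
% below are illustrative assumptions, not the benchmarked implementation.
c      = 299792458;                      % speed of light (m/s)
freqs  = linspace(8e9, 12e9, 256).';     % 256-step chirp, 8-12 GHz (column)
angles = linspace(0, pi/6, 64);          % assumed burst angles (rad)
R0     = 100;                            % reference range to grid center (m)
tgt    = [0.5, -0.3];                    % assumed point scatterer (x, y), m

% The grid is held fixed; the radar is imagined to rotate around its center.
radarX = R0*cos(angles);
radarY = R0*sin(angles);

% Simulated phase history, phase corrected to the reference range R0.
dR   = hypot(radarX - tgt(1), radarY - tgt(2)) - R0;   % 1 x nBursts
data = exp(-1j*4*pi/c * (freqs*dR));                   % nFreq x nBursts

% Step 1: zero-padded downrange transform -> range vs. burst array.
nFft = 2048;
rp   = fftshift(ifft(data, nFft, 1), 1);               % nFft x nBursts
dr   = c / (2*(freqs(2) - freqs(1))*nFft);             % range-bin spacing (m)
rAx  = (-nFft/2 : nFft/2 - 1).' * dr;                  % range relative to R0

% Step 2: planar image grid centered on the center of rotation.
[X, Y] = meshgrid(-1:0.02:1, -1:0.02:1);
img    = zeros(size(X));

for b = 1:numel(angles)
    % Step 3: slant range from every pixel to the radar (relative to R0).
    dRp = hypot(radarX(b) - X, radarY(b) - Y) - R0;
    % Step 4: assign range cells to pixels by linear interpolation.
    s = interp1(rAx, rp(:, b), dRp, 'linear', 0);
    % Step 5: restore the start-frequency phase and sum coherently.
    img = img + s .* exp(1j*4*pi*freqs(1)/c * dRp);
end

% The brightest pixel should fall at or next to the simulated scatterer.
[~, k] = max(abs(img(:)));
peakXY = [X(k), Y(k)];
```

Steps 3 through 5 operate on the whole pixel grid at once inside the burst loop, which is the data-parallel structure that vectorization and GPU execution exploit.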

Target rotation causes the red and magenta pixels in Figure 3 to trace the lines shown in Figure 2.

In our implementation, Step 1 is a data-parallel operation across the multiple radar bursts, while Step 5 is a data-parallel operation performed inside the Step 4 for loop.

3.0 Code Acceleration

There are two processes associated with GPU acceleration. First, the code must be written so that its operations are highly data-parallel. For The MathWorks' MATLAB, this requires that the code be vectorized (see Section 3.1). Second, once the algorithm has been implemented in a data-parallel manner, it can be targeted to the GPU, as described in Section 3.2.

3.1 Code Vectorization

The BPA algorithm described above has been implemented in MATLAB R2010a. Initially, the code was translated from FORTRAN to MATLAB and relied heavily on nested for loops. The code was then largely rewritten using the MATLAB art of vectorization. In MATLAB, vectorization refers to taking advantage of polymorphism, a language feature that allows the same line of code to apply to scalars, vectors, or matrices. MATLAB can perform these vectorized calculations much more efficiently than loops and automatically multithreads some operations [5]. Depending on the nature of the calculations, vectorization may involve a trade-off between CPU efficiency and memory usage. Memory limitations may therefore prevent some vectorization.

A simple example of vectorization is as follows. Define two random data vectors.

    A=rand(10000,1);
    B=rand(10000,1);

Preallocate an output vector C, then perform the operation using a loop.

    C=zeros(10000,1);
    for indx=1:length(A)
        C(indx)=A(indx)*(B(indx)^2);
    end

Perform the same calculation as a vector operation. The .* and .^ operators refer to element-by-element vector operations.

    C=A.*(B.^2);

Using a 2.67 GHz Intel Core i7-920, the vectorized implementation executed in roughly 40% of the time of the looped expression.

Another very useful vectorization function in MATLAB is bsxfun, which allows for efficient matrix-vector arithmetic. Consider the following example. Generate a random 2000x1000-element matrix.

    A=rand(2000,1000);

Preallocate an output matrix. For each of the 1000 columns, calculate the column mean and subtract it from each element in the column. Some vectorized calculations are used here as well, as CurrentCol is a vector and mean(CurrentCol) is a scalar.

    NuA1=zeros(size(A));
    for indx=1:size(A,2)
        CurrentCol=A(:,indx);
        NuA1(:,indx)=CurrentCol-mean(CurrentCol);
    end

Perform the same operation using the matrix implementation of mean and bsxfun. Here, MeanA is a vector, while A is a matrix.

    MeanA=mean(A,1);
    NuA2=bsxfun(@minus,A,MeanA);

In this case, the bsxfun implementation runs approximately 4x faster than the loop iteration on the 2.67 GHz Intel Core i7-920.

3.2 GPU Implementation

Over the past several years, graphics processing unit (GPU) technology has experienced dramatic growth in terms of computational performance (Figure 4).

Figure 4. Growth in NVIDIA GPU performance vs. CPU performance. Solid lines are single-precision GFLOPS; dashed lines are double-precision GFLOPS [6].

AMD/ATI and NVIDIA are leaders in the GPU market, and both support non-graphics general-purpose GPU (GPGPU) applications. We have chosen to use NVIDIA GPUs due primarily to the maturity and community support of their CUDA development environment.

To reduce schedule risk for the GPU acceleration effort described in this paper, we decided to avoid writing our own CUDA code. Instead, improvements to runtimes were achieved through the use of Accelereyes, LLC's Jacket software platform. Jacket serves as nearly transparent middleware, allowing execution of MATLAB code on CUDA-capable NVIDIA GPUs directly from the MATLAB development environment. Jacket achieves this by overloading most base MATLAB functions. When these functions are called using special Jacket data classes, Jacket builds an internal representation of the program being run, compiles that representation if necessary, performs the computation on the GPU, and makes the results available to MATLAB if requested (leaving data GPU-resident as long as possible). Because Accelereyes has written Jacket to work with existing MATLAB syntax, parallelizing operations for GPU use is essentially identical to the MATLAB vectorization described in Section 3.1.

3.3 Acceleration Results

Benchmarking was performed using the system described in Table 1.

Table 1. Benchmark System

    CPU               Intel Core i7-920, 2.67 GHz
    Motherboard       EVGA X58 SLI
    Memory            12 GB DDR
    OS (dual boot)    Windows 7 Professional 64-bit; CentOS 5.5
    GPU 1             NVIDIA C1060 w/ 4 GB GDDR3 (~$1300)
    GPU 2             NVIDIA GeForce 9800 GT w/ 512 MB
    MATLAB Version    R2010a
    Jacket Version    1.3

The Core i7-920 CPU provides four processing cores, each with two processing threads. Of these eight available processing threads, two are typically used during CPU benchmarking. The CPU resources could be applied more efficiently by using MATLAB's Parallel Computing Toolbox to spread the burst for-loop iterations across the multiple threads.
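One way the burst loop could be spread across CPU workers is with the Parallel Computing Toolbox's parfor construct, sketched below. This was not part of the benchmarked configuration. The per-burst term is a made-up placeholder (a complex exponential) standing in for a burst's image contribution, and the variable names and sizes are assumptions. Here img acts as a parfor reduction variable, so iterations may run in any order.

```matlab
% parfor sketch: accumulate per-burst contributions into one image.
% The per-burst term is a placeholder, not the actual BPA contribution.
nBursts = 16;
[X, Y]  = meshgrid(linspace(-1, 1, 64));

img = zeros(size(X));
parfor b = 1:nBursts
    % img is a reduction variable: each worker adds its bursts' terms.
    img = img + exp(1j * b * (X + Y));
end

% Serial reference loop for comparison.
ref = zeros(size(X));
for b = 1:nBursts
    ref = ref + exp(1j * b * (X + Y));
end

% The two results should agree to within floating-point summation order.
maxDiff = max(abs(img(:) - ref(:)));
```

Because each burst adds an independent term to the image, the loop body has no dependence between iterations, which is what makes it a candidate for this kind of distribution.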
Similarly, the Parallel Computing Toolbox can be used in conjunction with Jacket to spread the processing among multiple GPUs.

For the imaging performance benchmarks, the dataset described in Table 2 and shown in Figure 1 was used.

Table 2. Benchmark Dataset

    Collection System       SPC MkV radar
    Collection Mode         ISAR
    Waveform Type           Step-chirp
    Start Frequency (GHz)   8
    Stop Frequency (GHz)    12
    Chirp Bandwidth (GHz)   4
    Frequency Steps         256
    Angle Start (rad)
    Angle Stop (rad)
    Angle Step (rad)
    Range (m)               100
    Subject                 Saab 9000 hatchback

It is understood that the pixel resolutions used for benchmarking are much higher than the resolution supported by the actual data. While these resolutions may be artificially high for this particular dataset, they were used to demonstrate computational performance for large image sizes of the type that might be used for airborne or spaceborne SAR. Because the ISAR BPA implementation is virtually identical to the SAR BPA implementation, we believe that it is valid to use an ISAR dataset to demonstrate image sizes that are more typical of SAR.

Execution times for the Core i7-920 CPU-only BPA ISAR imaging algorithm are shown in Figure 5. The C1060 GPU-enabled runtimes are shown in Figure 6.

Figure 5. Runtimes for BPA ISAR imaging of Saab 9000 using Core i7-920 CPU under Win7 Pro 64-bit. Burst counts correspond to 10, 60, 120, 180, and 240-degree sectors.

Figure 6. Runtimes for BPA ISAR imaging of Saab 9000 using NVIDIA C1060 GPGPU under Win7 Pro 64-bit. Burst counts correspond to 10, 60, 120, 180, and 240-degree sectors.

Speedup is defined as CPU runtime divided by GPU runtime. Figure 7 shows speedup for the runtimes shown in Figure 5 and Figure 6. We believe that the sharp decrease in speedup after 15 megapixels is due to a memory efficiency threshold associated with the larger data arrays.

Figure 7. GPU Speedups; C1060 vs. Core i7-920 under Win7 Pro 64-bit.

4.0 Related Work

In addition to the imaging acceleration described in this paper, SPC has also demonstrated GPU acceleration for surface-surveillance radar clutter reduction. For that processing, we were able to demonstrate speedups of roughly 10x vs. the Core i7-920 and roughly 5x vs. a real-time DSP implementation. SPC has also begun GPU acceleration of PFA-based imaging routines. This work was still in progress at the time of publication of this paper.

5.0 Summary

We have demonstrated that highly parallel, computationally complex tasks, such as those associated with BPA SAR/ISAR imaging, can be greatly accelerated through the use of GPUs. We have demonstrated improvements in BPA runtime in excess of 30x, meaning that the GPU allows processing that might take an entire workweek on a standard desktop PC to be completed in a little over an hour. Such runtime improvements increase the practicality of BPA as a large-scale imaging routine.

The speedups presented in this paper should not be seen as an upper limit. It is very likely that additional speed improvements could be realized by further optimization of the BPA code. It is also anticipated that GPU performance will improve further as CUDA and Jacket evolve.
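The workweek comparison in the summary above follows from the speedup definition. As a rough check, assuming a 40-hour workweek (an illustrative figure) and a speedup S of exactly 30:

```latex
\[
S = \frac{t_{\mathrm{CPU}}}{t_{\mathrm{GPU}}}
\quad\Longrightarrow\quad
t_{\mathrm{GPU}} = \frac{t_{\mathrm{CPU}}}{S}
= \frac{40\ \mathrm{h}}{30} \approx 1.33\ \mathrm{h} \approx 80\ \mathrm{min}
\]
```

Speedups above 30x shorten this further, consistent with "a little over an hour."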
6.0 References

[1] Carrara, W.G., Goodman, R.S., and Majewski, R.M., Spotlight Synthetic Aperture Radar: Signal Processing Algorithms, Norwood, MA: Artech House, 1995.
[2] Cumming, I.G., and Wong, F.H., Digital Processing of Synthetic Aperture Radar Data, Norwood, MA: Artech House, 2005.
[3] Soumekh, M., Synthetic Aperture Radar Signal Processing with MATLAB Algorithms, New York: John Wiley & Sons, 1999.
[4] Hillis, W.D., and Steele, G.L., "Data Parallel Algorithms," Communications of the ACM 29, 12 (Dec. 1986).
[5] "Which MATLAB functions benefit from multithreaded computation?", MATLAB Technical Solution.
[6] Source: NVIDIA via personal correspondence.

7.0 Acknowledgments

Thanks to Gallagher Pryor, Dave Gibson, and others at Accelereyes, LLC for their inputs and technical advice.
