IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 10, NO. 2, FEBRUARY 2017

A Deep Collaborative Computing Based SAR Raw Data Simulation on Multiple CPU/GPU Platform

Fan Zhang, Member, IEEE, Chen Hu, Student Member, IEEE, Wei Li, Member, IEEE, Wei Hu, Pengbo Wang, and Heng-Chao Li, Senior Member, IEEE

Abstract—The outstanding computing ability of a graphics processing unit (GPU) brings new vitality to typical computing-intensive issues, and so to synthetic aperture radar (SAR) raw data simulation, which is a fundamental problem in SAR system design and imaging research. However, the computing power of a CPU was underestimated, and the tunings for a CPU-based method were missing in previous works. Meanwhile, the collaborative computing of multiple CPUs/GPUs was not exploited thoroughly. In this paper, we propose a deep multiple CPU/GPU collaborative computing framework for time-domain SAR raw data simulation, which not only introduces the advanced vector extension (AVX) method to improve the computing efficiency of a multicore single instruction multiple data (SIMD) CPU, but also achieves a satisfactory speedup in the CPU/GPU collaborative simulation by fine-grained task partitioning and scheduling. In addition, an irregular reduction based SAR coherent accumulation approach is proposed to eliminate the memory access conflict, which is the most difficult issue in GPU-based raw data simulation. Experimental results show that the multicore vector extension method greatly improves the computing power of a CPU-based method, reaching about a 70× speedup and thereby outperforming the single GPU simulation. Correspondingly, compared with the baseline sequential CPU approach, the multiple CPU/GPU collaborative simulation achieves up to a 250× speedup. Furthermore, the irregular reduction based atomic-free optimization boosts the performance of the single GPU method by 20%. These results prove that the deep multiple CPU/GPU collaborative method is promising, especially for the case of huge-volume raw data simulation with a wide swath and high resolution.

Index Terms—Advanced vector extensions (AVX), collaborative simulation, graphics processing unit (GPU), raw data generation, synthetic aperture radar (SAR).

Manuscript received February 21, 2016; revised May 29, 2016 and July 5, 2016; accepted July 21, 2016. Date of publication October 27, 2016; date of current version January 23, 2017. This work was supported in part by the National Natural Science Foundation of China under Grant and Grant , in part by the Beijing Natural Science Foundation under Grant , and in part by the Fundamental Research Funds for the Central Universities under Grant BUCTRC and Grant BUCTRC. (Corresponding author: Fan Zhang.)

F. Zhang, W. Li, and W. Hu are with the College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, China (e-mail: zhangf@mail.buct.edu.cn; leewei36@gmail.com; huwei@mail.buct.edu.cn).

C. Hu is with the College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, China (e-mail: huchen_buct@163.com).

P. Wang is with the School of Electronic and Information Engineering, Beihang University, Beijing, China (e-mail: wangpb7966@163.com).

H.-C. Li is with the Sichuan Provincial Key Laboratory of Information Coding and Transmission, Southwest Jiaotong University, Chengdu, China (e-mail: lihengchao_78@163.com).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSTARS
I. INTRODUCTION

DUE to its day-and-night and weather-independent imaging characteristics, synthetic aperture radar (SAR) has been widely used for Earth remote sensing for more than 30 years, and has come to play a significant role in geographical survey, climate change research, environment and Earth system monitoring, multi-dimensional mapping, and other applications [1]. With the evolution of SAR technologies, SAR information acquisition with high resolution, wide swath, multiple modes, and multiple dimensions has become possible. In the foreseeable future, more advanced multiplatform, multimode, multiband, multipolarization SAR systems will be developed to satisfy the emerging requirements.

Since SAR flight experiments are time consuming and highly expensive, computer simulations are often applied to assist the key technology research, system design, system development, and even the data applications. To quantitatively support the design of a SAR operating in the hybrid mode, to help mission planning, and to test processing algorithms, a SAR raw signal simulator is required, especially when real raw data are not yet available [2]. Therefore, it is highly desirable that accurate SAR raw data can be quickly simulated to support such research for the new generation of SAR sensors.

From the perspective of simulation processes, SAR raw data simulations can be divided mainly into two classes: SAR oriented and SAR processing oriented [3]. The SAR oriented algorithms, including the time-domain algorithm [4], [5], the FFT-based time-domain algorithm [2], and the two-dimensional (2-D) frequency-domain algorithm [6], simulate the physical process of microwave transmission and reception, and then calculate the SAR raw data. Comparatively, the SAR processing oriented algorithms, including the inverse fourth-order extended exact transfer function algorithm [7], the inverse ω-κ algorithm [6], [8], the inverse Chirp Scaling algorithm [9], and the inverse frequency scaling algorithm [10], simulate the SAR raw data based on the inversion of SAR imaging processing. The 2-D frequency-domain algorithm and the inverse processing algorithms are efficient, and are often employed for the scene simulation of stripmap [6], [11], [12], spotlight [13], and hybrid [2] modes, etc. However, the 2-D frequency-domain algorithm has difficulty accounting for the actual system errors. To account for these errors, some assumptions need to be introduced into these algorithms, resulting in application limitations such as narrow beam and slow deviation [14]-[16]. On the other hand, the time-domain algorithm is highly time consuming, but is

able to exactly account for these system errors. Therefore, if we are interested in a raw signal simulation from a geometrically limited target where the platform is liable to mechanical oscillations, orbital deviations, etc., the time-domain approach is a good choice [5]. But, its high computational complexity allows one to deal only with an imaged scene consisting of one or a few point scatterers [4]. In order to extend its application to scene simulation, the optimization of efficiency should be further studied.

For time-domain raw data simulation, the optimization of the algorithm under the precondition of keeping the precision intact does not achieve satisfactory performance. Due to the independence of the return signal gathering, parallelization is the most straightforward idea, and it can significantly shorten the simulation time with state-of-the-art high performance computing (HPC) technologies. In order to boost the efficiency of SAR raw data simulation, two categories of HPC methods have been employed: the CPU oriented method and the graphics processing unit (GPU) oriented method. The CPU oriented method mainly refers to the raw data parallel simulation on a CPU platform, such as open multi-processing (OpenMP) with multiple cores [17], the message passing interface (MPI) with multiple CPUs [18], and grid computing with multiple computers [19]. The accelerating effects of these methods are proportional to the number of CPUs, which induces a high parallel simulation cost. A GPU oriented method realizes the parallel simulation with massive cores on a GPU platform [20]-[23]. Compared with the CPU method, the GPU method achieves an acceleration of a dozen to several hundred times, and proves to be a high-efficiency, low-cost solution for raw data simulation. In our previous work [23], the parallel simulation on one GPU achieves around a 100× speedup over the CPU-based sequential algorithm.

However, the CPU, as one kind of computing resource, was ignored in the GPU-based SAR raw data simulation. Typically, the CPU cores remain idle while the GPU cores are busy computing. Heterogeneous CPU/GPU computing seems to be an optimal solution to further improve the simulation efficiency, in that there is a growing trend toward the institutional use of multiple computing resources (usually heterogeneous) as a sole computing resource [24]. The heterogeneous simulation is implemented with hybrid multicore CPU parallelism and massive-core parallelism on a GPU, temporarily called Collaborative Computing 1.0 (CC 1.0) [25]. Although CPU/GPU collaborative computing has been employed in some computing-intensive cases [26]-[30], it is not applicable to SAR raw data simulation due to the broad gap in performance between the CPU and GPU methods. Even if the CPU is exploited to assist the GPU computing, the effect is like a drop in a bucket.

Recently, the single instruction multiple data (SIMD) extension instructions, namely the streaming SIMD extensions (SSE) and the advanced vector extensions (AVX), have been introduced to further exploit the computing capability of a CPU. The SSE instruction method has been applied to classical topics, such as the fast Fourier transform (FFT) [31], the finite-difference time-domain (FDTD) method [32], image processing [33], [34], and graphics applications [35], and achieved 4× to 73× speedups with multicore SIMD CPUs. Even with only the SSE instructions, the acceleration is almost of the same level as that of GPU parallel computing.
More speedups can be expected if the SSE is expanded to AVX instructions. On the other hand, Hwu believes that the multicore CPU SIMD parallelism is an important trend in parallel computing, and can be effectively covered by multicore architectures [36]. Inspired by this research, we try to develop a deep collaborative computing method to make the AVX-based multicore CPU and the compute unified device architecture (CUDA) based massive core GPU work together to simulate the SAR raw data faster, called Collaborative Computing 2.0 (CC 2.0). To make CC 2.0 more efficient, a deep optimization of the GPU part is absolutely required. As the main body of collaborative computing, the GPU method has still an irregular parallel issue unresolved, namely the coherent accumulation of return signals, which leads to severe memory conflicts and reduces the simulation efficiency. The irregularities of signal accumulation are as follows. For each sampling unit, the distribution and number of target points that contribute to the return signals are changed with the movement of the flight platform, and can only be determined during the calculation of each azimuth sampling time. If these parameters are known and predictable, this coherent accumulation can be simplified to a generalized reduction issue, which is easy to parallelize. However, the coherent accumulation of return signals is an irregular reduction problem, in fact. As for GPU parallel computing, this kind of irregular and unpredictable application is difficult to achieve in high parallelization. In recent years, the GPU parallel algorithm for irregular issues has attracted a lot of attention [37] [40]. Although the specific irregularity of SAR coherent accumulation is different from those discussed in the aforementioned literatures, the idea of irregular reduction based parallel accumulation is feasible. Therefore, the GPU-based irregular reduction algorithm should be restudied to efficiently simulate the massive accumulation of return signals. This is for the first time that collaborative computing is applied to the SAR raw data simulation, which not only deeply exploits the computer resources to accelerate simulation, but also lays the foundation for the next-step implementation of distributed computing and cloud computing. Therefore, we propose a multicore SIMD CPU and multi-gpu deep collaborative computing based SAR raw data simulation method. Compared to our previous works [19] [23], we make the following contributions: 1) We introduce the AVX method to accelerate the multicore CPU-based SAR raw data simulation, and narrow the efficiency gap between CPU and GPU simulation. 2) We propose a multiple CPU/GPU deep collaborative computing framework that exploits the computing resources as much as possible and offers the CPU to handle more computing tasks like a GPU, to improve the whole efficiency of time-domain SAR raw data simulation. 3) We propose an irregular reduction mechanism to eliminate the memory access conflict in the classical GPU and multi- GPUs based raw data simulation, and further improve the computing efficiency. The rest of this paper is organized as follows. Section II briefly introduces the FFT-based time-domain SAR raw data simulation algorithm and the parallel computing mechanism. Section III presents the proposed multiple CPU/GPU deep collaborative computing based raw data simulation algorithm. Then, the

experimental results and optimization analysis are discussed in Section IV. Finally, conclusions are drawn in Section V.

II. TIME-DOMAIN SAR RAW DATA SIMULATION ALGORITHM AND ITS PARALLELIZATION

A. SAR Raw Data Simulation Algorithm

The echo signal model [41] is applicable to both airborne and space-borne SAR data. Assuming that the transmitted signal is a linear frequency modulated pulse, it is given by

s_t(\tau) = s_r(\tau)\exp(j w_c \tau) = \mathrm{rect}\left(\frac{\tau}{T_p}\right)\exp\left(j w_c \tau + j\pi k_r \tau^2\right)   (1)

where s_r(\tau) is the chirp signal, w_c is the carrier frequency, \tau is the range time, T_p is the signal pulse width, k_r is the chirp rate, and \mathrm{rect}(\cdot) is the range window of the pulse. Through the heterodyne receiver, the single-point echo in the SAR stripmap mode is expressed as the 2-D signal s_i(t, \tau)

s_i(t, \tau) = \sum_{n=0}^{T} \sigma_i W_a(\theta_i)\exp\left(-j\frac{4\pi r_i(t_n)}{\lambda}\right) s_r\left(\tau - \frac{2 r_i(t_n)}{c}\right)   (2)

where i is the order number of a single point in the target matrix, n is the order number in azimuth time, T is the number of azimuth samplings, t is the azimuth time, \sigma is the scattering coefficient, W_a is the azimuth antenna pattern, \theta_i is the angle between the azimuth phase center and target i, T_p is the signal pulse width, r_i(t_n) is the distance between target point i and the radar antenna phase center at time t_n, \lambda is the wavelength, and c is the speed of light. When the simulation objects are distributed targets, the discrete representation of the SAR echo signal can be obtained by

s(t, \tau) = \sum_{i=0}^{M} s_i(t, \tau)   (3)

where M is the total number of target points.

In practical engineering calculations, the FFT-based time-domain method, which calculates the accumulation over scattering target points by frequency-domain multiplication, is often applied for raw data generation with

s(t, \tau) = \sum_{n=0}^{T} s_a(t_n, \tau) * s_r(\tau) = \sum_{n=0}^{T} f^{-1}\left\{ f\left[ s_a(t_n, \tau) \right] \cdot S_r(\xi) \right\}   (4)

s_a(t_n, \tau) = \sum_{i=0}^{M} \sigma_i W_a(\theta_i)\exp\left(-j\frac{4\pi r_i(t_n)}{\lambda}\right)\delta\left(\tau - \frac{2 r_i(t_n)}{c}\right)   (5)

where f(\cdot) is the Fourier transform operator, f^{-1}(\cdot) is the inverse Fourier transform operator, S_r(\xi) is the linear FM signal spectrum, \delta(\cdot) is the Dirac delta function, and * indicates the convolution operator. In the procedure of simulation, the linear FM signal spectrum S_r(\xi) does not change, while the azimuth signal spectrum changes with different scattering points and azimuth times.

Fig. 1. The FFT-based time-domain SAR raw data simulation diagram.

According to (4) and Fig. 1, the raw data simulation algorithm includes the following five steps:

Step 1: A one-dimensional Fourier transform of s_r(\tau) is performed to yield S_r(\xi). Assuming that there is no moving target in the simulated scene, the constant scattering map \sigma and the azimuth antenna pattern W_a are loaded in advance.

Step 2: The geometry is calculated to obtain the range r and angle \theta. Furthermore, the targets within the footprint are selected by determining whether their \theta_i are smaller than half of the azimuth beam width. Then, s_a(t_n, \tau) is calculated and transformed into the frequency domain to generate S_a(t_n, \xi).

Step 3: The multiplication of S_r(\xi) and S_a(t_n, \xi) is implemented.

Step 4: The raw signal is obtained by the inverse Fourier transform of the result of Step 3.

Step 5: For all the azimuth sampling times, Steps 2-4 are repeated to obtain the complete simulated raw data.
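To make the five steps above concrete, the following host-side C++ sketch walks through one azimuth sampling time of the FFT-based time-domain simulation. It is a minimal illustration under assumed names: the Target structure, the geometry inputs, and the naive O(N^2) DFT helper are ours (an optimized FFT library and the azimuth antenna pattern weighting would be used in practice), and nothing here is taken verbatim from the paper's implementation.

// Minimal sketch of Steps 1-5 for a single azimuth time t_n.
#include <cmath>
#include <complex>
#include <vector>

using cpx = std::complex<float>;
static const float PI = 3.14159265358979f;

// Naive DFT used only for illustration; an FFT library replaces this in practice.
std::vector<cpx> dft(const std::vector<cpx>& x, bool inverse) {
    const size_t N = x.size();
    std::vector<cpx> X(N, cpx(0.f, 0.f));
    const float sign = inverse ? 1.f : -1.f;
    for (size_t k = 0; k < N; ++k) {
        for (size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.f, sign * 2.f * PI * k * n / N);
        if (inverse) X[k] /= static_cast<float>(N);
    }
    return X;
}

struct Target { float x, y, z, sigma; };   // hypothetical scene description

// Simulate the raw signal of one pulse (one azimuth time t_n).
std::vector<cpx> simulatePulse(const std::vector<Target>& scene,
                               const std::vector<cpx>& Sr,   // Step 1: spectrum of s_r
                               float px, float py, float pz, // antenna phase centre at t_n
                               float lambda, float c, float fs,
                               float r0, float halfBeam) {
    const size_t N = Sr.size();               // number of range gates
    std::vector<cpx> sa(N, cpx(0.f, 0.f));    // azimuth signal s_a(t_n, tau)

    // Step 2: geometry, footprint selection, and coherent accumulation.
    for (const Target& t : scene) {
        float dx = t.x - px, dy = t.y - py, dz = t.z - pz;
        float r = std::sqrt(dx * dx + dy * dy + dz * dz);
        float theta = std::atan2(dx, dy);               // crude squint-angle proxy
        if (std::fabs(theta) > halfBeam) continue;       // outside the footprint
        float g = (2.f * (r - r0) / c) * fs;             // fractional range-gate index
        if (g < 0.f || g >= static_cast<float>(N)) continue;
        float phase = -4.f * PI * r / lambda;
        sa[static_cast<size_t>(g)] += t.sigma * std::polar(1.f, phase);  // irregular accumulation
    }

    // Steps 3-4: multiply the spectra and transform back to the time domain.
    std::vector<cpx> Sa = dft(sa, false);
    for (size_t k = 0; k < N; ++k) Sa[k] *= Sr[k];
    return dft(Sa, true);                                // raw signal of this pulse
}
// Step 5 is the outer loop over all azimuth times (omitted here).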
B. Parallelization of Raw Data Simulation

According to the stop-and-go model, SAR raw data simulation is a serial process in time, and the coupling between transmitting and receiving pulses at different times is small. Therefore, we can take the transmitting and receiving of pulses as the task unit, which can be dispatched to every computation node and calculated quickly by MPI, grid computing, or other parallel technologies. The parallelism of SAR raw data simulation can be divided into a coarse-grained strategy and a fine-grained one, as shown in Fig. 1. The traditional parallel approach belongs to the former, which takes the repetitive transmitting and receiving pulse process as one task. The task assignment is completed by dispatching a reasonable number of simulated pulses to

different nodes, CPUs, or CPU cores, given by

s(t, \tau) = \sum_{k=0}^{m} D_k = \sum_{k=0}^{m} \sum_{n=t_k}^{T_{k+1}} s(t_n, \tau)   (6)

in which D_k represents the calculation task of node k and m indicates the number of subtasks.

Comparatively, a parallel simulation based on a GPU is a fine-grained parallel method, which optimizes the most time-consuming step. The task of every thread is the azimuth signal calculation of a single scattering point and the multiplication of a single sampling point (as shown in Fig. 1), given by

s(t, \tau) = f^{-1}\left\{ f\left[ \sum_{n=0}^{T} \sum_{i=1}^{M} D_{(n,i)} \right] \cdot S(\xi) \right\} = \sum_{n=0}^{T} f^{-1}\left\{ \sum_{j=0}^{N} D_{(n,j)} \right\}   (7)

where D_{(n,i)} is the azimuth signal of point i at time t_n, D_{(n,j)} is the spectrum product of the linear FM signal and the azimuth signal at range gate j at time t_n, and N is the number of range gates. With parallel task decomposition from the coarse-grained D_k to the fine-grained D_{(n,i)} and D_{(n,j)}, a higher efficiency of the parallel simulation is achieved. The SAR raw data fine-grained parallel simulation with CUDA not only accords with the physical process of echo generation, but also takes full advantage of GPU hardware resources and computing power. The method with CUDA is very suitable for large-area SAR raw data simulation.

However, there is an obvious issue in fine-grained parallel partitioning, namely the memory access conflict. At each flight sampling time, the signals from different target points at a similar distance will be collected in the same sampling unit, which is also called the range gate. The return signals of different targets need to be accumulated in the same memory cell if their range differences do not exceed the length of the range gate. Hence, there exists a massive memory access conflict. For the coarse-grained parallel simulation, its parallel task D_k includes the return signal simulation of all targets, and is completed by one thread. Consequently, there is no access conflict in the accumulation of a single-threaded execution. But, for the fine-grained parallel simulation, the parallel task D_{(n,i)} includes only one target's return signal simulation. The accumulation requires the participation of all threads, yielding access conflicts. To avoid such problems, the method of a thread synchronization lock has been considered, such as the atomic operation of CUDA. The essence of the atomic operation is to ensure that a single thread accesses the resource while the other threads are left in a waiting state. So, the parallel computing efficiency is reduced by several times.

The reduction algorithm is an optimal solution for huge data accumulation. The classical GPU-based reduction algorithm has been well optimized through different strategies, and achieves around a 30× speedup. But, the return signal accumulation has its irregularities, including the uncertainty of the accumulation location and count, as shown in Fig. 2. These irregularities lead to difficulties in applying the classical reduction algorithm.

Fig. 2. The irregularities of SAR coherent accumulation.

III. MULTIPLE CPU/GPU DEEP COLLABORATIVE BASED SAR RAW DATA SIMULATION

With the rapid development of HPC devices, heterogeneous and collaborative computing are the future trends in solving computing-intensive problems, especially the SAR raw data simulation of large areas.
Our previous work focuses only on using one category of computing device to accelerate the simulation, and actually needs to be expanded to multiple computing resources for higher efficiency simulation. Therefore, we mainly discuss the multiple CPU/GPU deep collaborative simulation framework, the improved multicore CPU parallel algorithm, and the improved GPU parallel algorithm, and try to improve the stand-alone computing power for each device, thus boosting the time-domain SAR raw data simulation with collaborative computing. A. Multiple CPU/GPU Deep Collaborative Simulation Framework Due to the gap in the capacity of computing between a CPU and a GPU, the CPU takes on only auxiliary tasks in the traditional CPU/GPU collaborative computing. In this sense, this kind of collaborative computing is actually a GPU-based approach in that the main computing tasks are performed by massive GPU threads. The concept of deep collaborative means that the CPU is not only seen as a manager, but also treated as a worker sharing the tasks of GPU computing. To achieve the deep collaborative computing, the computing capacity of a CPU should be strengthened by OpenMP-based multicore parallel and AVX-based vectorization parallel. After minimizing the gap between a CPU and a GPU in terms of computing power, they can be seen as unified computing devices to undertake the simulation tasks. The whole deep collaborative computing based SAR raw data simulation is shown in Fig. 3. In the deep collaborative simulation framework, there are three main processing steps, namely task partitioning and scheduling on a CPU, collaborative computing on multiple CPUs/GPUs, and data merging on a CPU. Due to the rigorous independence of returned signal simulation in different azimuth times, it can be taken as a task unit to be distributed among different computing resources. As shown in Fig. 3, the whole azimuth time, namely the whole simulation period, is divided

into five parts, which are dispatched to four GPUs and multiple CPUs. Then, the GPUs and CPUs execute their simulation tasks simultaneously. In the GPU simulation part, CUDA and its FFT library are employed to realize the raw signal simulation. In the CPU simulation part, OpenMP is used for multicore task scheduling and implementation, and AVX is applied in each CPU core to perform the numerical operations and FFTs in vectorized mode. The results of the CPUs are stored in place, and the results of the GPUs need to be transferred to CPU memory. After merging the GPU results with the CPU results, the whole simulated SAR raw data are finally obtained.

Fig. 3. Multiple CPU/GPU deep collaborative computing for raw data simulation.

B. Multiple Task Partitioning and Scheduling

According to the parallelization analysis of raw data simulation, the coarse-grained parallelism and the fine-grained parallelism are employed by the CPU method and the GPU method, respectively. In our previous work on multi-GPU-based raw data simulation, the hybrid scheduling of coarse-grained and fine-grained parallelism was introduced to finish the whole simulation. Specifically, the coarse-grained tasks D_k are distributed to different GPU cards, and the fine-grained tasks D_{(n,i)} are assigned to different GPU threads. As for the multiple CPU/GPU collaboration, the AVX-based multicore CPU and the GPU have almost the same computing power. Therefore, the hybrid scheduling strategy can be exploited here, which treats the AVX-based CPU as another GPU card for task partitioning and scheduling.

The task partitioning and scheduling strategies need to consider not only the limitation of memory size, but also the SAR principle, which indicates that the synthetic aperture effect should be considered in task partitioning, as shown in Fig. 4, where Na and Nr are the sizes of the raw data in the azimuth and range directions, T is the execution time of each subtask, and La is the synthetic aperture length. Under different simulation conditions, the target dataset can be equally divided into n parts along the flight direction. The subdata are distributed to the various CPUs and GPUs simultaneously, and then redistributed after the calculation. According to the SAR principle, a simulated area of Na/n in width can generate raw data of Na/n + La in width. So, there is an overlap of La in width between different calculation results (as illustrated in Fig. 4), which should be additionally considered in signal merging. On the other hand, we can also equally divide the task in the raw signal domain and make the overlap of La in width between different target data in the target domain. Although this partitioning strategy is a bit more complicated, the data merging is relatively simple.

Fig. 4. Task partitioning strategy.

On the basis of the task partitioning strategy, the scheduling strategy should be discussed specifically, because scheduling among identical computing devices differs from scheduling among different computing devices. The traditional multi-GPU-based method is a combination of serial processing on the CPU and parallel processing on the GPU. The serial code functions include the input operation, the task partitioning and scheduling, and the data transfer between the CPU and the GPU. According to the distributed tasks, the multiple GPUs simulate the SAR raw signals in a massive parallel mode.
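To make the azimuth-block partitioning with the La overlap described above more concrete, the following host-side C++ sketch splits the azimuth extent into n coarse-grained subtasks and records the overlap that the merging step must handle. The AzimuthBlock structure, the function name, and the example numbers are illustrative assumptions, not taken from the paper.

#include <cstdio>
#include <vector>

// One coarse-grained subtask D_k: a slice of the azimuth extent plus its overlap.
struct AzimuthBlock {
    int firstLine;   // first azimuth line of the scene slice
    int numLines;    // Na / n lines of scene data assigned to this device
    int overlap;     // La extra lines of raw data shared with the neighbouring block
};

// Split Na azimuth lines into n blocks. Each block produces raw data that is
// (Na/n + La) lines wide, so neighbouring results overlap by La lines and the
// overlapping regions have to be summed coherently when the raw data are merged.
std::vector<AzimuthBlock> partitionAzimuth(int Na, int La, int n) {
    std::vector<AzimuthBlock> blocks;
    int width = Na / n;                 // assume Na is divisible by n for brevity
    for (int k = 0; k < n; ++k)
        blocks.push_back({k * width, width, La});
    return blocks;
}

int main() {
    // Example: 4 GPUs plus 1 AVX CPU treated as a fifth "card", Na = 8000, La = 200.
    for (const AzimuthBlock& b : partitionAzimuth(8000, 200, 5))
        std::printf("block at line %d: %d lines + %d overlap lines\n",
                    b.firstLine, b.numLines, b.overlap);
    return 0;
}

Whether the overlap is attached in the raw signal domain or in the target domain, as discussed above, only changes which side of the split carries the extra La lines; the bookkeeping is otherwise the same.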
Since the computing capacity of GPU cards is similar or computable, the static allocation strategy, namely the one-time task allocation, combined with a parallel pipeline for hiding data transfer, proves to be feasible and efficient. But, the difference between CPU and GPU computing is hard to estimate in multiple CPU/GPU collaborative computing. The static scheduling

strategy will lead to idle computing resources and reduce the collaborative efficiency. Therefore, the task should be divided into smaller parts to gradually balance the CPU part and the GPU part. The first few blocks are employed to test the elapsed time difference between CPU and GPU computing. After several rounds of dynamic task adjustment, the CPUs and GPUs are fully occupied, and the optimal collaborative efficiency can be expected, as shown in Fig. 5. In addition, in the calculation of the running time, only the parallel simulation time is considered for the CPU part because of its in-place operation. But, for the GPU part, in addition to the GPU computing, the data transfer and data merging are taken into account.

Fig. 5. Task dynamic scheduling strategy.

C. Multicore CPU Vectorization Parallel Based Raw Data Simulation

Compared to the wide application of GPUs, the computing power of a CPU is usually underestimated. Furthermore, general CPU parallel applications are still stuck in the old way of multicore parallelism through OpenMP and multi-CPU parallelism through MPI. In order to further exploit the CPU parallel capability, the optimal solution should be the SIMD computing model, which is implemented by the SSE and AVX instructions. The SSE instructions can perform four basic operations with only one instruction, that is, SSE gives us an extra 4× speedup on the basis of the multicore acceleration, as shown in Fig. 6. Furthermore, the AVX technologies, supported by the new-generation CPUs of Intel and AMD introduced in 2011 and after, expand the SIMD register from 128 to 256 bits, which means one AVX instruction can operate on eight single-precision floating-point values. Therefore, with the combination of OpenMP-based multicore parallelism and AVX-based vector extension, the SAR raw data simulation can expect a GPU-level speedup by applying AVX on a multicore CPU.

Fig. 6. AVX parallel based signal coherent accumulation simulation.

The improved CPU approach includes two levels of parallelism: coarse-grained parallelism across cores and fine-grained parallelism through vectorization. For the multicore parallelism, the task unit is the whole returned signal simulation of one azimuth time, which is a coarse-grained parallelism. For the vectorization parallelism, the task unit is a numerical operation in the process of simulation, which is a further refinement on top of the multicore parallelism.

Algorithm 1: SAR raw data simulation: AVX-based multicore CPU version
1: for each n ∈ [0, T−1] do
2:   Multicore: each thread separately simulates the raw data of one azimuth time t_n
3:   for each i ∈ [0, M−1], i += 4 do
4:     SIMD: the superscript 4 indicates a vector operation on four items under one AVX instruction
5:     r_i^4(t_n) ← compute the range of each of the four targets
6:     fp_i^4(t_n) ← determine whether the four targets are in the footprint
7:     rg_i^4(t_n) ← compute the range gate of each of the four targets
8:     θ_i^4(t_n) ← compute the return signal phase of each of the four targets
9:   end for
10:  for each i ∈ [0, M−1], i += 1 do
11:    Serial code: implement the irregular coherent accumulation
12:    s_a[rg_i(t_n)] += σ_i exp(θ_i(t_n))
13:  end for
14:  SIMD: the following executions use AVX
15:  f(s_a) ← perform the 1-D Fourier transform
16:  f(s_a) · S_r ← compute the multiplication of the two frequency-domain signals
17:  f^{-1}{f(s_a) · S_r} ← perform the 1-D inverse Fourier transform
18: end for
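As a rough illustration of how the vectorized part of Algorithm 1 (lines 3-8) might be written with AVX intrinsics, the sketch below computes the slant range and return-signal phase of eight single-precision targets per instruction (the listing above steps by four). The structure-of-arrays layout, the function name, and the padding assumption are ours, not the paper's; the footprint test, range-gate computation, and the irregular accumulation remain scalar, as in lines 10-13.

#include <immintrin.h>   // AVX intrinsics (compile with -mavx)

// Vectorized range and phase computation for one azimuth time t_n.
// tx/ty/tz: target coordinates in a structure-of-arrays layout (assumed here);
// px/py/pz: antenna phase centre at t_n; M is assumed padded to a multiple of 8.
void rangeAndPhaseAVX(const float* tx, const float* ty, const float* tz,
                      float px, float py, float pz, float lambda,
                      float* range, float* phase, int M) {
    const __m256 vpx = _mm256_set1_ps(px);
    const __m256 vpy = _mm256_set1_ps(py);
    const __m256 vpz = _mm256_set1_ps(pz);
    const __m256 vk  = _mm256_set1_ps(-4.0f * 3.14159265f / lambda);

    for (int i = 0; i < M; i += 8) {        // eight targets per AVX instruction
        __m256 dx = _mm256_sub_ps(_mm256_loadu_ps(tx + i), vpx);
        __m256 dy = _mm256_sub_ps(_mm256_loadu_ps(ty + i), vpy);
        __m256 dz = _mm256_sub_ps(_mm256_loadu_ps(tz + i), vpz);
        __m256 r2 = _mm256_add_ps(_mm256_mul_ps(dx, dx),
                      _mm256_add_ps(_mm256_mul_ps(dy, dy), _mm256_mul_ps(dz, dz)));
        __m256 r  = _mm256_sqrt_ps(r2);                     // slant ranges r_i
        _mm256_storeu_ps(range + i, r);
        _mm256_storeu_ps(phase + i, _mm256_mul_ps(vk, r));  // -4*pi*r_i/lambda
    }
}

// In the scheme of Algorithm 1, OpenMP distributes azimuth times across cores,
// e.g. in the caller:
//   #pragma omp parallel for
//   for (int n = 0; n < T; ++n) { /* ... */ rangeAndPhaseAVX(/* ... */); /* ... */ }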
As for the detailed implementation, SAR raw data simulation can be seen as a series of numerical operations and FFTs. The AVX-supported FFT library can be employed for the FFT operations in the simulation. Therefore, the key issue is how to use AVX for vectorization processing of numerical operations. In order to explain the AVX-based numerical operations, the vectorization processing of signal coherent accumulation, which is the most time-consuming part in the whole raw data simulation, is shown in Fig. 5 and Algorithm 1. The task of each OpenMP core is the calculation of returned signal in each azimuth time in which the AVX instructions are performed. Moreover, the vectorization processing requires regular data storage for vector construction. So, the following different vectorization strategies will be applied in each OpenMP core computing operation:

7 ZHANG et al.: DEEP COLLABORATIVE COMPUTING BASED SAR RAW DATA SIMULATION ON MULTIPLE CPU/GPU PLATFORM 393 1) In azimuth signal calculation, vectorization toward target points will be used, namely one SSE/AVX instruction will finish the range history and corresponding complex signal calculations for 2/4 target points. 2) In the step of return signal coherent accumulation, a vector extension method is not applicable. Because the return signals of each of the adjacent four target points may not be received in the continuous range gate, the accumulation will be processed as a sequential execution. 3) In the frequency transformation step, the latest FFTW library with SSE/AVX support is used to accelerate calculation with vector extension. 4) In spectrum multiplication, vectorization toward data dimension is applied, namely one SSE/AVX instruction will finish multiplications for 2/4 data sampling points. D. Improved GPU Parallel Simulation With Irregular Reduction Algorithm As the previous analysis, the irregularities of signals coherent accumulation make it hard for fine-grained parallelization. For a multicore vector extension method, the distribution of target points that contributed to the same range gate is discontinuous, which makes the vector construction difficult. For the GPU method, the uncertainty of target points in the same range gate leads to sever access conflict. Although the classical GPU based SAR raw data simulation is a kind of mature parallel algorithm [20] [23], which has been extensively discussed, the access conflict issue in the GPU-based signal coherent accumulation is still not thoroughly solved. In this section, we mainly discuss the optimization of the access conflict in the GPU method. We have employed three kinds of methods to implement the coherent accumulation of GPU fine-grained parallel. First, our earlier GPU-based simulation articles [20] can achieve a desirable speedup on GPU processing without considering the access conflict in that each range gate places only one target for ideal simulation. In fact, there are many target points gathered in each range gate. There must be a wrong result if there is no sequence control over massive accumulation operations in one range gate. Second, the atomic operation was used to control the order of each thread s accumulation and guarantee the result [21]. The local sequential execution in GPU parallel decreases the parallel efficiency by several times, which brings challenges to the ongoing optimization or additional hardware investment on multi-gpus based SAR raw data simulation. Third, the interpolation-based method [23] was employed to decrease the number of targets in the same range gate, which reduced the number of conflicts. But, it is an inverse processing solution that uses the memory space to gain the computing efficiency. The estimation of interpolation factor is not always optimal. Thus, we try to discuss the forward processing solution of this issue. The most straightforward solution is the classical reduction algorithm. For regular accumulation, the computational complexity reduces from O(n) to O(log 2 (n)) considering the reduction algorithm. So, the reduction algorithm will boost the accumulation compared with the serial execution by atomic operation. 
the total number and distribution of targets in the same range gate are unknown, they need to be calculated in real time by each GPU thread. Only when the object data and the object count are obtained can the reduction algorithm show its advantages in computing efficiency. Therefore, the initial idea is to divide the original GPU-based coherent accumulation into three parts, namely the calculation of the range gate and signal phase, the regularization of the irregular accumulation, and the parallel reduction on the GPU, as shown in Algorithm 2.

Algorithm 2: Coherent accumulation: GPU version with irregular reduction
1: for each n ∈ [0, T−1] do
2:   Multiple GPUs: each GPU separately simulates the raw data of one azimuth time t_n
3:   for each GPU thread i ∈ [0, M−1] do
4:     GPU thread: each thread finishes the return signal computation and accumulation of target i
5:     r_i(t_n) ← compute the range of target i
6:     fp_i(t_n) ← determine whether target i is in the footprint
7:     rg_i(t_n) ← compute the range gate of target i
8:     θ_i(t_n) ← compute the return signal phase of target i
9:     /* Regularization of the irregular accumulation */
10:    atomicAdd(index_i, 1) ← compute the sequence number in the accumulation queue of the same range gate using an integer atomic add
11:    S_redu[index_i] = σ_i exp(θ_i(t_n)) ← move the signal result to the reduction space
12:   end for
13:   /* Regular reduction calculation */
14:   S_redu[0] = Reduction(S_redu) ← implement the GPU-based regular reduction on the reduction space
15:   s_a[rg_i] = S_redu[0] ← write the result back to the raw data space
16: end for

Through these three steps, the irregular accumulation is changed into a regular reduction issue. The core of the improvement is to construct a reduction space, which is used to store, in order, the return signals belonging to the same range gate. Then, the fast parallel reduction is executed on the reduction space and yields the final accumulation result, as shown in Fig. 7. Note that an integer atomic addition is applied to keep the reduction space storage in order. From the viewpoint of the GPU program, the reduction is basically not time consuming after the seven optimization steps published by the Nvidia Corporation, and two floating-point atomic operations are replaced by one integer atomic operation. Theoretically, a 2× speedup is expected with the proposed reduction optimization.
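A minimal CUDA sketch of the regularization and per-gate reduction in Algorithm 2 is given below. The fixed per-gate capacity MAX_PER_GATE, the kernel names, and the precomputed range-gate and phase inputs are assumptions made for the sketch; the second kernel is a plain per-gate summation standing in for the optimized parallel reduction discussed in the text.

#include <cuda_runtime.h>

#define MAX_PER_GATE 256   // assumed capacity of the reduction space per range gate

// Phase 1: one thread per target. Each thread claims an ordered slot in its
// range gate's queue with a single integer atomicAdd, then stores its complex
// return there, so no floating-point atomics are needed.
__global__ void regularizeKernel(const int*   __restrict__ rangeGate,  // rg_i
                                 const float* __restrict__ sigma,      // sigma_i
                                 const float* __restrict__ phase,      // theta_i(t_n)
                                 float2* reduSpace, int* gateCount,
                                 int M, int numGates) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= M) return;
    int gate = rangeGate[i];
    if (gate < 0 || gate >= numGates) return;          // target outside the swath
    int slot = atomicAdd(&gateCount[gate], 1);         // sequence number in this gate
    if (slot >= MAX_PER_GATE) return;                  // capacity guard for the sketch
    float2 v = make_float2(sigma[i] * cosf(phase[i]),  // sigma_i * exp(j*theta_i)
                           sigma[i] * sinf(phase[i]));
    reduSpace[gate * MAX_PER_GATE + slot] = v;         // ordered, conflict-free store
}

// Phase 2: one block per range gate sums its queue and accumulates into s_a.
// (A real implementation would use the optimized tree reduction instead of
// this serial loop; the loop keeps the sketch short.)
__global__ void reduceGatesKernel(const float2* __restrict__ reduSpace,
                                  const int*    __restrict__ gateCount,
                                  float2* sa, int numGates) {
    int gate = blockIdx.x;
    if (gate >= numGates || threadIdx.x != 0) return;
    float2 acc = make_float2(0.f, 0.f);
    int cnt = min(gateCount[gate], MAX_PER_GATE);
    for (int s = 0; s < cnt; ++s) {
        acc.x += reduSpace[gate * MAX_PER_GATE + s].x;
        acc.y += reduSpace[gate * MAX_PER_GATE + s].y;
    }
    sa[gate].x += acc.x;
    sa[gate].y += acc.y;
}

Compared with accumulating directly into s_a through floating-point atomics, the stores into reduSpace are conflict free and only one integer atomicAdd per target remains, which matches the reasoning behind the roughly 2× improvement estimated above.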

TABLE I
SIMULATION PARAMETERS
Parameters            Value
Wave length           0.05 m
PRF                   2000 Hz
Pulse width           30 μs
Band width            50 MHz
Sampling rate         60 MHz
Velocity              7000 m/s
Azimuth beam width
Center distance       km

TABLE II
HARDWARE SPECIFICATIONS
Parameters                    Value
Number of CUDA cores
GPU float performance         1.5 TFlops × 4
Total dedicated GPU memory    4 GB × 4
GPU memory bandwidth          288 GB/s
Number of CPU cores           12 × 2

Fig. 7. Access conflict optimization based on irregular reduction.

Fig. 8. DEM data (up) and its 3-D visualization (down) of the simulated scene.

IV. EXPERIMENTAL RESULTS

The deep multiple CPU/GPU collaborative computing based SAR raw data simulation includes three improvements, namely the multicore AVX parallelism on CPUs, the multiple CPU/GPU collaborative simulation framework, and the irregular reduction based access conflict elimination on the GPU. Through deep collaborative computing and optimized GPU simulation, the overall efficiency should be improved. Five categories of SAR raw data simulation experiments are designed to discuss the acceleration performance of the multicore AVX, the multiple CPU/GPU collaborative simulation, the irregular reduction optimization on the GPU, the impact of introducing errors into the simulation, and the accuracy of the proposed method. In order to evaluate the results, four groups of experiments with different scene sizes are considered, which are obtained by a bilinear interpolation of the experimental scene. The employed DEM scene is represented in a slant range coordinate system, and its maximum height is 1200 m, as shown in Fig. 8.

Regarding the hardware environment, two Intel Xeon E5 CPUs, providing 24 threads, and two NVIDIA Tesla K10 GPUs, containing four GK104 GPUs, are used in the experiments; the simulation parameters and hardware specifications are listed in Tables I and II. The software environment consists of four components. Specifically, the operating system is Red Hat Linux 6.5, in which the Intel C++ Composer XE 2015 is employed as the compiler, the Intel MKL library is used for the FFT processing of the CPU code, and CUDA 6.5 is selected to drive the GPU parallel computing. Furthermore, OpenMP is employed for the 24-thread parallel processing, AVX is exploited for the SIMD vector extension, and CUDA is used for the four-GPU parallel computing. Specifically, in the collaborative computing mode, there are 20 threads working on the CPU parallel simulation and 4 threads controlling the four GPUs. Note that the sequential single-core CPU method is employed as the baseline program for the speedup test. The running time of the CPU code considers only the core raw data simulation, while the results of the GPU code take into account the input/output time, namely the GPU memory allocation and the data transfer between the CPU and the GPU, as well as the core GPU simulation time.

A. Performance Analysis on Multicore AVX Acceleration

First, we try to evaluate the performance of the multicore vector extension technology. Four experiments are executed on a single-core CPU, a six-core CPU using OpenMP and AVX, two

CPUs, and a single GPU, respectively, and the results are shown in Table III. From the results, it can be seen that the multicore vectorization parallel based method improves the efficiency of the sequential method by about a 38× speedup per CPU. For one six-core CPU parallelized without AVX vectorization, 12 threads are started up by OpenMP. Then, the speedup should be 10.8 if it is assumed that the parallel efficiency is around 90%. Accelerated by the AVX vectorization, namely processing four data items with one instruction, an extra 4× speedup is expected. As the simulation is not completely vectorized by AVX due to the raw data, the expected result should be less than a 43.2× speedup. The experimental results are therefore reasonable and initially prove that there is at least a 3×-4× additional speedup when introducing AVX to multicore parallel computing. For the fourth experiment, the simulation time is reduced from 7.7 h in sequential mode to 8.5 min in multicore AVX mode. So, the introduction of the multicore and AVX parallelism significantly improves the raw data simulation efficiency.

TABLE III
SIMULATION TIME (s) COMPARISON
Scene size    Serial    1 AVX CPU    2 AVX CPU    1 GPU

Moreover, the performance comparison between AVX and a GPU is the most important topic in this paper, and determines the feasibility of the deep collaborative computing method. There would be no basis for multiple CPU/GPU deep collaborative computing if the computing capability gap between them were too large. Their acceleration performance comparison is illustrated in Table III and Fig. 9. It proves that the two-CPU parallelism with AVX support is basically close to single-GPU computing in terms of performance in all experiments, and can be treated as one GPU for task allocation in collaborative computing. Therefore, the AVX method is a useful complement to the existing multicore CPU parallel computing, and even to collaborative computing.

Fig. 9. The acceleration of the multicore AVX parallel and GPU parallel methods.

B. Performance Analysis on Deep Collaborative Computing Method

The excellent computing ability of the AVX method lays the foundation of the deep multiple CPU/GPU collaborative computing. As the simulation task is not huge enough, the task partition is based on the static strategy, which evenly divides the task into five parts and distributes them among the GPUs and CPUs. Then, they implement their corresponding subtasks, which are finally merged into the whole raw data. In most hardware configurations, one multicore CPU is a more realistic scenario. So, raw data simulation experiments based on four GPUs, four GPUs plus one CPU, and four GPUs plus two CPUs are performed to test the collaborative computing effect, the simulation time, and the speedup, as listed in Table IV. From the results, it can be seen that the overall simulation efficiency is improved by up to 60, and the efficiency of the four-GPU simulation is improved by up to 30%.

TABLE IV
COLLABORATIVE SIMULATION TIME (s) COMPARISON
Scene size    Serial    4 GPU    4 GPU + 1 CPU    4 GPU + 2 CPU

Fig. 10 illustrates the SAR raw data simulation speedups for the aforementioned three configurations. The conclusion is distinct in that the proposed collaborative computing not only improves the existing multi-GPU-based SAR raw data simulation, but also makes full use of hardware computing resources for higher efficiency.

Fig. 10. Performance comparison among three HPC configurations.
Hence, the deep collaborative computing method is feasible and applicable. In order to construct a fair comparison between the CPU and GPU, another simulation experiment is designed by considering three GPUs, four GPUs, and three GPUs + two CPUs. Through this experiment, the computing capacities of each of those combinations can be further exposed. The results are shown in Table V. From the results, it can be seen that the efficiency of two AVX based CPUs in collaborative computing is better than one GPU in that the parallel efficiency of multi-gpus computing is not high enough. The possible reason is related with the hardware design of the K10 GPU, which has two internal GK104

GPUs. If four separate GPUs are employed, such as four K20 GPUs, this parallel efficiency issue will be improved.

TABLE V
SIMULATION TIMES (s) COMPARISON
Scene size    3 GPU    4 GPU    3 GPU + 2 CPU

In the aforementioned calculation of the simulation time, the GPU part takes the memory operations into account, while the CPU part does not. It is necessary to demonstrate the running time of the memory operations in order to further understand these two kinds of parallelism. Although the memory operations include the GPU memory allocation, the data transfer, and the memory release, they take only a small amount of time. Considering the fourth experiment, for example, the data transfer volume mainly includes the positions of the platform and the targets, the scattering coefficients of the targets, and the raw data, and is no more than 1 gigabyte (GB). The bandwidth between the CPU and the GPU is about 8 GB/s, so the theoretical transfer time is below 1 s. The elapsed times of the memory operations and of the pure GPU computing are shown in Table VI. It is shown that the ratio of the data transfer time is below 1% as the data volume increases. Therefore, the data transfer is not a major factor in the GPU-based SAR raw data simulation, and it has a very weak impact on collaborative computing.

TABLE VI
TIME (s) ANALYSIS OF GPU MEMORY OPERATIONS
Scene size    Memory operation    GPU computing

C. Performance Analysis on Irregular Reduction Optimization

As the previous analysis shows, the coherent accumulation is the main time-consuming part of the whole time-domain raw data simulation. The returned signal coherent accumulation costs about 80% of the total simulation time in the GPU-based simulation. The main reason is the irregularity of the coherent accumulation, which is realized by sequential atomic operations and slows down the overall simulation efficiency. Therefore, optimizing the sequential accumulation with atomic operations is effective and feasible.

To deeply analyze the optimization of the irregular reduction algorithm, we deploy a raw data experiment under the condition of a single GPU, and attempt to understand the running time of the atomic operations and of the irregular reduction optimization. In order to enlarge the effect of the coherent accumulation, we simulate more target points in one range gate (results shown in Fig. 11 and Table VII), where the classical GPU and the improved GPU indicate the coherent accumulation with atomic operations and with the irregular reduction, respectively.

TABLE VII
ACCESS CONFLICT OPTIMIZATION SPEEDUP WITH RESPECT TO SEQUENTIAL METHOD
Scene size    Classical GPU    Improved GPU    Acceleration

Fig. 11. The acceleration analysis of the irregular reduction algorithm.

From the results, it can be seen that the running time of a single GPU increases with the number of coherently accumulated targets. Also, it is clear that the irregular reduction optimization achieves around a 20% speedup, which is in accord with our theoretical analysis. After this optimization, the raw data simulation based on multiple CPU/GPU collaborative computing will reach a higher running efficiency.
Strengthened by HPC technologies, the time-domain algorithm is capable of simulating more scatterers and even an extended scene. Nevertheless, its simulation efficiency is still less than the frequency-domain algorithm. The key advantage of the time-domain algorithm is to easily take into account the systematic errors and motion error, which are constant in each azimuth time. Hence, these errors can be calculated in advance, and are introduced into the raw data simulation by simple additions or multiplications in each azimuth time, which will not slow down the efficiency too much. Furthermore, in raw data simulation for a practical mission, the channel errors of transceiver hardware always come from the real data, which should be transferred to memory for simulation computing. Compared with the theoretical simulation, the increased calculation time of operative condition simulation is mainly reflected in the data transfer and arithmetic operations. Therefore, two experiments considering the trajectory deviation and system channel error are designed to analyze their impact on the simulation efficiency, and the results are shown in

Table VIII. As these errors are calculated in advance, different error models have the same introduction process, namely the error transfer and the error additions. Hence, two simplified models are employed to generate these errors. For the space-borne SAR trajectory deviation simulation, fixed deviations of 0.1 m on the three axes are directly added to the coordinates of the platform. As for the channel error simulation, random phase errors are directly added to the phase of each accumulated return signal. Compared with the ideal simulation, the operative condition simulation increases the running time only slightly, which proves its merit in accounting for the space-varying and time-varying errors.

TABLE VIII
SIMULATION TIME (s) COMPARISON BETWEEN IDEAL AND OPERATIVE CONDITION SIMULATION
Scene size    Trajectory Deviation    Channel Error    Ideal

E. Accuracy Analysis on Deep Collaborative Computing Method

After discussing the acceleration of the deep collaborative computing method, an accuracy analysis of the method is given in this section. An experiment with a point target embedded into the scene is designed to assess the simulation accuracy for both the point target and the extended scene. We first calculate the scattering coefficient map by considering the SAR geometry and speckle noise, then fill the upper-left area with zeros, and put one point target in the middle of this area. This experiment includes three steps, namely the collaborative computing based raw data simulation of the designed scene, imaging with the Chirp Scaling algorithm, and image quality assessment. The resolution (Res), expansion coefficient (Coef), peak sidelobe ratio (PSLR), and integrated sidelobe ratio (ISLR) are used for the SAR image assessment, and they also serve as indicators of the accuracy of the deep collaborative computing based raw data simulation.

The imaging result is shown in Fig. 12. We can see that the image result is in accord with the input DEM data, and the point target is also accurately focused. The embedded point target quality assessment is carried out, and the results are listed in Table IX. From these results, we can conclude that the deep collaborative computing approach can guarantee the simulation accuracy, and can be applied to other SAR related computing issues.

Fig. 12. Imaging result of multiple CPU/GPU collaborative simulated raw data for the point target embedded scene.

TABLE IX
SIMULATION ACCURACY FOR EMBEDDED POINT TARGET
            Res    Coef    PSLR    ISLR
Azimuth
Range

Through the above discussion, we summarize that the multicore vector extension based method brings new vitality to the existing CPU parallel hierarchy, and boosts the computing power of a CPU to bring it on par with a GPU. Then, the irregular reduction algorithm is an effective solution for eliminating the memory access conflicts caused by multiple GPU threads waiting, and it greatly improves the parallel efficiency on a GPU. Finally, the deep collaborative computing of multiple hardware resources can obviously shorten the simulation period, and should be applied to practical applications, such as SAR system parameter design and imaging algorithm verification for a new operating mode. Therefore, the deep multiple CPU/GPU collaborative computing based SAR raw data simulation is particularly suitable for the case of huge computation, especially large-area raw data simulation.
V. CONCLUSION

In this paper, we exploited multiple CPU/GPU collaborative computing to solve the calculation bottleneck of SAR raw data simulation for a large area. We proposed a multiple SIMD CPU/GPU collaborative parallel based time-domain SAR raw data simulation method. More specifically, three improvements were introduced: the first one was the multicore vector extension method, which not only greatly improved the computing power of a CPU and opened up the possibility of collaborative computing, but also made the CPU work under a SIMD framework similar to that of a GPU and realized the deep collaboration; the second one was the deep collaborative computing framework of multiple computing resources, which improved the simulation efficiency beyond that of the popular multi-GPU simulation method; and the third one was the irregular

12 398 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 10, NO. 2, FEBRUARY 2017 reduction based access conflict elimination, which changed the irregular accumulation issue to a regular issue with regularization processing and achieved a satisfied speedup over the basic GPU method. The experimental results showed that the multicore vectorization method with two CPUs was close to the classical single GPU method, and boosted the singlecore CPU simulation efficiency by over 70 speedup, thus the collaborative computing achieving about 250 speedup. Besides, the irregular reduction method achieved 20% acceleration compared with the classical GPU simulation. Furthermore, the collaborative method improved the simulation efficiency through exploiting the idle CPU computing resource and GPU optimization, having advantages of energy saving and low hardware cost. The proposed method has been verified to be suitable for the airborne/space-borne SAR simulation, and is expected a better application in PolSAR or InSAR as the multidimensional data processing can be more easily partitioned along multiple dimensions and mapped to the deep collaborative computing framework of multiple resources. The future work of this research will apply the deep multiple CPU/GPU collaborative computing method to SAR imaging, and SAR target recognition and classification, and will conduct a preliminary attempt for the real-time processing and application of SAR products. REFERENCES [1] A. Moreira, P. Prats-iraola, M. Younis, G. Krieger, I. Hajnsek, and K. Papathanassiou, A tutorial on synthetic aperture radar, IEEE Geosci. Remote Sens. Mag., vol. 1, no. 1, pp. 6 43, Mar [2] G. Franceschetti, R. Guida, A. Iodice, D. Riccio, and G. Ruello, Efficient simulation of hybrid Stripmap/Spotlight SAR raw signals from extended scenes, IEEE Trans. Geosci. Remote Sens.,vol.42,no.11,pp , Nov [3] G. Franceschetti, M. Migliaccio, and D. Riccio, SAR simulation: An overview, in Proc. IEEE Geosci. Remote Sens. Symp., 1995, pp [4] E. Boerner, R. Lord, J. Mittermayer, and R. Bamler, Evaluation of TerraSAR-X spotlight processing accuracy based on a new spotlight raw data simulator, in Proc. IEEE Geosci. Remote Sens. Symp., 2003, pp [5] A. Mori and F. D. Vita, A time-domain raw signal simulator for interferometric SAR, IEEE Trans. Geosci. Remote Sens., vol. 42, no. 9, pp , Sep [6] G. Franceschetti, M. Migliaccio, D. Riccio, and S. Gilda, SARAS: A synthetic aperture radar (SAR) raw signal simulator, IEEE Trans. Geosci. Remote Sens., vol. 30, no. 1, pp , Jan [7] K. Eldhuset, Raw signal simulation for very high resolution SAR based on polarimetric scattering theory, in Proc. IEEE Geosci. Remote Sens. Symp., 2004, vol. 3, pp [8] X. Qiu, D. Hu, L. Zhou, and C. Ding, A bistatic SAR raw data simulator basedoninverseω-κ algorithm, IEEE Trans. Geosci. Remote Sens., vol. 48, no. 3, pp , Mar [9] A. Khwaja, L. Ferro-Famil, and E. Pottier, SAR raw data generation using inverse SAR image formation algorithms, in Proc. IEEE Geosci. Remote Sens. Symp., 2006, pp [10] B. Deng, Y. Qing, H. Wang, X. Li, and Y. Li, Inverse frequency scaling algorithm (IFSA) for SAR raw data simulation, in Proc. Int. Conf. Signal Process. Syst., 2010, vol. 2, pp [11] G. Franceschetti, M. Migliaccio, and D. Riccio, SAR raw signal simulation of actual ground sites described in terms of sparse input data, IEEE Trans. Geosci. Remote Sens., vol. 32, no. 6, pp , Nov [12] G. Franceschetti, A. Iodice, M. Migliaccio, and D. 

Wei Hu received the B.S. and M.S. degrees in computer science from Dalian University of Science and Technology, Dalian, China, in 1999 and 2002, respectively, and the Ph.D. degree in computer science from Tsinghua University, Beijing, China. He is currently an Associate Professor of computer science at Beijing University of Chemical Technology, Beijing. His research interests include computer graphics, computational photography, and scientific visualization.
Fan Zhang (S'07-M'10) received the B.E. degree in communication engineering from the Civil Aviation University of China, Tianjin, China, in 2002, the M.S. degree in signal and information processing from Beihang University, Beijing, China, in 2005, and the Ph.D. degree in signal and information processing from the Institute of Electronics, Chinese Academy of Sciences, Beijing. He is currently an Associate Professor of electronic and information engineering at Beijing University of Chemical Technology, Beijing. His research interests include synthetic aperture radar signal processing, high performance computing, and scientific visualization. Dr. Zhang has been a Reviewer for the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, and the International Journal of Antennas and Propagation.
Pengbo Wang received the Ph.D. degree in information and communication engineering from Beihang University (Beijing University of Aeronautics and Astronautics, BUAA), Beijing, China. He was a Postdoctoral Researcher with the School of Electronics and Information Engineering, Beihang University, where he was a Lecturer from 2010 to 2015 and has been an Associate Professor since July 2015. He has authored and coauthored more than 40 journal and conference publications, and he holds ten patents in the field of microwave remote sensing. His current research interests include high-resolution space-borne SAR image formation, novel techniques for space-borne SAR systems, and multimodal remote sensing data fusion.
Chen Hu (S'15) received the B.E. degree in electronic and information engineering from Beijing University of Chemical Technology, Beijing, China, in 2013, where he is currently working toward the M.S. degree in the field of high performance computing. His research interests include parallel computing, distributed computing, and image processing.
Wei Li (S'11-M'13) received the B.E. degree in telecommunications engineering from Xidian University, Xi'an, China, in 2007, the M.S. degree in information science and technology from Sun Yat-Sen University, Guangzhou, China, in 2009, and the Ph.D. degree in electrical and computer engineering from Mississippi State University, Starkville, MS, USA. Subsequently, he spent one year as a Postdoctoral Researcher at the University of California, Davis, CA, USA. He is currently with the College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China. His research interests include statistical pattern recognition, hyperspectral image analysis, and data compression. Dr. Li is an active Reviewer for the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, and the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING (JSTARS). He received the 2015 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society for his service to the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING.
Heng-Chao Li (S'06-M'08-SM'14) received the B.Sc. and M.Sc. degrees from Southwest Jiaotong University, Chengdu, China, in 2001 and 2004, respectively, and the Ph.D. degree from the Graduate University of the Chinese Academy of Sciences, Beijing, China, in 2008, all in information and communication engineering. He is currently a Professor with the Sichuan Provincial Key Laboratory of Information Coding and Transmission, Southwest Jiaotong University. Since November 2013, he has been a Visiting Scholar working with Prof. W. J. Emery at the University of Colorado Boulder, Boulder, CO, USA. His research interests include statistical analysis and processing of synthetic aperture radar images, and signal processing in communications. Dr. Li has received several scholarships and awards, including the Special Grade of the Financial Support from the China Postdoctoral Science Foundation in 2009 and the New Century Excellent Talents in University award from the Ministry of Education of China. In addition, he has been a Reviewer for several international journals and conferences, such as the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, the IEEE TRANSACTIONS ON IMAGE PROCESSING, IET Radar, Sonar & Navigation, and the Canadian Journal of Remote Sensing. He is currently serving as an Associate Editor of the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING.
