IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 10, NO. 2, FEBRUARY 2017

A Deep Collaborative Computing Based SAR Raw Data Simulation on Multiple CPU/GPU Platform

Fan Zhang, Member, IEEE, Chen Hu, Student Member, IEEE, Wei Li, Member, IEEE, Wei Hu, Pengbo Wang, and Heng-Chao Li, Senior Member, IEEE

Abstract—The outstanding computing ability of a graphics processing unit (GPU) brings new vitality to typical computing-intensive issues, and so to synthetic aperture radar (SAR) raw data simulation, which is a fundamental problem in SAR system design and imaging research. However, the computing power of a CPU was underestimated, and the tunings for a CPU-based method were missing in previous works. Meanwhile, the collaborative computing of multiple CPUs/GPUs was not exploited thoroughly. In this paper, we propose a deep multiple CPU/GPU collaborative computing framework for time-domain SAR raw data simulation, which not only introduces the advanced vector extension (AVX) method to improve the computing efficiency of a multicore single instruction multiple data (SIMD) CPU, but also achieves a satisfactory speedup in the CPU/GPU collaborative simulation by fine-grained task partitioning and scheduling. In addition, an irregular reduction based SAR coherent accumulation approach is proposed to eliminate the memory access conflict, which is the most difficult issue in GPU-based raw data simulation. Experimental results show that the multicore vector extension method greatly improves the computing power of a CPU-based method, reaching about a 70× speedup and thereby outperforming the single GPU simulation. Correspondingly, compared with the baseline sequential CPU approach, the multiple CPU/GPU collaborative simulation achieves up to a 250× speedup. Furthermore, the irregular reduction based atomic-free optimization boosts the performance of the single GPU method by 20%. These results prove that the deep multiple CPU/GPU collaborative method is promising, especially for the case of huge-volume raw data simulation with a wide swath and high resolution.

Index Terms—Advanced vector extensions (AVX), collaborative simulation, graphics processing unit (GPU), raw data generation, synthetic aperture radar (SAR).

Manuscript received February 21, 2016; revised May 29, 2016 and July 5, 2016; accepted July 21, 2016. Date of publication October 27, 2016; date of current version January 23, 2017. This work was supported in part by the National Natural Science Foundation of China under Grant and Grant , in part by the Beijing Natural Science Foundation under Grant , and in part by the Fundamental Research Funds for the Central Universities under Grant BUCTRC and Grant BUCTRC. (Corresponding author: Fan Zhang.)

F. Zhang, W. Li, and W. Hu are with the College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, China (e-mail: zhangf@mail.buct.edu.cn; leewei36@gmail.com; huwei@mail.buct.edu.cn).

C. Hu is with the College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, China (e-mail: huchen_buct@163.com).

P. Wang is with the School of Electronic and Information Engineering, Beihang University, Beijing, China (e-mail: wangpb7966@163.com).

H.-C. Li is with the Sichuan Provincial Key Laboratory of Information Coding and Transmission, Southwest Jiaotong University, Chengdu, China (e-mail: lihengchao_78@163.com).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSTARS
I. INTRODUCTION

DUE to its day-and-night and weather-independent imaging characteristics, synthetic aperture radar (SAR) has been widely used for Earth remote sensing for more than 30 years, and has come to play a significant role in geographical survey, climate change research, environment and Earth system monitoring, multi-dimensional mapping, and other applications [1]. With the evolution of SAR technologies, SAR information acquisition with high resolution, wide swath, multiple modes, and multiple dimensions has become possible. In the foreseeable future, more advanced multiplatform, multimode, multiband, multipolarization SAR systems will be developed to satisfy the emerging requirements.

Since SAR flight experiments are time consuming and highly expensive, computer simulations are often applied to assist the key technology research, system design, system development, and even the data applications. To quantitatively support the design of a SAR operating in the hybrid mode, to help mission planning, and to test processing algorithms, a SAR raw signal simulator is required, especially when real raw data are not yet available [2]. Therefore, it is highly desirable that accurate SAR raw data can be quickly simulated to support such research for the new generation of SAR sensors.

From the perspective of simulation processes, SAR raw data simulations can be divided mainly into two classes: SAR oriented and SAR processing oriented [3]. The SAR oriented algorithms, including the time-domain algorithm [4], [5], the FFT-based time-domain algorithm [2], and the two-dimensional (2-D) frequency-domain algorithm [6], simulate the physical process of microwave transmission and reception, and then calculate the SAR raw data. Comparatively, the SAR processing oriented algorithms, including the inverse fourth-order extended exact transfer function algorithm [7], the inverse ω-κ algorithm [6], [8], the inverse Chirp Scaling algorithm [9], and the inverse frequency scaling algorithm [10], simulate the SAR raw data based on the inversion of SAR imaging processing. The 2-D frequency-domain algorithm and the inverse processing algorithms are efficient, and are often employed for the scene simulation of stripmap [6], [11], [12], spotlight [13], and hybrid [2] modes, etc. However, the 2-D frequency-domain algorithm has difficulty accounting for the actual system errors. To account for these errors, some assumptions need to be introduced into these algorithms, resulting in application limitations such as narrow beam and slow deviation [14]-[16]. On the other hand, the time-domain algorithm is highly time consuming, but is

able to exactly account for these system errors. Therefore, if we are interested in a raw signal simulation from a geometrically limited target where the platform is liable to mechanical oscillations, orbital deviations, etc., the time-domain approach is a good choice [5]. But, its high computational complexity allows one to deal only with an imaged scene consisting of one or a few point scatterers [4]. In order to extend its application to scene simulation, the optimization of efficiency should be further studied.

For time-domain raw data simulation, the optimization of the algorithm under the precondition of keeping the precision intact does not achieve satisfactory performance. Due to the independence of the return signal gathering, parallelization is the most straightforward idea, and it can significantly shorten the simulation time with state-of-the-art high performance computing (HPC) technologies. In order to boost the efficiency of SAR raw data simulation, two categories of HPC methods have been employed: the CPU oriented method and the graphics processing unit (GPU) oriented method. The CPU oriented method mainly refers to the raw data parallel simulation on a CPU platform, such as open multi-processing (OpenMP) with multiple cores [17], the message passing interface (MPI) with multiple CPUs [18], and grid computing with multiple computers [19]. The accelerating effects of these methods are proportional to the number of CPUs, which induces a high parallel simulation cost. A GPU oriented method realizes the parallel simulation with massive cores on a GPU platform [20]-[23]. Compared with the CPU method, the GPU method achieves an acceleration of a dozen to several hundred times, and proves to be a high-efficiency, low-cost solution for raw data simulation. In our previous work [23], the parallel simulation on one GPU achieves around a 100× speedup over the CPU-based sequential algorithm.

However, the CPU, as one kind of computing resource, was ignored in the GPU-based SAR raw data simulation. Typically, the CPU cores remain idle while the GPU cores are busy computing. Heterogeneous CPU/GPU computing seems to be an optimal solution to further improve the simulation efficiency, in that there is a growing trend toward the institutional use of multiple computing resources (usually heterogeneous) as a sole computing resource [24]. The heterogeneous simulation is implemented with hybrid multicore CPU parallelism and massive-core parallelism on a GPU, temporarily called Collaborative Computing 1.0 (CC 1.0) [25]. Although CPU/GPU collaborative computing has been employed in some computing-intensive cases [26]-[30], it is not applicable to SAR raw data simulation due to the broad gap in performance between the CPU and GPU methods. Even if the CPU is exploited to assist the GPU computing, the effect is like a drop in a bucket.

Recently, the single instruction multiple data (SIMD) extension instructions, namely the streaming SIMD extensions (SSE) and the advanced vector extensions (AVX), have been introduced to further exploit the computing capability of a CPU. The SSE instruction method has been applied to classical topics, such as the fast Fourier transform (FFT) [31], the finite-difference time-domain (FDTD) method [32], image processing [33], [34], and graphics applications [35], and achieved 4× to 73× speedups with multicore SIMD CPUs. Even with only the SSE instructions, the acceleration is almost of the same level as that of GPU parallel computing.
More speedups can be expected if the SSE is expanded to AVX instructions. On the other hand, Hwu believes that the multicore CPU SIMD parallelism is an important trend in parallel computing, and can be effectively covered by multicore architectures [36]. Inspired by this research, we try to develop a deep collaborative computing method to make the AVX-based multicore CPU and the compute unified device architecture (CUDA) based massive core GPU work together to simulate the SAR raw data faster, called Collaborative Computing 2.0 (CC 2.0). To make CC 2.0 more efficient, a deep optimization of the GPU part is absolutely required. As the main body of collaborative computing, the GPU method has still an irregular parallel issue unresolved, namely the coherent accumulation of return signals, which leads to severe memory conflicts and reduces the simulation efficiency. The irregularities of signal accumulation are as follows. For each sampling unit, the distribution and number of target points that contribute to the return signals are changed with the movement of the flight platform, and can only be determined during the calculation of each azimuth sampling time. If these parameters are known and predictable, this coherent accumulation can be simplified to a generalized reduction issue, which is easy to parallelize. However, the coherent accumulation of return signals is an irregular reduction problem, in fact. As for GPU parallel computing, this kind of irregular and unpredictable application is difficult to achieve in high parallelization. In recent years, the GPU parallel algorithm for irregular issues has attracted a lot of attention [37] [40]. Although the specific irregularity of SAR coherent accumulation is different from those discussed in the aforementioned literatures, the idea of irregular reduction based parallel accumulation is feasible. Therefore, the GPU-based irregular reduction algorithm should be restudied to efficiently simulate the massive accumulation of return signals. This is for the first time that collaborative computing is applied to the SAR raw data simulation, which not only deeply exploits the computer resources to accelerate simulation, but also lays the foundation for the next-step implementation of distributed computing and cloud computing. Therefore, we propose a multicore SIMD CPU and multi-gpu deep collaborative computing based SAR raw data simulation method. Compared to our previous works [19] [23], we make the following contributions: 1) We introduce the AVX method to accelerate the multicore CPU-based SAR raw data simulation, and narrow the efficiency gap between CPU and GPU simulation. 2) We propose a multiple CPU/GPU deep collaborative computing framework that exploits the computing resources as much as possible and offers the CPU to handle more computing tasks like a GPU, to improve the whole efficiency of time-domain SAR raw data simulation. 3) We propose an irregular reduction mechanism to eliminate the memory access conflict in the classical GPU and multi- GPUs based raw data simulation, and further improve the computing efficiency. The rest of this paper is organized as follows. Section II briefly introduces the FFT-based time-domain SAR raw data simulation algorithm and the parallel computing mechanism. Section III presents the proposed multiple CPU/GPU deep collaborative computing based raw data simulation algorithm. Then, the

experimental results and optimization analysis are discussed in Section IV. Finally, conclusions are drawn in Section V.

II. TIME-DOMAIN SAR RAW DATA SIMULATION ALGORITHM AND ITS PARALLELIZATION

A. SAR Raw Data Simulation Algorithm

The echo signal model [41] is applicable to both airborne and space-borne SAR data. Assuming that the transmitted signal is a linear frequency modulated pulse, it is given by

s_t(\tau) = s_r(\tau)\exp(j w_c \tau) = \mathrm{rect}\left(\frac{\tau}{T_p}\right)\exp\left(j w_c \tau + j\pi k_r \tau^2\right)   (1)

where s_r(\tau) is the chirp signal, w_c is the carrier frequency, \tau is the range time, T_p is the signal pulse width, k_r is the chirp rate, and \mathrm{rect}(\cdot) is the range window of the pulse. Through the heterodyne receiver, the single-point echo in the SAR stripmap mode is expressed as the 2-D signal s_i(t, \tau)

s_i(t, \tau) = \sum_{n=0}^{T} \sigma_i W_a(\theta_i)\exp\left(-j\frac{4\pi r_i(t_n)}{\lambda}\right) s_r\left(\tau - \frac{2 r_i(t_n)}{c}\right)   (2)

where i is the order number of a single point in the target matrix, n is the order number in azimuth time, T is the number of azimuth samplings, t is the azimuth time, \sigma is the scattering coefficient, W_a is the azimuth antenna pattern, \theta_i is the angle between the azimuth phase center and target i, T_p is the signal pulse width, r_i(t_n) is the distance between target point i and the radar antenna phase center at time t_n, \lambda is the wavelength, and c is the speed of light. When the simulation objects are distributed targets, the discrete representation of the SAR echo signal can be obtained by

s(t, \tau) = \sum_{i=0}^{M} s_i(t, \tau)   (3)

where M is the total number of target points.

In practical engineering calculations, the FFT-based time-domain method, which calculates the accumulation over scattering target points by frequency-domain multiplication, is often applied for raw data generation with

s(t, \tau) = \sum_{n=0}^{T} s_a(t_n, \tau) * s_r(\tau) = \sum_{n=0}^{T} f^{-1}\left\{ f\left[ s_a(t_n, \tau) \right] \cdot S_r(\xi) \right\}   (4)

s_a(t_n, \tau) = \sum_{i=0}^{M} \sigma_i W_a(\theta_i)\exp\left(-j\frac{4\pi r_i(t_n)}{\lambda}\right)\delta\left(\tau - \frac{2 r_i(t_n)}{c}\right)   (5)

where f(\cdot) is the Fourier transform operator, f^{-1}(\cdot) is the inverse Fourier transform operator, S_r(\xi) is the linear FM signal spectrum, \delta(\cdot) is the Dirac delta function, and * indicates the convolution operator. In the procedure of simulation, the linear FM signal spectrum S_r(\xi) does not change, while the azimuth signal spectrum changes with different scattering points and azimuth times.

Fig. 1. The FFT-based time-domain SAR raw data simulation diagram.

According to (4) and Fig. 1, the raw data simulation algorithm includes the following five steps:

Step 1: A one-dimensional Fourier transform of s_r(\tau) is performed to yield S_r(\xi). Assuming that there is no moving target in the simulated scene, the constant scattering map \sigma and the azimuth antenna pattern W_a are loaded in advance.

Step 2: The geometry is calculated to obtain the range r and angle \theta. Furthermore, the targets within the footprint are selected by determining whether their \theta_i are smaller than half of the azimuth beam width. Then, s_a(t_n, \tau) is calculated and transformed into the frequency domain to generate S_a(t_n, \xi).

Step 3: The multiplication of S_r(\xi) and S_a(t_n, \xi) is implemented.

Step 4: The raw signal is obtained by the inverse Fourier transform of the result of Step 3.

Step 5: For all the azimuth sampling times, Steps 2-4 are repeated to obtain the complete simulated raw data.
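To make the five steps above concrete, the following host-side C++ sketch walks through one azimuth sampling time of the FFT-based time-domain simulation. It is a minimal illustration under assumed names: the Target structure, the geometry inputs, and the naive O(N^2) DFT helper are ours (an optimized FFT library and the azimuth antenna pattern weighting would be used in practice), and nothing here is taken verbatim from the paper's implementation.

// Minimal sketch of Steps 1-5 for a single azimuth time t_n.
#include <cmath>
#include <complex>
#include <vector>

using cpx = std::complex<float>;
static const float PI = 3.14159265358979f;

// Naive DFT used only for illustration; an FFT library replaces this in practice.
std::vector<cpx> dft(const std::vector<cpx>& x, bool inverse) {
    const size_t N = x.size();
    std::vector<cpx> X(N, cpx(0.f, 0.f));
    const float sign = inverse ? 1.f : -1.f;
    for (size_t k = 0; k < N; ++k) {
        for (size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.f, sign * 2.f * PI * k * n / N);
        if (inverse) X[k] /= static_cast<float>(N);
    }
    return X;
}

struct Target { float x, y, z, sigma; };   // hypothetical scene description

// Simulate the raw signal of one pulse (one azimuth time t_n).
std::vector<cpx> simulatePulse(const std::vector<Target>& scene,
                               const std::vector<cpx>& Sr,   // Step 1: spectrum of s_r
                               float px, float py, float pz, // antenna phase centre at t_n
                               float lambda, float c, float fs,
                               float r0, float halfBeam) {
    const size_t N = Sr.size();               // number of range gates
    std::vector<cpx> sa(N, cpx(0.f, 0.f));    // azimuth signal s_a(t_n, tau)

    // Step 2: geometry, footprint selection, and coherent accumulation.
    for (const Target& t : scene) {
        float dx = t.x - px, dy = t.y - py, dz = t.z - pz;
        float r = std::sqrt(dx * dx + dy * dy + dz * dz);
        float theta = std::atan2(dx, dy);               // crude squint-angle proxy
        if (std::fabs(theta) > halfBeam) continue;       // outside the footprint
        float g = (2.f * (r - r0) / c) * fs;             // fractional range-gate index
        if (g < 0.f || g >= static_cast<float>(N)) continue;
        float phase = -4.f * PI * r / lambda;
        sa[static_cast<size_t>(g)] += t.sigma * std::polar(1.f, phase);  // irregular accumulation
    }

    // Steps 3-4: multiply the spectra and transform back to the time domain.
    std::vector<cpx> Sa = dft(sa, false);
    for (size_t k = 0; k < N; ++k) Sa[k] *= Sr[k];
    return dft(Sa, true);                                // raw signal of this pulse
}
// Step 5 is the outer loop over all azimuth times (omitted here).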
B. Parallelization of Raw Data Simulation

According to the stop-and-go model, SAR raw data simulation is a serial process in time, and the coupling between transmitting and receiving pulses at different times is small. Therefore, we can take the transmitting and receiving of pulses as the task unit, which can be dispatched to every computation node and calculated quickly by MPI, grid computing, or other parallel technologies. The parallelism of SAR raw data simulation can be divided into a coarse-grained strategy and a fine-grained one, as shown in Fig. 1. The traditional parallel approach belongs to the former, which takes the repetitive transmitting and receiving pulse process as one task. The task assignment is completed by dispatching a reasonable number of simulated pulses to

different nodes, CPUs, or CPU cores, given by

s(t, \tau) = \sum_{k=0}^{m} D_k = \sum_{k=0}^{m} \sum_{n=t_k}^{T_{k+1}} s(t_n, \tau)   (6)

in which D_k represents the calculation task of node k and m indicates the number of subtasks.

Comparatively, a parallel simulation based on a GPU is a fine-grained parallel method, which optimizes the most time-consuming step. The task of every thread is the azimuth signal calculation of a single scattering point and the multiplication of a single sampling point (as shown in Fig. 1), given by

s(t, \tau) = f^{-1}\left\{ f\left[ \sum_{n=0}^{T} \sum_{i=1}^{M} D_{(n,i)} \right] \cdot S(\xi) \right\} = \sum_{n=0}^{T} f^{-1}\left\{ \sum_{j=0}^{N} D_{(n,j)} \right\}   (7)

where D_{(n,i)} is the azimuth signal of point i at time t_n, D_{(n,j)} is the spectrum product of the linear FM signal and the azimuth signal at range gate j at time t_n, and N is the number of range gates. With parallel task decomposition from the coarse-grained D_k to the fine-grained D_{(n,i)} and D_{(n,j)}, a higher efficiency of the parallel simulation is achieved. The SAR raw data fine-grained parallel simulation with CUDA not only accords with the physical process of echo generation, but also takes full advantage of GPU hardware resources and computing power. The method with CUDA is very suitable for large-area SAR raw data simulation.

However, there is an obvious issue in fine-grained parallel partitioning, namely the memory access conflict. At each flight sampling time, the signals from different target points at a similar distance will be collected in the same sampling unit, which is also called the range gate. The return signals of different targets need to be accumulated in the same memory cell if their range differences do not exceed the length of the range gate. Hence, there exists a massive memory access conflict. For the coarse-grained parallel simulation, its parallel task D_k includes the return signal simulation of all targets, and is completed by one thread. Consequently, there is no access conflict in the accumulation of a single-threaded execution. But, for the fine-grained parallel simulation, the parallel task D_{(n,i)} includes only one target's return signal simulation. The accumulation requires the participation of all threads, yielding access conflicts. To avoid such problems, the method of a thread synchronization lock has been considered, such as the atomic operation of CUDA. The essence of the atomic operation is to ensure that a single thread accesses the resource while the other threads are left in a waiting state. So, the parallel computing efficiency is reduced by several times.

The reduction algorithm is an optimal solution for huge data accumulation. The classical GPU-based reduction algorithm has been well optimized through different strategies, and achieves around a 30× speedup. But, the return signal accumulation has its irregularities, including the uncertainty of the accumulation location and count, as shown in Fig. 2. These irregularities lead to difficulties in applying the classical reduction algorithm.

Fig. 2. The irregularities of SAR coherent accumulation.

III. MULTIPLE CPU/GPU DEEP COLLABORATIVE BASED SAR RAW DATA SIMULATION

With the rapid development of HPC devices, heterogeneous and collaborative computing are the future trends in solving computing-intensive problems, especially the SAR raw data simulation of large areas.
Our previous work focuses only on using one category of computing device to accelerate the simulation, and actually needs to be expanded to multiple computing resources for higher efficiency simulation. Therefore, we mainly discuss the multiple CPU/GPU deep collaborative simulation framework, the improved multicore CPU parallel algorithm, and the improved GPU parallel algorithm, and try to improve the stand-alone computing power for each device, thus boosting the time-domain SAR raw data simulation with collaborative computing. A. Multiple CPU/GPU Deep Collaborative Simulation Framework Due to the gap in the capacity of computing between a CPU and a GPU, the CPU takes on only auxiliary tasks in the traditional CPU/GPU collaborative computing. In this sense, this kind of collaborative computing is actually a GPU-based approach in that the main computing tasks are performed by massive GPU threads. The concept of deep collaborative means that the CPU is not only seen as a manager, but also treated as a worker sharing the tasks of GPU computing. To achieve the deep collaborative computing, the computing capacity of a CPU should be strengthened by OpenMP-based multicore parallel and AVX-based vectorization parallel. After minimizing the gap between a CPU and a GPU in terms of computing power, they can be seen as unified computing devices to undertake the simulation tasks. The whole deep collaborative computing based SAR raw data simulation is shown in Fig. 3. In the deep collaborative simulation framework, there are three main processing steps, namely task partitioning and scheduling on a CPU, collaborative computing on multiple CPUs/GPUs, and data merging on a CPU. Due to the rigorous independence of returned signal simulation in different azimuth times, it can be taken as a task unit to be distributed among different computing resources. As shown in Fig. 3, the whole azimuth time, namely the whole simulation period, is divided

into five parts, which are dispatched to four GPUs and multiple CPUs. Then, the GPUs and CPUs execute their simulation tasks simultaneously. In the GPU simulation part, CUDA and its FFT library are employed to realize the raw signal simulation. In the CPU simulation part, OpenMP is used for multicore task scheduling and implementation, and AVX is applied in each CPU core to perform the numerical operations and FFTs in vectorized mode. The results of the CPUs are stored in place, and the results of the GPUs need to be transferred to CPU memory. After merging the GPU results with the CPU results, the whole simulated SAR raw data are finally obtained.

Fig. 3. Multiple CPU/GPU deep collaborative computing for raw data simulation.

B. Multiple Task Partitioning and Scheduling

According to the parallelization analysis of raw data simulation, the coarse-grained parallelism and the fine-grained parallelism are employed by the CPU method and the GPU method, respectively. In our previous work on multi-GPU-based raw data simulation, the hybrid scheduling of coarse-grained and fine-grained parallelism was introduced to finish the whole simulation. Specifically, the coarse-grained tasks D_k are distributed to different GPU cards, and the fine-grained tasks D_{(n,i)} are assigned to different GPU threads. As for the multiple CPU/GPU collaboration, the AVX-based multicore CPU and the GPU have almost the same computing power. Therefore, the hybrid scheduling strategy can be exploited here, which treats the AVX-based CPU as another GPU card for task partitioning and scheduling.

The task partitioning and scheduling strategies need to consider not only the limitation of memory size, but also the SAR principle, which indicates that the synthetic aperture effect should be considered in task partitioning, as shown in Fig. 4, where Na and Nr are the sizes of the raw data in the azimuth and range directions, T is the execution time of each subtask, and La is the synthetic aperture length. Under different simulation conditions, the target dataset can be equally divided into n parts along the flight direction. The subdata are distributed to the various CPUs and GPUs simultaneously, and then redistributed after the calculation. According to the SAR principle, a simulated area of Na/n in width can generate raw data of Na/n + La in width. So, there is an overlap of La in width between different calculation results (as illustrated in Fig. 4), which should be additionally considered in signal merging. On the other hand, we can also equally divide the task in the raw signal domain and make the overlap of La in width between different target data in the target domain. Although this partitioning strategy is a bit more complicated, the data merging is relatively simple.

Fig. 4. Task partitioning strategy.

On the basis of the task partitioning strategy, the scheduling strategy should be discussed specifically, because scheduling among identical computing devices differs from scheduling among different computing devices. The traditional multi-GPU-based method is a combination of serial processing on the CPU and parallel processing on the GPU. The serial code functions include the input operation, the task partitioning and scheduling, and the data transfer between the CPU and the GPU. According to the distributed tasks, the multiple GPUs simulate the SAR raw signals in a massive parallel mode.
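To make the azimuth-block partitioning with the La overlap described above more concrete, the following host-side C++ sketch splits the azimuth extent into n coarse-grained subtasks and records the overlap that the merging step must handle. The AzimuthBlock structure, the function name, and the example numbers are illustrative assumptions, not taken from the paper.

#include <cstdio>
#include <vector>

// One coarse-grained subtask D_k: a slice of the azimuth extent plus its overlap.
struct AzimuthBlock {
    int firstLine;   // first azimuth line of the scene slice
    int numLines;    // Na / n lines of scene data assigned to this device
    int overlap;     // La extra lines of raw data shared with the neighbouring block
};

// Split Na azimuth lines into n blocks. Each block produces raw data that is
// (Na/n + La) lines wide, so neighbouring results overlap by La lines and the
// overlapping regions have to be summed coherently when the raw data are merged.
std::vector<AzimuthBlock> partitionAzimuth(int Na, int La, int n) {
    std::vector<AzimuthBlock> blocks;
    int width = Na / n;                 // assume Na is divisible by n for brevity
    for (int k = 0; k < n; ++k)
        blocks.push_back({k * width, width, La});
    return blocks;
}

int main() {
    // Example: 4 GPUs plus 1 AVX CPU treated as a fifth "card", Na = 8000, La = 200.
    for (const AzimuthBlock& b : partitionAzimuth(8000, 200, 5))
        std::printf("block at line %d: %d lines + %d overlap lines\n",
                    b.firstLine, b.numLines, b.overlap);
    return 0;
}

Whether the overlap is attached in the raw signal domain or in the target domain, as discussed above, only changes which side of the split carries the extra La lines; the bookkeeping is otherwise the same.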
Since the computing capacity of GPU cards is similar or computable, the static allocation strategy, namely the one-time task allocation, combined with a parallel pipeline for hiding data transfer, proves to be feasible and efficient. But, the difference between CPU and GPU computing is hard to estimate in multiple CPU/GPU collaborative computing. The static scheduling

strategy will lead to idle computing resources and reduce the collaborative efficiency. Therefore, the task should be divided into smaller parts to gradually balance the CPU part and the GPU part. The first few blocks are employed to test the elapsed time difference between CPU and GPU computing. After several rounds of dynamic task adjustment, the CPUs and GPUs are fully occupied, and the optimal collaborative efficiency can be expected, as shown in Fig. 5. In addition, in the calculation of the running time, only the parallel simulation time is considered for the CPU part because of its in-place operation. But, for the GPU part, in addition to the GPU computing, the data transfer and data merging are taken into account.

Fig. 5. Task dynamic scheduling strategy.

C. Multicore CPU Vectorization Parallel Based Raw Data Simulation

Compared to the wide application of GPUs, the computing power of a CPU is usually underestimated. Furthermore, general CPU parallel applications are still stuck in the old way of multicore parallelism through OpenMP and multi-CPU parallelism through MPI. In order to further exploit the CPU parallel capability, the optimal solution should be the SIMD computing model, which is implemented by the SSE and AVX instructions. The SSE instructions can perform four basic operations with only one instruction, that is, SSE gives us an extra 4× speedup on the basis of the multicore acceleration, as shown in Fig. 6. Furthermore, the AVX technologies, supported by the new-generation CPUs of Intel and AMD introduced in 2011 and after, expand the SIMD register from 128 to 256 bits, which means one AVX instruction can operate on eight single-precision floating-point values. Therefore, with the combination of OpenMP-based multicore parallelism and AVX-based vector extension, the SAR raw data simulation can expect a GPU-level speedup by applying AVX on a multicore CPU.

Fig. 6. AVX parallel based signal coherent accumulation simulation.

The improved CPU approach includes two levels of parallelism: coarse-grained parallelism across cores and fine-grained parallelism through vectorization. For the multicore parallelism, the task unit is the whole returned signal simulation of one azimuth time, which is a coarse-grained parallelism. For the vectorization parallelism, the task unit is a numerical operation in the process of simulation, which is a further refinement on top of the multicore parallelism.

Algorithm 1: SAR raw data simulation: AVX-based multicore CPU version
1: for each n ∈ [0, T−1] do
2:   Multicore: each thread separately simulates the raw data of one azimuth time t_n
3:   for each i ∈ [0, M−1], i += 4 do
4:     SIMD: the superscript 4 indicates a vector operation on four items under one AVX instruction
5:     r_i^4(t_n) ← compute the range of each of the four targets
6:     fp_i^4(t_n) ← determine whether the four targets are in the footprint
7:     rg_i^4(t_n) ← compute the range gate of each of the four targets
8:     θ_i^4(t_n) ← compute the return signal phase of each of the four targets
9:   end for
10:  for each i ∈ [0, M−1], i += 1 do
11:    Serial code: implement the irregular coherent accumulation
12:    s_a[rg_i(t_n)] += σ_i exp(θ_i(t_n))
13:  end for
14:  SIMD: the following executions use AVX
15:  f(s_a) ← perform the 1-D Fourier transform
16:  f(s_a) · S_r ← compute the multiplication of the two frequency-domain signals
17:  f^{-1}{f(s_a) · S_r} ← perform the 1-D inverse Fourier transform
18: end for
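As a rough illustration of how the vectorized part of Algorithm 1 (lines 3-8) might be written with AVX intrinsics, the sketch below computes the slant range and return-signal phase of eight single-precision targets per instruction (the listing above steps by four). The structure-of-arrays layout, the function name, and the padding assumption are ours, not the paper's; the footprint test, range-gate computation, and the irregular accumulation remain scalar, as in lines 10-13.

#include <immintrin.h>   // AVX intrinsics (compile with -mavx)

// Vectorized range and phase computation for one azimuth time t_n.
// tx/ty/tz: target coordinates in a structure-of-arrays layout (assumed here);
// px/py/pz: antenna phase centre at t_n; M is assumed padded to a multiple of 8.
void rangeAndPhaseAVX(const float* tx, const float* ty, const float* tz,
                      float px, float py, float pz, float lambda,
                      float* range, float* phase, int M) {
    const __m256 vpx = _mm256_set1_ps(px);
    const __m256 vpy = _mm256_set1_ps(py);
    const __m256 vpz = _mm256_set1_ps(pz);
    const __m256 vk  = _mm256_set1_ps(-4.0f * 3.14159265f / lambda);

    for (int i = 0; i < M; i += 8) {        // eight targets per AVX instruction
        __m256 dx = _mm256_sub_ps(_mm256_loadu_ps(tx + i), vpx);
        __m256 dy = _mm256_sub_ps(_mm256_loadu_ps(ty + i), vpy);
        __m256 dz = _mm256_sub_ps(_mm256_loadu_ps(tz + i), vpz);
        __m256 r2 = _mm256_add_ps(_mm256_mul_ps(dx, dx),
                      _mm256_add_ps(_mm256_mul_ps(dy, dy), _mm256_mul_ps(dz, dz)));
        __m256 r  = _mm256_sqrt_ps(r2);                     // slant ranges r_i
        _mm256_storeu_ps(range + i, r);
        _mm256_storeu_ps(phase + i, _mm256_mul_ps(vk, r));  // -4*pi*r_i/lambda
    }
}

// In the scheme of Algorithm 1, OpenMP distributes azimuth times across cores,
// e.g. in the caller:
//   #pragma omp parallel for
//   for (int n = 0; n < T; ++n) { /* ... */ rangeAndPhaseAVX(/* ... */); /* ... */ }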
As for the detailed implementation, SAR raw data simulation can be seen as a series of numerical operations and FFTs. The AVX-supported FFT library can be employed for the FFT operations in the simulation. Therefore, the key issue is how to use AVX for vectorization processing of numerical operations. In order to explain the AVX-based numerical operations, the vectorization processing of signal coherent accumulation, which is the most time-consuming part in the whole raw data simulation, is shown in Fig. 5 and Algorithm 1. The task of each OpenMP core is the calculation of returned signal in each azimuth time in which the AVX instructions are performed. Moreover, the vectorization processing requires regular data storage for vector construction. So, the following different vectorization strategies will be applied in each OpenMP core computing operation:

7 ZHANG et al.: DEEP COLLABORATIVE COMPUTING BASED SAR RAW DATA SIMULATION ON MULTIPLE CPU/GPU PLATFORM 393 1) In azimuth signal calculation, vectorization toward target points will be used, namely one SSE/AVX instruction will finish the range history and corresponding complex signal calculations for 2/4 target points. 2) In the step of return signal coherent accumulation, a vector extension method is not applicable. Because the return signals of each of the adjacent four target points may not be received in the continuous range gate, the accumulation will be processed as a sequential execution. 3) In the frequency transformation step, the latest FFTW library with SSE/AVX support is used to accelerate calculation with vector extension. 4) In spectrum multiplication, vectorization toward data dimension is applied, namely one SSE/AVX instruction will finish multiplications for 2/4 data sampling points. D. Improved GPU Parallel Simulation With Irregular Reduction Algorithm As the previous analysis, the irregularities of signals coherent accumulation make it hard for fine-grained parallelization. For a multicore vector extension method, the distribution of target points that contributed to the same range gate is discontinuous, which makes the vector construction difficult. For the GPU method, the uncertainty of target points in the same range gate leads to sever access conflict. Although the classical GPU based SAR raw data simulation is a kind of mature parallel algorithm [20] [23], which has been extensively discussed, the access conflict issue in the GPU-based signal coherent accumulation is still not thoroughly solved. In this section, we mainly discuss the optimization of the access conflict in the GPU method. We have employed three kinds of methods to implement the coherent accumulation of GPU fine-grained parallel. First, our earlier GPU-based simulation articles [20] can achieve a desirable speedup on GPU processing without considering the access conflict in that each range gate places only one target for ideal simulation. In fact, there are many target points gathered in each range gate. There must be a wrong result if there is no sequence control over massive accumulation operations in one range gate. Second, the atomic operation was used to control the order of each thread s accumulation and guarantee the result [21]. The local sequential execution in GPU parallel decreases the parallel efficiency by several times, which brings challenges to the ongoing optimization or additional hardware investment on multi-gpus based SAR raw data simulation. Third, the interpolation-based method [23] was employed to decrease the number of targets in the same range gate, which reduced the number of conflicts. But, it is an inverse processing solution that uses the memory space to gain the computing efficiency. The estimation of interpolation factor is not always optimal. Thus, we try to discuss the forward processing solution of this issue. The most straightforward solution is the classical reduction algorithm. For regular accumulation, the computational complexity reduces from O(n) to O(log 2 (n)) considering the reduction algorithm. So, the reduction algorithm will boost the accumulation compared with the serial execution by atomic operation. 
the total number and distribution of targets in the same range gate are unknown, they need to be calculated in real time by each GPU thread. Only when the object data and the object count are obtained can the reduction algorithm show its advantages in computing efficiency. Therefore, the initial idea is to divide the original GPU-based coherent accumulation into three parts, namely the calculation of the range gate and signal phase, the regularization of the irregular accumulation, and the parallel reduction on the GPU, as shown in Algorithm 2.

Algorithm 2: Coherent accumulation: GPU version with irregular reduction
1: for each n ∈ [0, T−1] do
2:   Multiple GPUs: each GPU separately simulates the raw data of one azimuth time t_n
3:   for each GPU thread i ∈ [0, M−1] do
4:     GPU thread: each thread finishes the return signal computation and accumulation of target i
5:     r_i(t_n) ← compute the range of target i
6:     fp_i(t_n) ← determine whether target i is in the footprint
7:     rg_i(t_n) ← compute the range gate of target i
8:     θ_i(t_n) ← compute the return signal phase of target i
9:     /* Regularization of the irregular accumulation */
10:    atomicAdd(index_i, 1) ← compute the sequence number in the accumulation queue of the same range gate using an integer atomic add
11:    S_redu[index_i] = σ_i exp(θ_i(t_n)) ← move the signal result to the reduction space
12:   end for
13:   /* Regular reduction calculation */
14:   S_redu[0] = Reduction(S_redu) ← implement the GPU-based regular reduction on the reduction space
15:   s_a[rg_i] = S_redu[0] ← write the result back to the raw data space
16: end for

Through these three steps, the irregular accumulation is changed into a regular reduction issue. The core of the improvement is to construct a reduction space, which is used to store, in order, the return signals belonging to the same range gate. Then, the fast parallel reduction is executed on the reduction space and yields the final accumulation result, as shown in Fig. 7. Note that an integer atomic addition is applied to keep the reduction space storage in order. From the viewpoint of the GPU program, the reduction is basically not time consuming after the seven optimization steps published by the Nvidia Corporation, and two floating-point atomic operations are replaced by one integer atomic operation. Theoretically, a 2× speedup is expected with the proposed reduction optimization.
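A minimal CUDA sketch of the regularization and per-gate reduction in Algorithm 2 is given below. The fixed per-gate capacity MAX_PER_GATE, the kernel names, and the precomputed range-gate and phase inputs are assumptions made for the sketch; the second kernel is a plain per-gate summation standing in for the optimized parallel reduction discussed in the text.

#include <cuda_runtime.h>

#define MAX_PER_GATE 256   // assumed capacity of the reduction space per range gate

// Phase 1: one thread per target. Each thread claims an ordered slot in its
// range gate's queue with a single integer atomicAdd, then stores its complex
// return there, so no floating-point atomics are needed.
__global__ void regularizeKernel(const int*   __restrict__ rangeGate,  // rg_i
                                 const float* __restrict__ sigma,      // sigma_i
                                 const float* __restrict__ phase,      // theta_i(t_n)
                                 float2* reduSpace, int* gateCount,
                                 int M, int numGates) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= M) return;
    int gate = rangeGate[i];
    if (gate < 0 || gate >= numGates) return;          // target outside the swath
    int slot = atomicAdd(&gateCount[gate], 1);         // sequence number in this gate
    if (slot >= MAX_PER_GATE) return;                  // capacity guard for the sketch
    float2 v = make_float2(sigma[i] * cosf(phase[i]),  // sigma_i * exp(j*theta_i)
                           sigma[i] * sinf(phase[i]));
    reduSpace[gate * MAX_PER_GATE + slot] = v;         // ordered, conflict-free store
}

// Phase 2: one block per range gate sums its queue and accumulates into s_a.
// (A real implementation would use the optimized tree reduction instead of
// this serial loop; the loop keeps the sketch short.)
__global__ void reduceGatesKernel(const float2* __restrict__ reduSpace,
                                  const int*    __restrict__ gateCount,
                                  float2* sa, int numGates) {
    int gate = blockIdx.x;
    if (gate >= numGates || threadIdx.x != 0) return;
    float2 acc = make_float2(0.f, 0.f);
    int cnt = min(gateCount[gate], MAX_PER_GATE);
    for (int s = 0; s < cnt; ++s) {
        acc.x += reduSpace[gate * MAX_PER_GATE + s].x;
        acc.y += reduSpace[gate * MAX_PER_GATE + s].y;
    }
    sa[gate].x += acc.x;
    sa[gate].y += acc.y;
}

Compared with accumulating directly into s_a through floating-point atomics, the stores into reduSpace are conflict free and only one integer atomicAdd per target remains, which matches the reasoning behind the roughly 2× improvement estimated above.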

TABLE I
SIMULATION PARAMETERS
Parameters            Value
Wave length           0.05 m
PRF                   2000 Hz
Pulse width           30 μs
Band width            50 MHz
Sampling rate         60 MHz
Velocity              7000 m/s
Azimuth beam width
Center distance       km

TABLE II
HARDWARE SPECIFICATIONS
Parameters                    Value
Number of CUDA cores
GPU float performance         1.5 TFlops × 4
Total dedicated GPU memory    4 GB × 4
GPU memory bandwidth          288 GB/s
Number of CPU cores           12 × 2

Fig. 7. Access conflict optimization based on irregular reduction.

Fig. 8. DEM data (up) and its 3-D visualization (down) of the simulated scene.

IV. EXPERIMENTAL RESULTS

The deep multiple CPU/GPU collaborative computing based SAR raw data simulation includes three improvements, namely the multicore AVX parallelism on CPUs, the multiple CPU/GPU collaborative simulation framework, and the irregular reduction based access conflict elimination on the GPU. Through deep collaborative computing and optimized GPU simulation, the overall efficiency should be improved. Five categories of SAR raw data simulation experiments are designed to discuss the acceleration performance of the multicore AVX, the multiple CPU/GPU collaborative simulation, the irregular reduction optimization on the GPU, the impact of introducing errors into the simulation, and the accuracy of the proposed method. In order to evaluate the results, four groups of experiments with different scene sizes are considered, which are obtained by a bilinear interpolation of the experimental scene. The employed DEM scene is represented in a slant range coordinate system, and its maximum height is 1200 m, as shown in Fig. 8.

Regarding the hardware environment, two Intel Xeon E5 CPUs, providing 24 threads, and two NVIDIA Tesla K10 GPUs, containing four GK104 GPUs, are used in the experiments; the simulation parameters and hardware specifications are listed in Tables I and II. The software environment consists of four components. Specifically, the operating system is Red Hat Linux 6.5, in which the Intel C++ Composer XE 2015 is employed as the compiler, the Intel MKL library is used for the FFT processing of the CPU code, and CUDA 6.5 is selected to drive the GPU parallel computing. Furthermore, OpenMP is employed for the 24-thread parallel processing, AVX is exploited for the SIMD vector extension, and CUDA is used for the four-GPU parallel computing. Specifically, in the collaborative computing mode, there are 20 threads working on the CPU parallel simulation and 4 threads controlling the four GPUs. Note that the sequential single-core CPU method is employed as the baseline program for the speedup test. The running time of the CPU code considers only the core raw data simulation, while the results of the GPU code take into account the input/output time, namely the GPU memory allocation and the data transfer between the CPU and the GPU, as well as the core GPU simulation time.

A. Performance Analysis on Multicore AVX Acceleration

First, we try to evaluate the performance of the multicore vector extension technology. Four experiments are executed on a single-core CPU, a six-core CPU using OpenMP and AVX, two

CPUs, and a single GPU, respectively, and the results are shown in Table III. From the results, it can be seen that the multicore vectorization parallel based method improves the efficiency of the sequential method by about a 38× speedup per CPU. For one six-core CPU parallelized without AVX vectorization, 12 threads are started up by OpenMP. Then, the speedup should be 10.8 if it is assumed that the parallel efficiency is around 90%. Accelerated by the AVX vectorization, namely processing four data items with one instruction, an extra 4× speedup is expected. As the simulation is not completely vectorized by AVX due to the raw data, the expected result should be less than a 43.2× speedup. The experimental results are therefore reasonable and initially prove that there is at least a 3×-4× additional speedup when introducing AVX to multicore parallel computing. For the fourth experiment, the simulation time is reduced from 7.7 h in sequential mode to 8.5 min in multicore AVX mode. So, the introduction of the multicore and AVX parallelism significantly improves the raw data simulation efficiency.

TABLE III
SIMULATION TIME (s) COMPARISON
Scene size    Serial    1 AVX CPU    2 AVX CPU    1 GPU

Moreover, the performance comparison between AVX and a GPU is the most important topic in this paper, and determines the feasibility of the deep collaborative computing method. There would be no basis for multiple CPU/GPU deep collaborative computing if the computing capability gap between them were too large. Their acceleration performance comparison is illustrated in Table III and Fig. 9. It proves that the two-CPU parallelism with AVX support is basically close to single-GPU computing in terms of performance in all experiments, and can be treated as one GPU for task allocation in collaborative computing. Therefore, the AVX method is a useful complement to the existing multicore CPU parallel computing, and even to collaborative computing.

Fig. 9. The acceleration of the multicore AVX parallel and GPU parallel methods.

B. Performance Analysis on Deep Collaborative Computing Method

The excellent computing ability of the AVX method lays the foundation of the deep multiple CPU/GPU collaborative computing. As the simulation task is not huge enough, the task partition is based on the static strategy, which evenly divides the task into five parts and distributes them among the GPUs and CPUs. Then, they implement their corresponding subtasks, which are finally merged into the whole raw data. In most hardware configurations, one multicore CPU is a more realistic scenario. So, raw data simulation experiments based on four GPUs, four GPUs plus one CPU, and four GPUs plus two CPUs are performed to test the collaborative computing effect, the simulation time, and the speedup, as listed in Table IV. From the results, it can be seen that the overall simulation efficiency is improved by up to 60, and the efficiency of the four-GPU simulation is improved by up to 30%.

TABLE IV
COLLABORATIVE SIMULATION TIME (s) COMPARISON
Scene size    Serial    4 GPU    4 GPU + 1 CPU    4 GPU + 2 CPU

Fig. 10 illustrates the SAR raw data simulation speedups for the aforementioned three configurations. The conclusion is distinct in that the proposed collaborative computing not only improves the existing multi-GPU-based SAR raw data simulation, but also makes full use of hardware computing resources for higher efficiency.

Fig. 10. Performance comparison among three HPC configurations.
Hence, the deep collaborative computing method is feasible and applicable. In order to construct a fair comparison between the CPU and GPU, another simulation experiment is designed by considering three GPUs, four GPUs, and three GPUs + two CPUs. Through this experiment, the computing capacities of each of those combinations can be further exposed. The results are shown in Table V. From the results, it can be seen that the efficiency of two AVX based CPUs in collaborative computing is better than one GPU in that the parallel efficiency of multi-gpus computing is not high enough. The possible reason is related with the hardware design of the K10 GPU, which has two internal GK104

GPUs. If four separate GPUs are employed, such as four K20 GPUs, this parallel efficiency issue will be improved.

TABLE V
SIMULATION TIMES (s) COMPARISON
Scene size    3 GPU    4 GPU    3 GPU + 2 CPU

In the aforementioned calculation of the simulation time, the GPU part takes the memory operations into account, while the CPU part does not. It is necessary to demonstrate the running time of the memory operations in order to further understand these two kinds of parallelism. Although the memory operations include the GPU memory allocation, the data transfer, and the memory release, they take only a small amount of time. Considering the fourth experiment, for example, the data transfer volume mainly includes the positions of the platform and the targets, the scattering coefficients of the targets, and the raw data, and is no more than 1 gigabyte (GB). The bandwidth between the CPU and the GPU is about 8 GB/s, so the theoretical transfer time is below 1 s. The elapsed times of the memory operations and of the pure GPU computing are shown in Table VI. It is shown that the ratio of the data transfer time is below 1% as the data volume increases. Therefore, the data transfer is not a major factor in the GPU-based SAR raw data simulation, and it has a very weak impact on collaborative computing.

TABLE VI
TIME (s) ANALYSIS OF GPU MEMORY OPERATIONS
Scene size    Memory operation    GPU computing

C. Performance Analysis on Irregular Reduction Optimization

As the previous analysis shows, the coherent accumulation is the main time-consuming part of the whole time-domain raw data simulation. The returned signal coherent accumulation costs about 80% of the total simulation time in the GPU-based simulation. The main reason is the irregularity of the coherent accumulation, which is realized by sequential atomic operations and slows down the overall simulation efficiency. Therefore, optimizing the sequential accumulation with atomic operations is effective and feasible.

To deeply analyze the optimization of the irregular reduction algorithm, we deploy a raw data experiment under the condition of a single GPU, and attempt to understand the running time of the atomic operations and of the irregular reduction optimization. In order to enlarge the effect of the coherent accumulation, we simulate more target points in one range gate (results shown in Fig. 11 and Table VII), where the classical GPU and the improved GPU indicate the coherent accumulation with atomic operations and with the irregular reduction, respectively.

TABLE VII
ACCESS CONFLICT OPTIMIZATION SPEEDUP WITH RESPECT TO SEQUENTIAL METHOD
Scene size    Classical GPU    Improved GPU    Acceleration

Fig. 11. The acceleration analysis of the irregular reduction algorithm.

From the results, it can be seen that the running time of a single GPU increases with the number of coherently accumulated targets. Also, it is clear that the irregular reduction optimization achieves around a 20% speedup, which is in accord with our theoretical analysis. After this optimization, the raw data simulation based on multiple CPU/GPU collaborative computing will reach a higher running efficiency.
Strengthened by HPC technologies, the time-domain algorithm is capable of simulating more scatterers and even an extended scene. Nevertheless, its simulation efficiency is still less than the frequency-domain algorithm. The key advantage of the time-domain algorithm is to easily take into account the systematic errors and motion error, which are constant in each azimuth time. Hence, these errors can be calculated in advance, and are introduced into the raw data simulation by simple additions or multiplications in each azimuth time, which will not slow down the efficiency too much. Furthermore, in raw data simulation for a practical mission, the channel errors of transceiver hardware always come from the real data, which should be transferred to memory for simulation computing. Compared with the theoretical simulation, the increased calculation time of operative condition simulation is mainly reflected in the data transfer and arithmetic operations. Therefore, two experiments considering the trajectory deviation and system channel error are designed to analyze their impact on the simulation efficiency, and the results are shown in

Table VIII. As these errors are calculated in advance, different error models have the same introduction process, namely the error transfer and the error additions. Hence, two simplified models are employed to generate these errors. For the space-borne SAR trajectory deviation simulation, fixed deviations of 0.1 m on the three axes are directly added to the coordinates of the platform. As for the channel error simulation, random phase errors are directly added to the phase of each accumulated return signal. Compared with the ideal simulation, the operative condition simulation increases the running time only slightly, which proves its merit in accounting for the space-varying and time-varying errors.

TABLE VIII
SIMULATION TIME (s) COMPARISON BETWEEN IDEAL AND OPERATIVE CONDITION SIMULATION
Scene size    Trajectory Deviation    Channel Error    Ideal

E. Accuracy Analysis on Deep Collaborative Computing Method

After discussing the acceleration of the deep collaborative computing method, an accuracy analysis of the method is given in this section. An experiment with a point target embedded into the scene is designed to assess the simulation accuracy for both the point target and the extended scene. We first calculate the scattering coefficient map by considering the SAR geometry and speckle noise, then fill the upper-left area with zeros, and put one point target in the middle of this area. This experiment includes three steps, namely the collaborative computing based raw data simulation of the designed scene, imaging with the Chirp Scaling algorithm, and image quality assessment. The resolution (Res), expansion coefficient (Coef), peak sidelobe ratio (PSLR), and integrated sidelobe ratio (ISLR) are used for the SAR image assessment, and they also serve as indicators of the accuracy of the deep collaborative computing based raw data simulation.

The imaging result is shown in Fig. 12. We can see that the image result is in accord with the input DEM data, and the point target is also accurately focused. The embedded point target quality assessment is carried out, and the results are listed in Table IX. From these results, we can conclude that the deep collaborative computing approach can guarantee the simulation accuracy, and can be applied to other SAR related computing issues.

Fig. 12. Imaging result of multiple CPU/GPU collaborative simulated raw data for the point target embedded scene.

TABLE IX
SIMULATION ACCURACY FOR EMBEDDED POINT TARGET
            Res    Coef    PSLR    ISLR
Azimuth
Range

Through the above discussion, we summarize that the multicore vector extension based method brings new vitality to the existing CPU parallel hierarchy, and boosts the computing power of a CPU to bring it on par with a GPU. Then, the irregular reduction algorithm is an effective solution for eliminating the memory access conflicts caused by multiple GPU threads waiting, and it greatly improves the parallel efficiency on a GPU. Finally, the deep collaborative computing of multiple hardware resources can obviously shorten the simulation period, and should be applied to practical applications, such as SAR system parameter design and imaging algorithm verification for a new operating mode. Therefore, the deep multiple CPU/GPU collaborative computing based SAR raw data simulation is particularly suitable for the case of huge computation, especially large-area raw data simulation.
V. CONCLUSION

In this paper, we exploited multiple CPU/GPU collaborative computing to solve the calculation bottleneck of SAR raw data simulation for a large area. We proposed a multiple SIMD CPU/GPU collaborative parallel based time-domain SAR raw data simulation method. More specifically, three improvements were introduced: the first one was the multicore vector extension method, which not only greatly improved the computing power of a CPU and opened up the possibility of collaborative computing, but also made the CPU work under a SIMD framework similar to that of a GPU and realized the deep collaboration; the second one was the deep collaborative computing framework of multiple computing resources, which improved the simulation efficiency beyond that of the popular multi-GPU simulation method; and the third one was the irregular

12 398 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 10, NO. 2, FEBRUARY 2017 reduction based access conflict elimination, which changed the irregular accumulation issue to a regular issue with regularization processing and achieved a satisfied speedup over the basic GPU method. The experimental results showed that the multicore vectorization method with two CPUs was close to the classical single GPU method, and boosted the singlecore CPU simulation efficiency by over 70 speedup, thus the collaborative computing achieving about 250 speedup. Besides, the irregular reduction method achieved 20% acceleration compared with the classical GPU simulation. Furthermore, the collaborative method improved the simulation efficiency through exploiting the idle CPU computing resource and GPU optimization, having advantages of energy saving and low hardware cost. The proposed method has been verified to be suitable for the airborne/space-borne SAR simulation, and is expected a better application in PolSAR or InSAR as the multidimensional data processing can be more easily partitioned along multiple dimensions and mapped to the deep collaborative computing framework of multiple resources. The future work of this research will apply the deep multiple CPU/GPU collaborative computing method to SAR imaging, and SAR target recognition and classification, and will conduct a preliminary attempt for the real-time processing and application of SAR products. REFERENCES [1] A. Moreira, P. Prats-iraola, M. Younis, G. Krieger, I. Hajnsek, and K. Papathanassiou, A tutorial on synthetic aperture radar, IEEE Geosci. Remote Sens. Mag., vol. 1, no. 1, pp. 6 43, Mar [2] G. Franceschetti, R. Guida, A. Iodice, D. Riccio, and G. Ruello, Efficient simulation of hybrid Stripmap/Spotlight SAR raw signals from extended scenes, IEEE Trans. Geosci. Remote Sens.,vol.42,no.11,pp , Nov [3] G. Franceschetti, M. Migliaccio, and D. Riccio, SAR simulation: An overview, in Proc. IEEE Geosci. Remote Sens. Symp., 1995, pp [4] E. Boerner, R. Lord, J. Mittermayer, and R. Bamler, Evaluation of TerraSAR-X spotlight processing accuracy based on a new spotlight raw data simulator, in Proc. IEEE Geosci. Remote Sens. Symp., 2003, pp [5] A. Mori and F. D. Vita, A time-domain raw signal simulator for interferometric SAR, IEEE Trans. Geosci. Remote Sens., vol. 42, no. 9, pp , Sep [6] G. Franceschetti, M. Migliaccio, D. Riccio, and S. Gilda, SARAS: A synthetic aperture radar (SAR) raw signal simulator, IEEE Trans. Geosci. Remote Sens., vol. 30, no. 1, pp , Jan [7] K. Eldhuset, Raw signal simulation for very high resolution SAR based on polarimetric scattering theory, in Proc. IEEE Geosci. Remote Sens. Symp., 2004, vol. 3, pp [8] X. Qiu, D. Hu, L. Zhou, and C. Ding, A bistatic SAR raw data simulator basedoninverseω-κ algorithm, IEEE Trans. Geosci. Remote Sens., vol. 48, no. 3, pp , Mar [9] A. Khwaja, L. Ferro-Famil, and E. Pottier, SAR raw data generation using inverse SAR image formation algorithms, in Proc. IEEE Geosci. Remote Sens. Symp., 2006, pp [10] B. Deng, Y. Qing, H. Wang, X. Li, and Y. Li, Inverse frequency scaling algorithm (IFSA) for SAR raw data simulation, in Proc. Int. Conf. Signal Process. Syst., 2010, vol. 2, pp [11] G. Franceschetti, M. Migliaccio, and D. Riccio, SAR raw signal simulation of actual ground sites described in terms of sparse input data, IEEE Trans. Geosci. Remote Sens., vol. 32, no. 6, pp , Nov [12] G. Franceschetti, A. Iodice, M. Migliaccio, and D. 

Wei Hu received the B.S. and M.S. degrees in computer science from Dalian University of Science and Technology, Dalian, China, in 1999 and 2002, respectively, and the Ph.D. degree in computer science from Tsinghua University, Beijing, China. He is currently an Associate Professor of computer science at Beijing University of Chemical Technology, Beijing. His research interests include computer graphics, computational photography, and scientific visualization.
Fan Zhang (S'07-M'10) received the B.E. degree in communication engineering from the Civil Aviation University of China, Tianjin, China, in 2002, the M.S. degree in signal and information processing from Beihang University, Beijing, China, in 2005, and the Ph.D. degree in signal and information processing from the Institute of Electronics, Chinese Academy of Sciences, Beijing. He is currently an Associate Professor of electronic and information engineering at Beijing University of Chemical Technology, Beijing. His research interests include synthetic aperture radar signal processing, high performance computing, and scientific visualization. Dr. Zhang has been a Reviewer for the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, and the International Journal of Antennas and Propagation.
Pengbo Wang received the Ph.D. degree in information and communication engineering from Beihang University (Beijing University of Aeronautics and Astronautics, BUAA), Beijing, China. He was a Postdoctoral Researcher with the School of Electronics and Information Engineering, Beihang University, where he was a Lecturer from 2010 to 2015 and has been an Associate Professor since July 2015. He has authored and coauthored more than 40 journal and conference publications, and he holds ten patents in the field of microwave remote sensing. His current research interests include high-resolution space-borne SAR image formation, novel techniques for space-borne SAR systems, and multimodal remote sensing data fusion.
Chen Hu (S'15) received the B.E. degree in electronic and information engineering from Beijing University of Chemical Technology, Beijing, China, in 2013, where he is currently working toward the M.S. degree in the field of high performance computing. His research interests include parallel computing, distributed computing, and image processing.
Wei Li (S'11-M'13) received the B.E. degree in telecommunications engineering from Xidian University, Xi'an, China, in 2007, the M.S. degree in information science and technology from Sun Yat-Sen University, Guangzhou, China, in 2009, and the Ph.D. degree in electrical and computer engineering from Mississippi State University, Starkville, MS, USA. Subsequently, he spent one year as a Postdoctoral Researcher at the University of California, Davis, CA, USA. He is currently with the College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China. His research interests include statistical pattern recognition, hyperspectral image analysis, and data compression. Dr. Li is an active Reviewer for the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, and the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING (JSTARS). He received the 2015 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society for his service to the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING.
Heng-Chao Li (S'06-M'08-SM'14) received the B.Sc. and M.Sc. degrees from Southwest Jiaotong University, Chengdu, China, in 2001 and 2004, respectively, and the Ph.D. degree from the Graduate University of the Chinese Academy of Sciences, Beijing, China, in 2008, all in information and communication engineering. He is currently a Professor with the Sichuan Provincial Key Laboratory of Information Coding and Transmission, Southwest Jiaotong University. Since November 2013, he has been a Visiting Scholar working with Prof. W. J. Emery at the University of Colorado Boulder, Boulder, CO, USA. His research interests include statistical analysis and processing of synthetic aperture radar images, and signal processing in communications. Dr. Li has received several scholarships and awards, including the Special Grade of the Financial Support from the China Postdoctoral Science Foundation in 2009 and the New Century Excellent Talents in University award from the Ministry of Education of China. In addition, he has been a Reviewer for several international journals and conferences, such as the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, the IEEE TRANSACTIONS ON IMAGE PROCESSING, IET Radar, Sonar & Navigation, and the Canadian Journal of Remote Sensing. He is currently serving as an Associate Editor of the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING.
