ACCELERATING MOTION-COMPENSATED ADAPTIVE COLOR DOPPLER ENGINE ON CUDA-BASED GPU PLATFORM


2013 IEEE Workshop on Signal Processing Systems

I-Hsuan Lee, Yu-Hao Chen, Nai-Shan Huang, and An-Yeu (Andy) Wu
Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, 106, Taiwan, R.O.C.
{Raylee, zerobigtree,

ABSTRACT

Color Doppler imaging is used to observe the blood flow distribution during a doctor's diagnosis. However, the desired blood signal is greatly affected by clutter noise and probe motion. In previous works, a color Doppler engine was proposed, consisting of algorithms that effectively eliminate clutter noise and motion artifacts. Because a large amount of data and computation is involved, the color Doppler engine cannot achieve real-time imaging on CPU-based PCs. Therefore, we accelerate the color Doppler engine by parallelizing its execution on a many-core GPU. In this work, we explore the parallelism and data locality of the motion-compensated color Doppler algorithms and implement them on a CUDA-based GPU platform. The proposed design methodology achieves a speedup of up to 54.8 times.

Index Terms: Color Doppler, CUDA, Ultrasound Imaging, GPGPU, Motion-compensated

1. INTRODUCTION

Color Doppler imaging is a well-established ultrasound mode that is valuable for observing the blood flow distribution in a specific region of interest [2]. Since the blood flow signal is very weak compared with clutter noise, suppressing clutter noise effectively is one of the major problems in color Doppler processing. In clinical examination, relative motion between the ultrasonic probe and the target region, such as motion caused by breathing, can severely degrade the image quality. To suppress clutter noise with motion artifacts, an eigen-based clutter filter was proposed in [10], which adapts the passband and stopband of the clutter filter to the moving tissue. In the eigen-based clutter filter, the mean frequency of the dominant eigen-component is regarded as the center frequency of the clutter noise, so that clutter noise with tissue motion can be suppressed adaptively. However, the blood signal may be removed incorrectly by this method when the signal-to-clutter ratio (SCR) of the eigen-component distribution is high. In [1], a velocity bias cancellation algorithm was proposed to eliminate the motion artifacts and compensate for the biased flow velocity. To avoid incorrect suppression of the blood signal, a joint-decision clutter filter was also proposed in [8]. By combining the two algorithms in [1] and [8], the clutter noise with motion artifacts can be effectively suppressed while preserving most of the blood signal. Besides the two algorithms, we also combine the adaptive-size median filtering [4] into the design. The proposed motion-compensated color Doppler engine can improve the SCR by 3 to 9 dB and reduce the blood velocity error by over 69%.

Real-time imaging is the major advantage of ultrasound imaging compared with other medical diagnostic imaging methods, such as MRI and CT. Unfortunately, the computational complexity of the proposed color Doppler engine is too high to achieve real-time imaging on CPU-based desktop PCs. In this work, we explore the parallelism and data locality of the proposed color Doppler algorithms. Using this methodology, we accelerate the computation by implementing the Doppler engine in the Compute Unified Device Architecture (CUDA) language [9]. CUDA is adopted for three reasons. First, CUDA is more flexible than other parallel programming models because it allows inter-thread and inter-block communication via different memories. Second, CUDA programs are compatible with different CUDA-capable GPUs, given proper segmentation of the parallel processing kernel. Third, since each kernel is called by the host as a subroutine, the color Doppler engine can be easily upgraded with new algorithms by adding a new kernel without modifying the original kernels. With the proposed parallelism and data locality of the color Doppler algorithms, the execution is 54.8 times faster on the GPU than on the CPU.

This paper is organized as follows. The color Doppler engine is reviewed in Section 2. In Section 3, we present the GPU-based implementation of the color Doppler engine. The experimental results and comparisons are provided in Section 4. Finally, the conclusion is given in Section 5.

2. REVIEW OF MOTION-COMPENSATED COLOR DOPPLER ENGINE

As shown in Fig. 1, the color Doppler engine contains six functional blocks, each of which is described in this section.

A. Biased Velocity Estimation [1]

Since the tissue region is relatively stationary, the velocity bias of a frame is approximately the mean velocity of the tissue region. However, the mean velocity of the blood region is
much higher than that of the tissue region, so the mean velocity of a frame is not equal to the mean tissue velocity. To obtain the velocity bias, energy is used as a weighting to reduce the influence of the blood velocity. Thus, the velocity bias can be computed as follows:

$$v_{bias} = \frac{\sum_{m=1}^{M}\sum_{n=1}^{N} E_{m,n}\, v_{m,n}}{\sum_{m=1}^{M}\sum_{n=1}^{N} E_{m,n}}, \qquad (1)$$

where M is the frame height and N is the frame width. $E_{m,n}$ is the energy of the ensemble at location (m, n), which is defined as

$$E_{m,n} = \sum_{k=0}^{K-1} x_{m,n}(kT)\, x_{m,n}^{*}(kT). \qquad (2)$$

K is the ensemble size, $x_{m,n}(kT)$ is the k-th complex sample of the ensemble, and T is the pulse repetition interval. Then, $v_{m,n}$ is the velocity of the ensemble, which can be obtained by

$$R_{m,n}(T) = \sum_{k=0}^{K-2} x_{m,n}((k+1)T)\, x_{m,n}^{*}(kT), \qquad (3)$$

$$v_{m,n} = \frac{c}{4\pi f_0 T}\, \angle R_{m,n}(T), \qquad (4)$$

where $c$ is the sound speed in the human body, $f_0$ is the center frequency of the ultrasound wave, and $\angle R_{m,n}(T)$ represents the phase of $R_{m,n}(T)$. As discussed in [1], the velocity bias is used not only for motion compensation; it also helps adjust the frequency threshold of the joint-decision clutter filter to improve the performance.

Fig. 1. Block diagram of the color Doppler processing.

B. Joint-decision Clutter Filter [8]

The joint-decision clutter filter [8] combines an eigen-based criterion with a frequency-based criterion. The eigen-based criterion removes clutter noise by an eigenvalue ratio threshold. Since the energy of the clutter eigen-component is larger than that of the blood eigen-component, an eigen-component is considered clutter noise when the eigenvalue ratio of the eigen-components exceeds the threshold. In the frequency-based criterion, an eigen-component is regarded as clutter noise when the difference between its velocity and the velocity bias is smaller than an adaptive threshold. In [1], we discovered that there is a linear relation between the clutter noise spectrum width and the velocity bias. Hence, the frequency threshold is tuned adaptively by the velocity bias.

The joint-decision clutter filter removes an eigen-component only when both the eigen-based and the frequency-based criteria are satisfied. Hence, misjudgments can be avoided and the blood signal loss is minimized. To further accelerate computation, we implement the most complex function, the eigenvalue decomposition, with the SL-SVD algorithm [5]. The SL-SVD algorithm not only converges rapidly but also has low computational complexity. Furthermore, since the autocorrelation matrices of pixels are Hermitian, the matrix multiplications in kernel 2 can be simplified, which reduces the computational complexity by about 37.5%.

C. Adaptive Thresholding

The purpose of adaptive thresholding is to remove the noise that remains after clutter filtering, such as white noise. Since most of the clutter noise is suppressed by the clutter filter, if the proportion of energy removed by clutter filtering exceeds a threshold, the pixel is regarded as located in the tissue region and the filtered signal is considered noise. Hence, the pixel is not displayed in the color flow image.

D. Flow Parameter Estimation [2]

We adopt the autocorrelation technique [2], which is widely used, to calculate the flow parameters. For each Doppler ensemble received from a particular gate, the functions of the parameters are listed in Table 1. After the flow velocity is calculated, the compensated velocity is obtained by subtracting the velocity bias from the flow velocity directly. With this method, the blood velocity error is reduced by over 69% relative to conventional methods [1].

Table 1. Parameters and functions of autocorrelation technique.
Parameter | Function
Lag-0 autocorrelation | $R(0) = \frac{1}{K}\sum_{k=0}^{K-1} x(kT)\, x^{*}(kT)$
Lag-1 autocorrelation | $R(T) = \frac{1}{K-1}\sum_{k=0}^{K-2} x((k+1)T)\, x^{*}(kT)$
Velocity | $V = \frac{c}{4\pi f_0 T}\, \angle R(T)$
Variance | $\sigma^{2} = \frac{2}{T^{2}}\left(1 - \frac{|R(T)|}{R(0)}\right)$
Energy | $E = R(0)$

E. Adaptive Persistence

Adaptive persistence temporally smooths successive frames by using a pixel-by-pixel 2-tap IIR filter. A forgetting factor α in the range 0 ≤ α < 1 is selected, adaptively to the condition of the target, to obtain the new average $\bar{F}_{t}$ from the previous average $\bar{F}_{t-1}$ and the new frame $F_{t}$, which is written as:

$$\bar{F}_{t} = \alpha\, \bar{F}_{t-1} + (1-\alpha)\, F_{t}. \qquad (5)$$
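The recursion in (5) can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the function name `persist`, the nested-list frame layout, and the fixed forgetting factor are hypothetical, and the adaptive selection of the forgetting factor is not modeled.

```python
def persist(prev_avg, new_frame, alpha):
    """Blend the previous average with the new frame, pixel by pixel.

    alpha is the forgetting factor (0 <= alpha < 1). A larger alpha lets the
    previous average contribute more. A fixed alpha is an assumption of this
    sketch; the engine in the text chooses it adaptively.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    return [
        [alpha * p + (1.0 - alpha) * f for p, f in zip(prev_row, frame_row)]
        for prev_row, frame_row in zip(prev_avg, new_frame)
    ]


if __name__ == "__main__":
    average = [[0.0, 0.0], [0.0, 0.0]]
    frame = [[1.0, 2.0], [3.0, 4.0]]
    for _ in range(3):
        average = persist(average, frame, alpha=0.5)
    print(average)
```

Each output pixel depends only on the same pixel of the previous average and of the new frame, which is why this step fits the pixel-level parallelism discussed later.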

The forgetting factor α is larger under a stationary condition, so that the previous average contributes more to the new average and the image quality is improved.

F. Adaptive-size Median Filter [4]

The median filter [4] is the most common technique for filtering salt-and-pepper noise. A large median filter suppresses noise better than a small one, but it blurs the image more. Since a fixed-size median filter cannot deliver sharpness and noise suppression simultaneously, the adaptive-size median filter [3] was proposed to combine the advantages of large and small median filters. The window size of the adaptive-size median filter is selected based on the energy of the target pixel, because the energy of the noise remaining after clutter filtering and thresholding is much smaller than the energy of blood signals. A small median filter (3×3) is chosen in the blood region for better sharpness, while a large median filter (5×5) is selected in the tissue region for better noise suppression. To further improve the quality, the energy of the 8 pixels adjacent to the target is also examined to improve the accuracy of judging the anatomical boundaries and the blood region.

Fig. 2. Flowchart of proposed color Doppler engine implemented on CUDA-based GPU platform.

3. MOTION-COMPENSATED COLOR DOPPLER ENGINE ON CUDA-BASED GPU PLATFORM

We implement the color Doppler engine on a CUDA-based GPU platform. The flowchart of the GPU-based implementation of the motion-compensated color Doppler engine is shown in Fig. 2, where N is the size of a frame and M is the number of rows in a frame. K1, K2, and K3 are the thread block sizes of kernel 1, kernel 2, and kernel 3 in Fig. 2, respectively. Since the algorithms of the color Doppler engine differ in parallelism and data locality, we divide the engine into three kernels. The first kernel is the biased velocity estimation. The second kernel includes four functions: the clutter filter, the thresholding, the flow parameter estimation, and the persistence. The third kernel is the adaptive-size median filter.

Kernel 1 (Biased velocity estimation)

According to the functions of the proposed biased velocity estimation algorithm, we divide kernel 1 into three stages. The first stage computes the energy and velocity of each pixel from (2), (3), and (4). Then, the second and third stages estimate the velocity bias from (1).

Fig. 3. Image segmentation and corresponding thread structure of pixel-level parallelism. (IB represents the image block and TK represents the K-th thread in the thread block.)

A. Pixel-level parallelism

Kernel 1 is segmented based on the parallelism of stage 1, since stage 1 has the highest computational complexity of all stages. We discovered that the data of each pixel are independent of other pixels, so pixel-level parallelism is adopted. The image segmentation and the corresponding thread structure of pixel-level parallelism are shown in Fig. 3, where M is the number of blocks and K is the size of the blocks. In kernel 1, a color Doppler frame is divided into many image blocks. Each image block is processed by a corresponding thread block, and all thread blocks are processed in parallel on the GPU. Since the total number of threads is equal to the number of pixels in a frame, the next issue is how to decide the size of the thread blocks. To improve memory efficiency, the block size should be large enough to make good use of the registers and shared memory. Moreover, a large block size can increase parallelism and hide the latency of memory access.
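To make the data flow of the three stages of kernel 1 concrete, the following Python sketch runs them sequentially on the CPU. It is not the CUDA kernel itself, only a reference-style illustration. The function names, the nested-list frame layout, and the default parameter values are assumptions; the defaults reuse the simulation settings quoted in Section 4 (1540 m/s, 5 MHz, and a pulse repetition interval equal to the reciprocal of the 3.5 kHz repetition rate).

```python
import cmath
import math


def pixel_energy_velocity(samples, c, f0, prt):
    """Stage 1: energy (2) and velocity (3)-(4) of one pixel's ensemble.

    samples: complex slow-time samples of the ensemble (length K).
    c: sound speed [m/s]; f0: center frequency [Hz];
    prt: pulse repetition interval [s].
    """
    energy = sum((s * s.conjugate()).real for s in samples)
    lag1 = sum(samples[k + 1] * samples[k].conjugate()
               for k in range(len(samples) - 1))
    velocity = c / (4.0 * math.pi * f0 * prt) * cmath.phase(lag1)
    return energy, velocity


def velocity_bias(frame, c=1540.0, f0=5.0e6, prt=1.0 / 3500.0):
    """Stages 2 and 3: energy-weighted mean velocity of the frame, as in (1).

    frame: 2-D nested list in which each entry is one pixel's ensemble.
    Raises ZeroDivisionError if the total energy of the frame is zero.
    """
    weighted_sum = 0.0
    energy_sum = 0.0
    for row in frame:
        for samples in row:
            energy, velocity = pixel_energy_velocity(samples, c, f0, prt)
            weighted_sum += energy * velocity   # stage 2: two running sums
            energy_sum += energy
    return weighted_sum / energy_sum            # stage 3: a single division
```

Stage 1 touches each pixel independently, which is the property that pixel-level parallelism exploits, while stages 2 and 3 are the only parts that combine data across pixels.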

Fig. 4. The implementation of summations is based on binary adder tree architecture.

B. Summations based on binary adder tree architecture

Owing to the pixel-level parallelism of kernel 1, we can implement the summations in parallel based on the architecture of a binary adder tree. Hence, the execution time of the summations is reduced from O(N) to O(log N), where N is the number of pixels in a frame. The summations are separated into inter-thread summations and inter-block summations because of the segmentation of kernel 1. The summations, as shown in Fig. 4, avoid bank conflicts in shared memory and thus reduce the execution time further. According to (1), stage 3 computes only a single division of the sum of weighted velocities by the total energy, so it can be processed by only a single thread. Therefore, only the first thread in the first block is active and calculates the velocity bias in stage 3, as shown in Fig. 2.

Kernel 2 (Clutter filter, adaptive thresholding, flow parameter estimation, and adaptive persistence)

We merge the algorithms with the same data dependency into kernel 2: the joint-decision adaptive clutter filter, adaptive thresholding, flow parameter estimation, and adaptive persistence. Because the data of each pixel are independent of other pixels in kernel 2, pixel-level parallelism is adopted. We separate kernel 2 from kernel 1 because of the difference in data locality. Unlike kernel 1, kernel 2 has neither inter-thread nor inter-block communication. Thus, kernel 2 is fully parallelized and accelerated times faster on GPU. The execution flow of kernel 2 is shown in Fig. 2. Because kernel 2 is fully parallel, all data are passed through registers, as shown in Fig. 2.
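As a hedged illustration of the binary adder tree summation used for kernel 1, the Python sketch below reduces a list in log2-many passes. Each pass performs the additions that the parallel threads would perform in one step; here the passes simply run sequentially. The function name `tree_sum`, the zero padding to a power of two, and the addressing pattern (element i added to element i + stride, a pattern commonly used in GPU reductions to avoid shared-memory bank conflicts) are assumptions of this sketch, not details confirmed by Fig. 4.

```python
def tree_sum(values):
    """Sum values with a binary adder tree.

    Returns (total, number_of_passes). Every addition inside one pass is
    independent of the others, so a pass corresponds to one parallel step.
    """
    data = list(values)
    n = len(data)
    if n == 0:
        return 0.0, 0
    # Pad with zeros to a power of two so each pass halves the active range.
    size = 1
    while size < n:
        size *= 2
    data.extend([0.0] * (size - n))

    passes = 0
    stride = size // 2
    while stride > 0:
        for i in range(stride):        # one independent addition per "thread"
            data[i] += data[i + stride]
        stride //= 2
        passes += 1
    return data[0], passes


if __name__ == "__main__":
    total, passes = tree_sum(range(1, 9))
    print(total, passes)
```

With N inputs the loop runs about log2(N) passes instead of N sequential additions, which is the reduction from O(N) to O(log N) described for the summations.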
Since the output of each function is exactly the input of the next function in kernel 2, each thread executes without interruption from the joint-decision clutter filter through adaptive persistence once the velocity bias and raw data have been loaded from global memory.

Kernel 3 (Adaptive-size median filter)

The adaptive-size median filter is implemented in kernel 3 because its data dependency differs from that of the algorithms in the prior kernels. The output of a median filter depends on several neighboring pixels of the target pixel. On the other hand, each pixel is accessed several times by different filters. For example, each pixel is accessed 25 times with a window size of 5×5. This data dependency leads to severe global memory conflicts and extremely high latency. Therefore, pixel-level parallelism is not suitable for kernel 3. In fact, the execution time of kernel 3 with pixel-level parallelism is about 30 ms, which is too slow for real-time imaging.

Fig. 5. Segmentation and data partition of kernel 3 in row-level parallelism.

A. Row-level parallelism

We noticed that adjacent pixels in the same row can reuse most of the data during median filtering. Hence, instead of pixel-level parallelism, we parallelize the adaptive-size median filter at the row level. In row-level parallelism, each row of a frame is processed by a corresponding thread. The segmentation and data partition in row-level parallelism are shown in Fig. 5, where M is the number of blocks in a frame and K is the size of the blocks. In this way, the number of memory accesses is reduced considerably and memory conflicts are also decreased. After the total number of threads is decided, the next issue is deciding the block size. Because of the limited amount of shared memory, the block size depends on the width of the frame. Large blocks reduce memory overhead by reducing the number of rows that need to be duplicated between adjacent image blocks, which are marked by yellow rectangles in Fig. 5.
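The effect of block size on duplicated rows can be checked with simple arithmetic. The Python sketch below counts how many extra rows are loaded when a frame is split into image blocks and every block also loads its neighboring rows. The function name `duplicated_rows` and the halo of 2 rows per side (what a 5×5 window would need) are assumptions made for illustration, and the row counts in the usage lines are hypothetical.

```python
def duplicated_rows(total_rows, block_rows, halo=2):
    """Count rows loaded more than once across all image blocks.

    Each block owns block_rows consecutive rows and additionally loads up to
    `halo` rows above and below, as far as they exist inside the frame.
    """
    duplicated = 0
    for start in range(0, total_rows, block_rows):
        end = min(start + block_rows, total_rows)
        first = max(0, start - halo)
        last = min(total_rows, end + halo)
        duplicated += (last - first) - (end - start)
    return duplicated


if __name__ == "__main__":
    for block in (64, 256):
        extra = duplicated_rows(3840, block)
        print(block, extra, extra / 3840)
```

Interior blocks each contribute 2 × halo duplicated rows, so quadrupling the block size cuts the duplicated rows to roughly a quarter, at the cost of more shared memory per block.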
Another benefit of row-level parallelism is coalesced global memory access. Since the color Doppler image data are stored in vertical raster scan order, the addresses accessed by the threads in row-level parallelism are consecutive. With coalesced access, the latency of global memory access can be reduced significantly. A median filter algorithm based on row-level parallelism is proposed. It can be divided into two steps: window updating and median value calculation.

Fig. 6. Window updating of the pixel at row 2 and column 3.

Fig. 7. Validity flag table of the pixel at row 2 and column 3.

Fig. 8. Flow velocity images obtained from (a) CPU and (b) GPU.

B. Window updating

The window of each thread is a 25-element array, because the largest window in the adaptive-size median filter is 5×5. To ensure that the window is allocated in registers, it is implemented as 5 small arrays. The window updating is shown in Fig. 6. During window updating for the i-th column, only the 5 pixels from the (i-3)-th column are replaced by the 5 pixels from the (i+2)-th column, while the other 20 pixels remain unchanged. Hence, shared memory accesses and bank conflicts are reduced considerably.

C. Median value calculation

Before finding the median value, the filter size is selected by the criterion described in Section 2. We use a validity flag table to indicate whether an element is valid for the median value calculation, as shown in Fig. 7. For example, only the 9 corresponding elements in the 3×3 window are valid when the 3×3 median filter is applied. Otherwise, all elements are valid. A median selection algorithm based on incomplete selection sort is proposed. Incomplete selection sort has two advantages: it is simple to implement and it performs well for small windows. Therefore, we adopt the incomplete selection sort algorithm and adapt it to row-level parallelism. The proposed algorithm uses the validity flags instead of swapping elements, which avoids redundant comparisons of already sorted elements. Because no data are swapped, the proposed algorithm has two major benefits. First, the latency is reduced because memory accesses are reduced. Second, the proposed algorithm is compatible with window updating, since the data in the window remain unchanged during the median value calculation. The procedure of the proposed algorithm is described as follows.
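In code form, the window updating of B and the flag-based selection of C might look like the following Python sketch, which processes one image row the way a single row-level thread would. It is a CPU illustration under stated assumptions, not the authors' kernel: the function names, the circular slot indexing, the skipped image borders, and the `use_small` callback (a stand-in for the energy-based size criterion) are all hypothetical.

```python
def select_median(window, valid):
    """Median of the valid elements by incomplete selection sort.

    window is never modified; a found maximum is disabled in a copy of the
    flags instead of being swapped. The loop stops after half the valid
    elements plus one (5 of 9, or 13 of 25).
    """
    flags = list(valid)
    threshold = sum(flags) // 2 + 1
    maximum = None
    for _ in range(threshold):
        best = -1
        for i, ok in enumerate(flags):
            if ok and (best < 0 or window[i] > window[best]):
                best = i
        maximum = window[best]
        flags[best] = False            # mark as sorted, no data movement
    return maximum


def update_window(columns, image, row, i):
    """Replace the 5 pixels of column i-3 with those of column i+2."""
    slot = (i + 2) % 5                 # same slot as (i - 3) % 5
    columns[slot] = [image[row + dr][i + 2] for dr in range(-2, 3)]


def median_filter_row(image, row, use_small):
    """Filter columns 2 .. width-3 of one row (borders are skipped).

    use_small(col) -> True selects the 3x3 window, False the 5x5 window.
    """
    width = len(image[0])
    # columns[s] holds rows row-2..row+2 of the image column c with c % 5 == s.
    columns = [[image[row + dr][c] for dr in range(-2, 3)] for c in range(5)]
    out = []
    for i in range(2, width - 2):
        if i > 2:
            update_window(columns, image, row, i)
        window = [p for col in columns for p in col]
        small = use_small(i)
        valid = []
        for slot in range(5):
            c = next(c for c in range(i - 2, i + 3) if c % 5 == slot)
            for dr in range(-2, 3):
                inside_3x3 = abs(c - i) <= 1 and abs(dr) <= 1
                valid.append(inside_3x3 if small else True)
        out.append(select_median(window, valid))
    return out
```

Because the pixels stay in place, moving to the next column only overwrites one of the five small arrays, and the same stored window serves both the 3×3 and the 5×5 case through the flags alone.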
First, we find the maximum of the window by comparing the valid elements in the order of selection sort. Then, we record the maximum and disable the corresponding validity flag in the table, and the number of sorted elements is updated. After that, we repeat the procedure to find the maximum of the remaining valid elements. The median value is the maximum found when the number of sorted elements reaches the threshold, which is 5 for the 3×3 median filter and 13 for the 5×5 median filter. Finally, after the median filtering is completed, the data are written back from shared memory to global memory by coalesced access. With coalesced access, the latency of data transactions is minimized.

4. SIMULATION RESULTS AND COMPARISONS

The CPU-based software implementation runs on Windows 7 with an Intel i CPU running at 3.10 GHz coupled to 12 GB of RAM, and the GPU-based software implementation runs on an NVIDIA Tesla C2075 with 480 cores. Synthetic carotid artery data are used in the simulation; the data were generated by the Field II program [6] with simulation parameters of 1540 m/s sound speed, 5 MHz center frequency, 20 MHz sampling frequency, and 3.5 kHz pulse repetition rate. Note that throughout the color Doppler engine, all data are represented as single-precision floating-point numbers, since single-precision operations are optimized on the GPU. Fig. 8 presents the flow velocity images of the proposed color Doppler engine from the CPU and the GPU. The frame size is 30 scan lines with 3840 samples along each line. The SCRs of the CPU and GPU implementations are 8.23 dB and 8.06 dB, respectively. The execution times of the CPU-based and GPU-based implementations are compared in Table 2. The total execution time of the CPU-based implementation is 422 ms per frame on a desktop PC. In contrast, the execution time of the GPU-based implementation can be reduced to 7.7 ms.

Even with the extra time for data copy, the GPU-based implementation is about 54.8 times faster overall, and the frame rate is improved significantly from 2.4 fps to fps. Each kernel executes much faster in the proposed GPU-based implementation, especially kernel 2, which achieves times acceleration owing to its fully parallelized characteristic.

Table 2. The comparison of execution time between CPU and GPU.
Timing (ms) | Intel(R) Core(TM) i GHz | Tesla C2075 GPU | Speed up
Memory copy (CPU to GPU) | | 3.23 |
Kernel 1 | | | x
Kernel 2 | | | x
Kernel 3 | | | x
Memory copy (GPU to CPU) | | 1.23 |
Total | 422 | 7.7 | 54.8x

5. CONCLUSIONS

In this paper, we present the GPU-based implementation of our color Doppler engine. To adapt to the CUDA-based GPU platform, we examine the parallelism and data locality of our algorithms and modify them accordingly. The simulation results confirm that our color Doppler engine is feasible on the GPU platform. The overall acceleration is up to 54.8 times, with only a slight loss of image quality caused by the intrinsic error of GPU computations.

6. ACKNOWLEDGMENT

This work is supported by MOEA, Taiwan, under Grant no. 100-EC-17-A-19-S

REFERENCES

[1] Z.-L. Liu, C.-Z. Zhan, and A.-Y. (Andy) Wu, "Motion artifact elimination algorithm with eigen-based clutter filter for color Doppler processing," to appear in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May.
[2] C. Kasai, K. Namekawa, A. Koyano, and R. Omoto, "Real-time two-dimensional blood flow imaging using an autocorrelation technique," IEEE Trans. Sonics and Ultrasonics, vol. 32, no. 3, pp. 458-464, May.
[3] C.-Z. Zhan, K.-T. Chang, Y.-H. Chen, P.-C. Li, and A.-Y. Wu, "Motion-Tracking Adaptive Persistence and Adaptive-Size Median Filter for Color Doppler Processing in Ultrasonic Systems on Multi-core Platform," in Proc. IEEE Conference on Biomedical Circuits and Systems (BioCAS), pp. 58-61, Nov.
[4] F. Sattar, L. Floreby, G. Salomonsson, and B. Lovstrom, "Image enhancement based on a nonlinear multiscale method," IEEE Trans. Image Processing, vol. 6, June.
[5] C.-Z. Zhan, Y.-L. Chen, and A.-Y. Wu, "Iterative Superlinear-Convergence SVD Beamforming Algorithm and VLSI Architecture for MIMO-OFDM Systems," IEEE Trans. Signal Processing, vol. 60, June.
[6] J. A. Jensen, "Field: a program for simulating ultrasound systems," Medical and Biological Engineering & Computing, vol. 34.
[7] T.-H. Yu, S.-Y. Sun, C.-L. Ding, P.-C. Li, and A.-Y. Wu, "Reconfigurable color Doppler DSP engine for high-frequency ultrasonic imaging systems," in IEEE Workshop on Signal Processing Systems (SiPS), Oct.
[8] K.-T. Chang, C.-Z. Zhan, and A.-Y. Wu, "Joint-Decision Adaptive Clutter Filter and Motion-Tracking Adaptive Persistence for Color Doppler Processing in Ultrasonic Systems," in IEEE Workshop on Signal Processing Systems (SiPS), Oct.
[9] CUDA C Programming Guide, version 4.2, NVIDIA, 2012.
[10] S. Bjaerum, H. Torp, and K. Kristoffersen, "Clutter filters adapted to tissue motion in ultrasound color flow imaging," IEEE Trans. Ultrason. Ferroelectr. Freq. Control, vol. 49, no. 6.
