ACCELERATING MOTION-COMPENSATED ADAPTIVE COLOR DOPPLER ENGINE ON CUDA-BASED GPU PLATFORM

2013 IEEE Workshop on Signal Processing Systems

I-Hsuan Lee, Yu-Hao Chen, Nai-Shan Huang, and An-Yeu (Andy) Wu
Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, 106, Taiwan, R.O.C.
{Raylee, zerobigtree,

ABSTRACT
Color Doppler imaging is used to observe the blood flow distribution during a doctor's diagnosis. However, the desired blood signal is strongly affected by clutter noise and probe motion. In previous works, a color Doppler engine was proposed whose algorithms effectively eliminate clutter noise and motion artifacts. Because a large amount of data and computation is involved, the color Doppler engine cannot achieve real-time imaging on CPU-based PCs. Therefore, we accelerate the color Doppler engine by parallelizing its execution on a many-core GPU. In this work, we explore the parallelism and data locality of the motion-compensated color Doppler algorithms and implement them on a CUDA-based GPU platform. The proposed design methodology achieves a speedup of up to 54.8 times.

Index Terms: Color Doppler, CUDA, Ultrasound Imaging, GPGPU, Motion-compensated

1. INTRODUCTION
Color Doppler imaging is a well-established ultrasound mode that is valuable for observing the blood flow distribution in a specific region of interest [2]. Since the blood flow signal is very weak compared with clutter noise, suppressing clutter noise effectively is one of the major problems in color Doppler processing. In clinical examination, relative motion between the ultrasonic probe and the target region, such as motion caused by breathing, can severely degrade the image quality.

To suppress clutter noise with motion artifacts, an eigen-based clutter filter was proposed in [10], which adapts the passband and stopband of the clutter filter to the moving tissue. In the eigen-based clutter filter, the mean frequency of the dominant eigen-component is regarded as the center frequency of the clutter noise, so that clutter noise with tissue motion can be suppressed adaptively. However, the blood signal may be removed incorrectly by this method when the signal-to-clutter ratio (SCR) of the eigen-component distribution is high. In [1], a velocity bias cancellation algorithm was proposed to eliminate the motion artifacts and compensate for the biased flow velocity. To avoid incorrect suppression of the blood signal, a joint-decision clutter filter was proposed in [8]. By combining the two algorithms in [1] and [8], clutter noise with motion artifacts can be effectively suppressed while preserving most of the blood signal. Besides these two algorithms, we also incorporate adaptive-size median filtering [4] into the design. The proposed motion-compensated color Doppler engine improves the SCR by 3 to 9 dB and reduces the blood velocity error by over 69%.

Real-time imaging is the major advantage of ultrasound imaging compared with other medical diagnostic imaging methods such as MRI and CT. Unfortunately, the computational complexity of the proposed color Doppler engine is too high for real-time imaging on CPU-based desktop PCs. In this work, we explore the parallelism and data locality of the proposed color Doppler algorithms and use them to accelerate the computation by implementing the Doppler engine in the Compute Unified Device Architecture (CUDA) language [9]. CUDA is adopted for three reasons. First, CUDA is more flexible than other parallel programming models because it allows inter-thread and inter-block communication via different memories. Second, CUDA programs are compatible with different CUDA-capable GPUs, given proper segmentation of the parallel processing kernels. Third, since each kernel is called by the host as a subroutine, the color Doppler engine can easily be upgraded with new algorithms by adding a new kernel without modifying the original kernels. With the proposed parallelism and data locality of the color Doppler algorithms, execution is 54.8 times faster on the GPU than on the CPU.

This paper is organized as follows. The color Doppler engine is reviewed in Section 2. In Section 3, we present the GPU-based implementation of the color Doppler engine. The experimental results and comparisons are provided in Section 4. Finally, the conclusion is given in Section 5.

2. REVIEW OF MOTION-COMPENSATED COLOR DOPPLER ENGINE
As shown in Fig. 1, the color Doppler engine contains six functional blocks, each of which is described in this section.

A. Biased Velocity Estimation [1]
Since the tissue region is relatively stationary, the velocity bias of a frame is approximately the mean velocity of the tissue region. However, the mean velocity of the blood region is
much higher than that of the tissue region, so the mean velocity of a frame is not equal to the mean tissue velocity. To obtain the velocity bias, the energy is used as a weight to reduce the influence of the blood velocity. Thus, the velocity bias can be computed as follows:

$$v_{bias} = \frac{\sum_{m=1}^{M}\sum_{n=1}^{N} E_{m,n}\, v_{m,n}}{\sum_{m=1}^{M}\sum_{n=1}^{N} E_{m,n}}, \qquad (1)$$

where $M$ is the frame height and $N$ is the frame width. $E_{m,n}$ is the energy of the ensemble at location $(m, n)$, which is defined as

$$E_{m,n} = \sum_{k=0}^{K-1} z_{m,n}(kT)\, z_{m,n}^{*}(kT). \qquad (2)$$

$K$ is the ensemble size. $v_{m,n}$ is the velocity of the ensemble, which is obtained by

$$R_{m,n}(T) = \sum_{k=0}^{K-2} z_{m,n}((k+1)T)\, z_{m,n}^{*}(kT), \qquad (3)$$

$$v_{m,n} = \frac{c}{4\pi f_0 T}\, \angle R_{m,n}(T), \qquad (4)$$

where $c$ is the sound speed in the human body, $f_0$ is the center frequency of the ultrasound wave, $T$ is the pulse repetition interval, and $\angle R_{m,n}(T)$ represents the phase of $R_{m,n}(T)$. As discussed in [1], the velocity bias is used not only for motion compensation; it also helps adjust the frequency threshold of the joint-decision clutter filter to improve performance.

Fig. 1. Block diagram of the color Doppler processing.

B. Joint-decision Clutter Filter [8]
The joint-decision clutter filter [8] combines an eigen-based criterion with a frequency-based criterion. The eigen-based criterion removes clutter noise by an eigenvalue ratio threshold. Since the energy of a clutter eigen-component is larger than that of a blood eigen-component, an eigen-component is considered clutter noise when the eigenvalue ratio of the eigen-components exceeds the threshold. In the frequency-based criterion, an eigen-component is regarded as clutter noise when the difference between its velocity and the velocity bias is smaller than an adaptive threshold. In [1], we discovered a linear relation between the clutter noise spectrum width and the velocity bias. Hence, the frequency threshold is tuned adaptively by the velocity bias.

The joint-decision clutter filter removes an eigen-component only when both the eigen-based and the frequency-based criteria are satisfied. Hence, misjudgments are avoided and the loss of blood signal is minimized. To further accelerate the computation, we implement the most complex function, the eigenvalue decomposition, with the SL-SVD algorithm [5], which converges rapidly and has a low computational complexity. Furthermore, since the autocorrelation matrices of pixels are Hermitian, the matrix multiplications in kernel 2 can be simplified, which reduces the computational complexity by about 37.5%.

C. Adaptive Thresholding
The purpose of adaptive thresholding is to remove the noise that remains after clutter filtering, such as white noise. Since most of the clutter noise is suppressed by the clutter filter, a pixel is regarded as lying in the tissue region if the proportion of energy removed by clutter filtering exceeds a threshold, and its filtered signal is considered noise. Such a pixel is not displayed in the color flow image.

D. Flow Parameter Estimation [2]
We adopt the widely used autocorrelation technique [2] to calculate the flow parameters. For each Doppler ensemble received from a particular gate, the parameters are computed as listed in Table 1. After the flow velocity is calculated, the compensated velocity is obtained by subtracting the velocity bias from it directly. With this method, the error of the blood velocity is reduced by over 69% compared with conventional methods [1].

Table 1. Parameters and functions of the autocorrelation technique.

Parameter                Function
Lag-0 autocorrelation    $R(0) = \frac{1}{K}\sum_{k=0}^{K-1} z(kT)\, z^{*}(kT)$
Lag-1 autocorrelation    $R(T) = \frac{1}{K-1}\sum_{k=0}^{K-2} z((k+1)T)\, z^{*}(kT)$
Velocity                 $V = \frac{c}{4\pi f_0 T}\, \angle R(T)$
Variance                 $\sigma^2 = \frac{2}{T^2}\left(1 - \frac{|R(T)|}{R(0)}\right)$
Energy                   $E = R(0)$

E. Adaptive Persistence
Adaptive persistence temporally smooths successive frames using a pixel-by-pixel 2-tap IIR filter. A forgetting factor $\alpha$ in the range $0 \le \alpha < 1$, selected adaptively according to the condition of the target, is used to obtain the new average $\bar{Y}_{t}$ from the previous average $\bar{Y}_{t-1}$ and the new frame $X_{t}$:

$$\bar{Y}_{t} = \alpha\, \bar{Y}_{t-1} + (1 - \alpha)\, X_{t}. \qquad (5)$$
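As a rough illustration of the persistence recursion in (5), the sketch below blends the previous average with the new frame pixel by pixel. The function name `persistence_update` and the use of a single fixed `alpha` for the whole frame are assumptions made for brevity; the engine described here selects the forgetting factor adaptively.

```python
def persistence_update(prev_avg, new_frame, alpha):
    """One step of (5): new_avg = alpha * prev_avg + (1 - alpha) * new_frame.

    prev_avg and new_frame are 2-D lists of equal shape.
    A single fixed alpha is an assumption of this sketch.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    return [
        [alpha * p + (1.0 - alpha) * x for p, x in zip(prev_row, new_row)]
        for prev_row, new_row in zip(prev_avg, new_frame)
    ]


# Feed the same frame three times: the average converges toward it.
avg = [[0.0, 0.0]]
for frame in ([[1.0, 2.0]], [[1.0, 2.0]], [[1.0, 2.0]]):
    avg = persistence_update(avg, frame, alpha=0.5)
print(avg)  # [[0.875, 1.75]]
```

With `alpha` close to 1 the output changes slowly from frame to frame, while `alpha = 0` simply passes the new frame through.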

The forgetting factor $\alpha$ is larger under a stationary condition, so that the previous average contributes more to the new average and the image quality improves.

F. Adaptive-size Median Filter [4]
The median filter [4] is the most common technique for filtering salt-and-pepper noise. A large median filter suppresses noise better than a small one, but blurs the image more. Since a fixed-size median filter cannot deliver sharpness and noise suppression simultaneously, the adaptive-size median filter [3] was proposed to combine the advantages of large and small median filters. The window size of the adaptive-size median filter is selected based on the energy of the target pixel, because the energy of noise after clutter filtering and thresholding is much smaller than the energy of blood signals. A small median filter (3×3) is chosen in the blood region for better sharpness, while a large median filter (5×5) is selected in the tissue region for better noise suppression. To further improve the quality, the energy of the 8 pixels adjacent to the target is also examined, which improves the accuracy of judging the anatomical boundaries and the blood region.

Fig. 2. Flowchart of the proposed color Doppler engine implemented on the CUDA-based GPU platform.

3. MOTION-COMPENSATED COLOR DOPPLER ENGINE ON CUDA-BASED GPU PLATFORM
We implement the color Doppler engine on a CUDA-based GPU platform. The flowchart of the GPU-based implementation of the motion-compensated color Doppler engine is shown in Fig. 2, where N is the size of a frame and M is the number of rows in a frame. K1, K2, and K3 are the thread block sizes of kernel 1, kernel 2, and kernel 3 in Fig. 2, respectively. Since the algorithms of the color Doppler engine differ in parallelism and data locality, we divide the engine into three kernels. The first kernel is the biased velocity estimation. The second kernel includes four functions: the clutter filter, the thresholding, the flow parameter estimation, and the persistence. The third kernel is the adaptive-size median filter.

3.1. Kernel 1 (Biased velocity estimation)
According to the functions of the proposed biased velocity estimation algorithm, we divide kernel 1 into three stages. The first stage computes the energy and velocity of each pixel from (2), (3), and (4). The second and third stages then estimate the velocity bias from (1).

Fig. 3. Image segmentation and corresponding thread structure of pixel-level parallelism. (IB represents the image block and TK represents the K-th thread in a thread block.)

A. Pixel-level parallelism
Kernel 1 is segmented based on the parallelism of stage 1, since stage 1 has the highest computational complexity of all stages. The data of each pixel are independent of those of other pixels, so pixel-level parallelism is adopted. The image segmentation and the corresponding thread structure of pixel-level parallelism are shown in Fig. 3, where M is the number of blocks and K is the size of the blocks. In kernel 1, a color Doppler frame is divided into many image blocks. Each image block is processed by a corresponding thread block, and the thread blocks are processed in parallel on the GPU. Since the total number of threads equals the number of pixels in a frame, the remaining issue is how to choose the size of the thread blocks. To improve memory efficiency, the block size should be large enough to make good use of the registers and shared memory. Moreover, a large block size can increase parallelism and hide the latency of memory access.
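The three stages of kernel 1 can be sketched on the CPU as follows. This is a minimal Python sketch, not the CUDA kernel itself: stage 1 is an independent per-pixel computation of energy and velocity in the sense of (2) to (4), stage 2 is written as a pairwise sum in the spirit of the binary adder tree of Fig. 4, and stage 3 is the single division in (1). The function names, the unnormalized sums, and the synthetic two-pixel frame are assumptions for illustration; the sound speed, center frequency, and pulse repetition rate are the simulation values quoted in Section 4.

```python
import cmath
import math


def pixel_energy_velocity(z, c, f0, T):
    """Stage 1 for one pixel: energy and velocity of one complex ensemble z."""
    energy = sum((s * s.conjugate()).real for s in z)
    lag1 = sum(z[k + 1] * z[k].conjugate() for k in range(len(z) - 1))
    velocity = c / (4.0 * math.pi * f0 * T) * cmath.phase(lag1)
    return energy, velocity


def tree_sum(values):
    """Stage 2: pairwise (binary-tree) summation, about log2(n) levels."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)  # pad odd levels so every element has a partner
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0.0


def velocity_bias(frame, c, f0, T):
    """Stages 1 to 3: energy-weighted mean velocity over all pixels."""
    stats = [pixel_energy_velocity(z, c, f0, T) for row in frame for z in row]
    weighted = tree_sum(e * v for e, v in stats)
    total = tree_sum(e for e, _ in stats)
    return weighted / total  # stage 3: one division


def ensemble(amplitude, phase_step, size=8):
    """Synthetic ensemble with a constant phase step (illustrative data)."""
    return [amplitude * cmath.exp(1j * phase_step * k) for k in range(size)]


c, f0, T = 1540.0, 5e6, 1.0 / 3500.0
scale = c / (4.0 * math.pi * f0 * T)
# One strong, slow "tissue" pixel and one weak, fast "blood" pixel.
frame = [[ensemble(10.0, 0.1), ensemble(1.0, 1.0)]]
bias = velocity_bias(frame, c, f0, T)
print(bias / scale)  # about 0.109, close to the tissue phase step of 0.1
```

Because the weights are energies, the weak blood pixel barely shifts the estimate, which is the reason the text gives for using energy as the weighting.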

Fig. 4. Implementation of the summations based on a binary adder tree architecture.

B. Summations based on binary adder tree architecture
Owing to the pixel-level parallelism of kernel 1, the summations can be carried out in parallel following the architecture of a binary adder tree. The execution time of the summations is thereby reduced from $O(N)$ to $O(\log N)$, where $N$ is the number of pixels in a frame. Because of the segmentation of kernel 1, the summations are separated into inter-thread summations and inter-block summations. Organized as shown in Fig. 4, the summations avoid shared-memory bank conflicts, which reduces the execution time further. According to (1), stage 3 computes only a single division of the sum of weighted velocities by the total energy, so it can be handled by a single thread. Therefore, only the first thread in the first block is active and calculates the velocity bias in stage 3.

3.2. Kernel 2 (Clutter filter, adaptive thresholding, flow parameter estimation, and adaptive persistence)
We merge the algorithms with the same data dependency into kernel 2: the joint-decision adaptive clutter filter, adaptive thresholding, flow parameter estimation, and adaptive persistence. Because the data of each pixel are independent of those of other pixels in kernel 2, pixel-level parallelism is adopted. We separate kernel 2 from kernel 1 because of the difference in data locality. Unlike kernel 1, kernel 2 requires neither inter-thread nor inter-block communication. Thus, kernel 2 is fully parallelized on the GPU. The execution flow of kernel 2 is shown in Fig. 2. Because the kernel is fully parallel, all data is passed through registers, as shown in Fig. 2.
Since the output of each function is exactly the input of the next one in kernel 2, each thread runs without interruption from the joint-decision clutter filter to the adaptive persistence once the velocity bias and raw data have been loaded from global memory.

3.3. Kernel 3 (Adaptive-size median filter)
The adaptive-size median filter is implemented in kernel 3 because its data dependency differs from that of the algorithms in the prior kernels. The output of a median filter depends on several neighboring pixels of the target pixel. Conversely, each pixel is accessed several times by different filters. For example, each pixel is accessed 25 times with a 5×5 window. This data dependency leads to severe global memory conflicts and extremely high latency. Therefore, pixel-level parallelism is not suitable for kernel 3. In fact, the execution time of kernel 3 with pixel-level parallelism is about 30 ms, which is too slow for real-time imaging.

Fig. 5. Segmentation and data partition of kernel 3 in row-level parallelism.

A. Row-level parallelism
Adjacent pixels in the same row can reuse most of the data during median filtering. Hence, instead of pixel-level parallelism, we parallelize the adaptive-size median filter at the row level. In row-level parallelism, each row of a frame is processed by a corresponding thread. The segmentation and data partition in row-level parallelism are shown in Fig. 5, where M is the number of blocks in a frame and K is the size of the blocks. In this way, the number of memory accesses is reduced considerably and memory conflicts are decreased. Once the total number of threads is decided, the next issue is the block size. Because the amount of shared memory is limited, the block size depends on the width of the frame. Large blocks reduce memory overhead by reducing the number of rows that need to be duplicated between adjacent image blocks, which are marked by the yellow rectangles in Fig. 5.
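The block-size trade-off can be made concrete with a rough sizing relation. This is an illustrative sketch under stated assumptions rather than a formula from the paper: it assumes each thread block stages whole rows in shared memory, that the largest window is 5×5 so two extra rows are needed above and two below the block, and it introduces the hypothetical symbols $W$ (frame width in pixels), $b$ (bytes per pixel), and $S$ (shared memory available to one block). A block of $K$ rows then has to hold $K + 4$ rows:

```latex
(K + 4)\, W\, b \le S,
\qquad
\text{overhead}(K) = \frac{K + 4}{K} = 1 + \frac{4}{K}.
```

Under these assumptions, $K = 16$ loads 25% more rows than it produces, while $K = 64$ loads only 6.25% more, so the largest $K$ that still satisfies the shared-memory bound minimizes the duplicated rows.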
Another benefit of row-level parallelism is coalesced global memory access. Since the color Doppler image data is stored in vertical raster-scan order, the addresses accessed by the threads in row-level parallelism are consecutive. With coalesced access, the latency of global memory access is reduced significantly.

We propose a median filter algorithm based on row-level parallelism. It consists of two steps: window updating and median value calculation.
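The coalescing argument can be checked with a small address calculation. The sketch below assumes that "vertical raster-scan order" means column-major storage, that is, the pixels of one column are contiguous in memory, and that thread r handles row r; the helper names and the 8×30 frame are hypothetical.

```python
def column_major_address(row, col, num_rows):
    """Linear index of pixel (row, col) when each column is stored contiguously."""
    return col * num_rows + row


def row_major_address(row, col, num_cols):
    """Linear index of the same pixel when each row is stored contiguously."""
    return row * num_cols + col


num_rows, num_cols, col = 8, 30, 3

# Row-level parallelism: thread r reads pixel (r, col) at the same time step.
vertical = [column_major_address(r, col, num_rows) for r in range(num_rows)]
horizontal = [row_major_address(r, col, num_cols) for r in range(num_rows)]

print(vertical)    # [24, 25, 26, 27, 28, 29, 30, 31]
print(horizontal)  # [3, 33, 63, 93, 123, 153, 183, 213]
```

With the column-major layout the threads touch one contiguous run of addresses, which is the pattern that GPU global memory can serve in few transactions; with a row-major layout the same threads would be spread one full row apart.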

Fig. 6. Window updating.

Fig. 7. Validity flag table of the pixel at row 2 and column 3.

Fig. 8. Flow velocity images obtained from (a) CPU and (b) GPU.

B. Window updating
The window of each thread is a 25-element array, because the largest window in the adaptive-size median filter is 5×5. To ensure that the window is allocated in registers, it is implemented as 5 small arrays. The window updating is shown in Fig. 6. When the window moves to the i-th column, only the 5 pixels from the (i-3)-th column are replaced by the 5 pixels from the (i+2)-th column, while the other 20 pixels remain unchanged. Hence, shared memory accesses and bank conflicts are reduced considerably.

C. Median value calculation
Before the median value is found, the filter size is selected by the criterion described in Section 2. We use a validity flag table to indicate whether an element is valid for the median value calculation, as shown in Fig. 7. For example, only the 9 corresponding elements in the 3×3 window are valid when the 3×3 median filter is applied. Otherwise, all elements are valid.

We propose a median selection algorithm based on incomplete selection sort. Incomplete selection sort has two advantages: it is simple to implement and it performs well on small windows. We therefore adopt it and adapt it for row-level parallelism. Instead of swapping elements, the proposed algorithm uses the validity flags to avoid redundant comparisons with already sorted elements. Because no data is swapped, the proposed algorithm has two major benefits. First, the latency is reduced because there are fewer memory accesses. Second, the algorithm is compatible with window updating, since the data in the window remain unchanged during the median value calculation. The procedure of the proposed algorithm is described as follows.
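One possible reading of this selection procedure in code is given first, with the prose walk-through continuing after it. This is a hedged Python sketch, not the authors' kernel: the function name `median_by_flags` and the flat row-major 25-element window are assumptions (the text stores the window as 5 small arrays), while the stopping counts of 5 for the 3×3 filter and 13 for the 5×5 filter are the thresholds stated in the text.

```python
def median_by_flags(window, size):
    """Median of the size x size centre of a 5x5 window (25 values, row-major).

    Elements are never swapped; a validity flag marks what is still unsorted.
    """
    if len(window) != 25:
        raise ValueError("window must hold 25 values")
    if size == 5:
        valid = [True] * 25
        target = 13  # the 13th largest of 25 values is the median
    elif size == 3:
        # Only the inner 3x3 block (rows 1..3, columns 1..3) is valid.
        valid = [1 <= i // 5 <= 3 and 1 <= i % 5 <= 3 for i in range(25)]
        target = 5   # the 5th largest of 9 values is the median
    else:
        raise ValueError("size must be 3 or 5")

    current = None
    for _ in range(target):
        best = -1
        for i in range(25):
            if valid[i] and (best < 0 or window[i] > window[best]):
                best = i
        valid[best] = False      # retire the maximum without moving any data
        current = window[best]
    return current


window = [(7 * i) % 25 for i in range(25)]  # a permutation of 0..24
print(median_by_flags(window, 5))  # 12
print(median_by_flags(window, 3))  # 12
```

Since the window itself is read-only during selection, the same 25 values can be reused for the next column after replacing a single column of 5 pixels.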
First, we find the maximum of the window by comparing the valid elements in the order of selection sort. We then record the maximum, disable the corresponding validity flag in the table, and update the number of sorted elements. After that, we repeat the procedure to find the maximum of the remaining valid elements. The median value is the maximum found when the number of sorted elements reaches the threshold, which is 5 for the 3×3 median filter and 13 for the 5×5 median filter. Finally, after the median filtering is completed, the data is written back from shared memory to global memory by coalesced access, which minimizes the latency of the data transaction.

4. SIMULATION RESULTS AND COMPARISONS
The CPU-based software implementation runs on Windows 7 with an Intel Core CPU at 3.10 GHz and 12 GB of RAM, and the GPU-based software implementation runs on an NVIDIA Tesla C2075 with 448 cores. Synthetic carotid artery data are used in the simulation. The data were generated by the Field II program [6] with the following simulation parameters: 1540 m/s sound speed, 5 MHz center frequency, 20 MHz sampling frequency, and 3.5 kHz pulse repetition rate. Throughout the color Doppler engine, all data is represented in single-precision floating point, since single-precision operations are optimized on the GPU.

Fig. 8 presents the flow velocity images produced by the proposed color Doppler engine on the CPU and on the GPU. The frame size is 30 scan lines with 3840 samples along each line. The SCRs of the CPU and GPU implementations are 8.23 dB and 8.06 dB, respectively. The execution times of the CPU-based and GPU-based implementations are compared in Table 2. The total execution time of the CPU-based implementation is 422 ms per frame on a desktop PC. In contrast, the execution time of the GPU-based implementation is reduced to 7.7 ms.

Even with the extra time for data copy, the GPU-based implementation is about 54.8 times faster overall, and the frame rate improves significantly from 2.4 fps to about 130 fps (1000 ms / 7.7 ms). Each kernel executes much faster in the proposed GPU-based implementation. Kernel 2 in particular benefits from its fully parallelized structure.

Table 2. Comparison of execution time between CPU and GPU.

Timing (ms)                 Intel Core CPU (3.10 GHz)   Tesla C2075 GPU   Speedup
Memory copy (CPU to GPU)    -                           3.23              -
Kernel 1                    …                           …                 … x
Kernel 2                    …                           …                 … x
Kernel 3                    …                           …                 … x
Memory copy (GPU to CPU)    -                           1.23              -
Total                       422                         7.7               54.8x

5. CONCLUSIONS
In this paper, we present the GPU-based implementation of our color Doppler engine. To adapt the engine to the CUDA-based GPU platform, we examine the parallelism and data locality of our algorithms and modify them accordingly. The simulation results confirm that our color Doppler engine is feasible on the GPU platform. The overall acceleration is up to 54.8 times, with only a slight loss of image quality caused by the intrinsic error of GPU computations.

6. ACKNOWLEDGMENT
This work is supported by MOEA, Taiwan, under Grant no. 100-EC-17-A-19-S

7. REFERENCES
[1] Z.-L. Liu, C.-Z. Zhan, and A.-Y. (Andy) Wu, "Motion artifact elimination algorithm with eigen-based clutter filter for color Doppler processing," to appear in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May.
[2] C. Kasai, K. Namekawa, A. Koyano, and R. Omoto, "Real-time two-dimensional blood flow imaging using an autocorrelation technique," IEEE Trans. Sonics and Ultrasonics, vol. 32, no. 3, pp. 458-464, May.
[3] C.-Z. Zhan, K.-T. Chang, Y.-H. Chen, P.-C. Li, and A.-Y. Wu, "Motion-Tracking Adaptive Persistence and Adaptive-Size Median Filter for Color Doppler Processing in Ultrasonic Systems on Multi-core Platform," in Proc. IEEE Conference on Biomedical Circuits and Systems (BioCAS), pp. 58-61, Nov.
[4] F. Sattar, L. Floreby, G. Salomonsson, and B. Lovstrom, "Image enhancement based on a nonlinear multiscale method," IEEE Trans. Image Processing, vol. 6, June.
[5] C.-Z. Zhan, Y.-L. Chen, and A.-Y. Wu, "Iterative Superlinear-Convergence SVD Beamforming Algorithm and VLSI Architecture for MIMO-OFDM Systems," IEEE Trans. Signal Processing, vol. 60, June.
[6] J. A. Jensen, "Field: a program for simulating ultrasound systems," Medical and Biological Engineering & Computing, vol. 34.
[7] T.-H. Yu, S.-Y. Sun, C.-L. Ding, P.-C. Li, and A.-Y. Wu, "Reconfigurable color Doppler DSP engine for high-frequency ultrasonic imaging systems," in IEEE Workshop on Signal Processing Systems (SiPS), Oct.
[8] K.-T. Chang, C.-Z. Zhan, and A.-Y. Wu, "Joint-Decision Adaptive Clutter Filter and Motion-Tracking Adaptive Persistence for Color Doppler Processing in Ultrasonic Systems," in IEEE Workshop on Signal Processing Systems (SiPS), Oct.
[9] CUDA C Programming Guide, version 4.2, NVIDIA, 2012.
[10] S. Bjaerum, H. Torp, and K. Kristoffersen, "Clutter filters adapted to tissue motion in ultrasound color flow imaging," IEEE Trans. Ultrason. Ferroelectr. Freq. Control, vol. 49, no. 6.


More information

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES 6.1 INTRODUCTION The exploration of applications of ANN for image classification has yielded satisfactory results. But, the scope for improving

More information

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation 2009 Third International Conference on Multimedia and Ubiquitous Engineering A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation Yuan Li, Ning Han, Chen Chen Department of Automation,

More information

THE discrete multi-valued neuron was presented by N.

THE discrete multi-valued neuron was presented by N. Proceedings of International Joint Conference on Neural Networks, Dallas, Texas, USA, August 4-9, 2013 Multi-Valued Neuron with New Learning Schemes Shin-Fu Wu and Shie-Jue Lee Department of Electrical

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Scanner Parameter Estimation Using Bilevel Scans of Star Charts

Scanner Parameter Estimation Using Bilevel Scans of Star Charts ICDAR, Seattle WA September Scanner Parameter Estimation Using Bilevel Scans of Star Charts Elisa H. Barney Smith Electrical and Computer Engineering Department Boise State University, Boise, Idaho 8375

More information

Accelerating the acceleration search a case study. By Chris Laidler

Accelerating the acceleration search a case study. By Chris Laidler Accelerating the acceleration search a case study By Chris Laidler Optimization cycle Assess Test Parallelise Optimise Profile Identify the function or functions in which the application is spending most

More information

An Enhanced Mixed-Scaling-Rotation CORDIC algorithm with Weighted Amplifying Factor

An Enhanced Mixed-Scaling-Rotation CORDIC algorithm with Weighted Amplifying Factor SEAS-WP-2016-10-001 An Enhanced Mixed-Scaling-Rotation CORDIC algorithm with Weighted Amplifying Factor Jaina Mehta jaina.mehta@ahduni.edu.in Pratik Trivedi pratik.trivedi@ahduni.edu.in Serial: SEAS-WP-2016-10-001

More information

arxiv: v1 [physics.ins-det] 11 Jul 2015

arxiv: v1 [physics.ins-det] 11 Jul 2015 GPGPU for track finding in High Energy Physics arxiv:7.374v [physics.ins-det] Jul 5 L Rinaldi, M Belgiovine, R Di Sipio, A Gabrielli, M Negrini, F Semeria, A Sidoti, S A Tupputi 3, M Villa Bologna University

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

signal-to-noise ratio (PSNR), 2

signal-to-noise ratio (PSNR), 2 u m " The Integration in Optics, Mechanics, and Electronics of Digital Versatile Disc Systems (1/3) ---(IV) Digital Video and Audio Signal Processing ƒf NSC87-2218-E-009-036 86 8 1 --- 87 7 31 p m o This

More information

A New Configuration of Adaptive Arithmetic Model for Video Coding with 3D SPIHT

A New Configuration of Adaptive Arithmetic Model for Video Coding with 3D SPIHT A New Configuration of Adaptive Arithmetic Model for Video Coding with 3D SPIHT Wai Chong Chia, Li-Minn Ang, and Kah Phooi Seng Abstract The 3D Set Partitioning In Hierarchical Trees (SPIHT) is a video

More information

CT NOISE POWER SPECTRUM FOR FILTERED BACKPROJECTION AND ITERATIVE RECONSTRUCTION

CT NOISE POWER SPECTRUM FOR FILTERED BACKPROJECTION AND ITERATIVE RECONSTRUCTION CT NOISE POWER SPECTRUM FOR FILTERED BACKPROJECTION AND ITERATIVE RECONSTRUCTION Frank Dong, PhD, DABR Diagnostic Physicist, Imaging Institute Cleveland Clinic Foundation and Associate Professor of Radiology

More information

Design and Implementation of 3-D DWT for Video Processing Applications

Design and Implementation of 3-D DWT for Video Processing Applications Design and Implementation of 3-D DWT for Video Processing Applications P. Mohaniah 1, P. Sathyanarayana 2, A. S. Ram Kumar Reddy 3 & A. Vijayalakshmi 4 1 E.C.E, N.B.K.R.IST, Vidyanagar, 2 E.C.E, S.V University

More information

3D Registration based on Normalized Mutual Information

3D Registration based on Normalized Mutual Information 3D Registration based on Normalized Mutual Information Performance of CPU vs. GPU Implementation Florian Jung, Stefan Wesarg Interactive Graphics Systems Group (GRIS), TU Darmstadt, Germany stefan.wesarg@gris.tu-darmstadt.de

More information

An Approach for Real Time Moving Object Extraction based on Edge Region Determination

An Approach for Real Time Moving Object Extraction based on Edge Region Determination An Approach for Real Time Moving Object Extraction based on Edge Region Determination Sabrina Hoque Tuli Department of Computer Science and Engineering, Chittagong University of Engineering and Technology,

More information

Towards Breast Anatomy Simulation Using GPUs

Towards Breast Anatomy Simulation Using GPUs Towards Breast Anatomy Simulation Using GPUs Joseph H. Chui 1, David D. Pokrajac 2, Andrew D.A. Maidment 3, and Predrag R. Bakic 4 1 Department of Radiology, University of Pennsylvania, Philadelphia PA

More information

Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging

Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging 1 CS 9 Final Project Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging Feiyu Chen Department of Electrical Engineering ABSTRACT Subject motion is a significant

More information

A reversible data hiding based on adaptive prediction technique and histogram shifting

A reversible data hiding based on adaptive prediction technique and histogram shifting A reversible data hiding based on adaptive prediction technique and histogram shifting Rui Liu, Rongrong Ni, Yao Zhao Institute of Information Science Beijing Jiaotong University E-mail: rrni@bjtu.edu.cn

More information

implementation using GPU architecture is implemented only from the viewpoint of frame level parallel encoding [6]. However, it is obvious that the mot

implementation using GPU architecture is implemented only from the viewpoint of frame level parallel encoding [6]. However, it is obvious that the mot Parallel Implementation Algorithm of Motion Estimation for GPU Applications by Tian Song 1,2*, Masashi Koshino 2, Yuya Matsunohana 2 and Takashi Shimamoto 1,2 Abstract The video coding standard H.264/AVC

More information

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors Brent Bohnenstiehl and Bevan Baas Department of Electrical and Computer Engineering University of California, Davis {bvbohnen,

More information

Block-Based Connected-Component Labeling Algorithm Using Binary Decision Trees

Block-Based Connected-Component Labeling Algorithm Using Binary Decision Trees Sensors 2015, 15, 23763-23787; doi:10.3390/s150923763 Article OPEN ACCESS sensors ISSN 1424-8220 www.mdpi.com/journal/sensors Block-Based Connected-Component Labeling Algorithm Using Binary Decision Trees

More information

LOW COMPLEXITY SUBBAND ANALYSIS USING QUADRATURE MIRROR FILTERS

LOW COMPLEXITY SUBBAND ANALYSIS USING QUADRATURE MIRROR FILTERS LOW COMPLEXITY SUBBAND ANALYSIS USING QUADRATURE MIRROR FILTERS Aditya Chopra, William Reid National Instruments Austin, TX, USA Brian L. Evans The University of Texas at Austin Electrical and Computer

More information

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Image Denoising Based on Hybrid Fourier and Neighborhood Wavelet Coefficients Jun Cheng, Songli Lei

Image Denoising Based on Hybrid Fourier and Neighborhood Wavelet Coefficients Jun Cheng, Songli Lei Image Denoising Based on Hybrid Fourier and Neighborhood Wavelet Coefficients Jun Cheng, Songli Lei College of Physical and Information Science, Hunan Normal University, Changsha, China Hunan Art Professional

More information

VIDEO DENOISING BASED ON ADAPTIVE TEMPORAL AVERAGING

VIDEO DENOISING BASED ON ADAPTIVE TEMPORAL AVERAGING Engineering Review Vol. 32, Issue 2, 64-69, 2012. 64 VIDEO DENOISING BASED ON ADAPTIVE TEMPORAL AVERAGING David BARTOVČAK Miroslav VRANKIĆ Abstract: This paper proposes a video denoising algorithm based

More information

RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE

RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE K. Kaviya Selvi 1 and R. S. Sabeenian 2 1 Department of Electronics and Communication Engineering, Communication Systems, Sona College

More information

A Quantitative Approach for Textural Image Segmentation with Median Filter

A Quantitative Approach for Textural Image Segmentation with Median Filter International Journal of Advancements in Research & Technology, Volume 2, Issue 4, April-2013 1 179 A Quantitative Approach for Textural Image Segmentation with Median Filter Dr. D. Pugazhenthi 1, Priya

More information

Research Subject. Dynamics Computation and Behavior Capture of Human Figures (Nakamura Group)

Research Subject. Dynamics Computation and Behavior Capture of Human Figures (Nakamura Group) Research Subject Dynamics Computation and Behavior Capture of Human Figures (Nakamura Group) (1) Goal and summary Introduction Humanoid has less actuators than its movable degrees of freedom (DOF) which

More information

ESTIMATION OF THE BLOOD VELOCITY SPECTRUM USING A RECURSIVE LATTICE FILTER

ESTIMATION OF THE BLOOD VELOCITY SPECTRUM USING A RECURSIVE LATTICE FILTER Jørgen Arendt Jensen et al. Paper presented at the IEEE International Ultrasonics Symposium, San Antonio, Texas, 996: ESTIMATION OF THE BLOOD VELOCITY SPECTRUM USING A RECURSIVE LATTICE FILTER Jørgen Arendt

More information

ADVANCED IMAGE PROCESSING METHODS FOR ULTRASONIC NDE RESEARCH C. H. Chen, University of Massachusetts Dartmouth, N.

ADVANCED IMAGE PROCESSING METHODS FOR ULTRASONIC NDE RESEARCH C. H. Chen, University of Massachusetts Dartmouth, N. ADVANCED IMAGE PROCESSING METHODS FOR ULTRASONIC NDE RESEARCH C. H. Chen, University of Massachusetts Dartmouth, N. Dartmouth, MA USA Abstract: The significant progress in ultrasonic NDE systems has now

More information

A MULTIPOINT VIDEOCONFERENCE RECEIVER BASED ON MPEG-4 OBJECT VIDEO. Chih-Kai Chien, Chen-Yu Tsai, and David W. Lin

A MULTIPOINT VIDEOCONFERENCE RECEIVER BASED ON MPEG-4 OBJECT VIDEO. Chih-Kai Chien, Chen-Yu Tsai, and David W. Lin A MULTIPOINT VIDEOCONFERENCE RECEIVER BASED ON MPEG-4 OBJECT VIDEO Chih-Kai Chien, Chen-Yu Tsai, and David W. Lin Dept. of Electronics Engineering and Center for Telecommunications Research National Chiao

More information

A ROBUST LONE DIAGONAL SORTING ALGORITHM FOR DENOISING OF IMAGES WITH SALT AND PEPPER NOISE

A ROBUST LONE DIAGONAL SORTING ALGORITHM FOR DENOISING OF IMAGES WITH SALT AND PEPPER NOISE International Journal of Computational Intelligence & Telecommunication Systems, 2(1), 2011, pp. 33-38 A ROBUST LONE DIAGONAL SORTING ALGORITHM FOR DENOISING OF IMAGES WITH SALT AND PEPPER NOISE Rajamani.

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Parallel graph traversal for FPGA

Parallel graph traversal for FPGA LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

A Fourier Extension Based Algorithm for Impulse Noise Removal

A Fourier Extension Based Algorithm for Impulse Noise Removal A Fourier Extension Based Algorithm for Impulse Noise Removal H. Sahoolizadeh, R. Rajabioun *, M. Zeinali Abstract In this paper a novel Fourier extension based algorithm is introduced which is able to

More information

Design of Navel Adaptive TDBLMS-based Wiener Parallel to TDBLMS Algorithm for Image Noise Cancellation

Design of Navel Adaptive TDBLMS-based Wiener Parallel to TDBLMS Algorithm for Image Noise Cancellation Design of Navel Adaptive TDBLMS-based Wiener Parallel to TDBLMS Algorithm for Image Noise Cancellation Dinesh Yadav 1, Ajay Boyat 2 1,2 Department of Electronics and Communication Medi-caps Institute of

More information

Image denoising in the wavelet domain using Improved Neigh-shrink

Image denoising in the wavelet domain using Improved Neigh-shrink Image denoising in the wavelet domain using Improved Neigh-shrink Rahim Kamran 1, Mehdi Nasri, Hossein Nezamabadi-pour 3, Saeid Saryazdi 4 1 Rahimkamran008@gmail.com nasri_me@yahoo.com 3 nezam@uk.ac.ir

More information

A Robust Color Image Watermarking Using Maximum Wavelet-Tree Difference Scheme

A Robust Color Image Watermarking Using Maximum Wavelet-Tree Difference Scheme A Robust Color Image Watermarking Using Maximum Wavelet-Tree ifference Scheme Chung-Yen Su 1 and Yen-Lin Chen 1 1 epartment of Applied Electronics Technology, National Taiwan Normal University, Taipei,

More information

MRT based Fixed Block size Transform Coding

MRT based Fixed Block size Transform Coding 3 MRT based Fixed Block size Transform Coding Contents 3.1 Transform Coding..64 3.1.1 Transform Selection...65 3.1.2 Sub-image size selection... 66 3.1.3 Bit Allocation.....67 3.2 Transform coding using

More information

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST SAKTHIVEL Assistant Professor, Department of ECE, Coimbatore Institute of Engineering and Technology Abstract- FPGA is

More information

Multiframe Blocking-Artifact Reduction for Transform-Coded Video

Multiframe Blocking-Artifact Reduction for Transform-Coded Video 276 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 12, NO. 4, APRIL 2002 Multiframe Blocking-Artifact Reduction for Transform-Coded Video Bahadir K. Gunturk, Yucel Altunbasak, and

More information

An adaptive container code character segmentation algorithm Yajie Zhu1, a, Chenglong Liang2, b

An adaptive container code character segmentation algorithm Yajie Zhu1, a, Chenglong Liang2, b 6th International Conference on Machinery, Materials, Environment, Biotechnology and Computer (MMEBC 2016) An adaptive container code character segmentation algorithm Yajie Zhu1, a, Chenglong Liang2, b

More information

AN HARDWARE ALGORITHM FOR REAL TIME IMAGE IDENTIFICATION 1

AN HARDWARE ALGORITHM FOR REAL TIME IMAGE IDENTIFICATION 1 730 AN HARDWARE ALGORITHM FOR REAL TIME IMAGE IDENTIFICATION 1 BHUVANESH KUMAR HALAN, 2 MANIKANDABABU.C.S 1 ME VLSI DESIGN Student, SRI RAMAKRISHNA ENGINEERING COLLEGE, COIMBATORE, India (Member of IEEE)

More information

High Quality DXT Compression using OpenCL for CUDA. Ignacio Castaño

High Quality DXT Compression using OpenCL for CUDA. Ignacio Castaño High Quality DXT Compression using OpenCL for CUDA Ignacio Castaño icastano@nvidia.com March 2009 Document Change History Version Date Responsible Reason for Change 0.1 02/01/2007 Ignacio Castaño First

More information

Dual-Mode Low-Complexity Codebook Searching Algorithm and VLSI Architecture for LTE/LTE-Advanced Systems

Dual-Mode Low-Complexity Codebook Searching Algorithm and VLSI Architecture for LTE/LTE-Advanced Systems IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 61, NO 14, JULY 15, 2013 3545 Dual-Mode Low-Complexity Codebook Searching Algorithm and VLSI Architecture LTE/LTE-Advanced Systems Yi-Hsuan Lin, Yu-Hao Chen,

More information

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation Optimizing the Deblocking Algorithm for H.264 Decoder Implementation Ken Kin-Hung Lam Abstract In the emerging H.264 video coding standard, a deblocking/loop filter is required for improving the visual

More information

Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference

Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference 2012 Waters Corporation 1 Agenda Overview of LC/IMS/MS 3D Data Processing 4D Data Processing

More information

Near Optimal Repair Rate Built-in Redundancy Analysis with Very Small Hardware Overhead

Near Optimal Repair Rate Built-in Redundancy Analysis with Very Small Hardware Overhead Near Optimal Repair Rate Built-in Redundancy Analysis with Very Small Hardware Overhead Woosung Lee, Keewon Cho, Jooyoung Kim, and Sungho Kang Department of Electrical & Electronic Engineering, Yonsei

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

Improving the efficiency of Medical Image Segmentation based on Histogram Analysis

Improving the efficiency of Medical Image Segmentation based on Histogram Analysis Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 1 (2017) pp. 91-101 Research India Publications http://www.ripublication.com Improving the efficiency of Medical Image

More information

IMPLEMENTATION OF THE CONTRAST ENHANCEMENT AND WEIGHTED GUIDED IMAGE FILTERING ALGORITHM FOR EDGE PRESERVATION FOR BETTER PERCEPTION

IMPLEMENTATION OF THE CONTRAST ENHANCEMENT AND WEIGHTED GUIDED IMAGE FILTERING ALGORITHM FOR EDGE PRESERVATION FOR BETTER PERCEPTION IMPLEMENTATION OF THE CONTRAST ENHANCEMENT AND WEIGHTED GUIDED IMAGE FILTERING ALGORITHM FOR EDGE PRESERVATION FOR BETTER PERCEPTION Chiruvella Suresh Assistant professor, Department of Electronics & Communication

More information

Image Segmentation Based on Watershed and Edge Detection Techniques

Image Segmentation Based on Watershed and Edge Detection Techniques 0 The International Arab Journal of Information Technology, Vol., No., April 00 Image Segmentation Based on Watershed and Edge Detection Techniques Nassir Salman Computer Science Department, Zarqa Private

More information

Hardware Acceleration of Edge Detection Algorithm on FPGAs

Hardware Acceleration of Edge Detection Algorithm on FPGAs Hardware Acceleration of Edge Detection Algorithm on FPGAs Muthukumar Venkatesan and Daggu Venkateshwar Rao Department of Electrical and Computer Engineering University of Nevada Las Vegas. Las Vegas NV

More information

Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE Gaurav Hansda

Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE Gaurav Hansda Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE 5359 Gaurav Hansda 1000721849 gaurav.hansda@mavs.uta.edu Outline Introduction to H.264 Current algorithms for

More information

CONTENT ADAPTIVE SCREEN IMAGE SCALING

CONTENT ADAPTIVE SCREEN IMAGE SCALING CONTENT ADAPTIVE SCREEN IMAGE SCALING Yao Zhai (*), Qifei Wang, Yan Lu, Shipeng Li University of Science and Technology of China, Hefei, Anhui, 37, China Microsoft Research, Beijing, 8, China ABSTRACT

More information

SIMULATING ARBITRARY-GEOMETRY ULTRASOUND TRANSDUCERS USING TRIANGLES

SIMULATING ARBITRARY-GEOMETRY ULTRASOUND TRANSDUCERS USING TRIANGLES Jørgen Arendt Jensen 1 Paper presented at the IEEE International Ultrasonics Symposium, San Antonio, Texas, 1996: SIMULATING ARBITRARY-GEOMETRY ULTRASOUND TRANSDUCERS USING TRIANGLES Jørgen Arendt Jensen,

More information

COMPUTATIONAL OPTIMIZATION OF A TIME-DOMAIN BEAMFORMING ALGORITHM USING CPU AND GPU

COMPUTATIONAL OPTIMIZATION OF A TIME-DOMAIN BEAMFORMING ALGORITHM USING CPU AND GPU BeBeC-214-9 COMPUTATIONAL OPTIMIZATION OF A TIME-DOMAIN BEAMFORMING ALGORITHM USING CPU AND GPU Johannes Stier, Christopher Hahn, Gero Zechel and Michael Beitelschmidt Technische Universität Dresden, Institute

More information

IMPROVED RHOMBUS INTERPOLATION FOR REVERSIBLE WATERMARKING BY DIFFERENCE EXPANSION. Catalin Dragoi, Dinu Coltuc

IMPROVED RHOMBUS INTERPOLATION FOR REVERSIBLE WATERMARKING BY DIFFERENCE EXPANSION. Catalin Dragoi, Dinu Coltuc 0th European Signal Processing Conference (EUSIPCO 01) Bucharest, Romania, August 7-31, 01 IMPROVED RHOMBUS INTERPOLATION FOR REVERSIBLE WATERMARKING BY DIFFERENCE EXPANSION Catalin Dragoi, Dinu Coltuc

More information

Image Error Concealment Based on Watermarking

Image Error Concealment Based on Watermarking Image Error Concealment Based on Watermarking Shinfeng D. Lin, Shih-Chieh Shie and Jie-Wei Chen Department of Computer Science and Information Engineering,National Dong Hwa Universuty, Hualien, Taiwan,

More information

Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs

Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs Kaixi Hou, Hao Wang, Wu chun Feng {kaixihou, hwang121, wfeng}@vt.edu Jeffrey S. Vetter, Seyong Lee vetter@computer.org, lees2@ornl.gov

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

Wavelet Based Image Compression Using ROI SPIHT Coding

Wavelet Based Image Compression Using ROI SPIHT Coding International Journal of Information & Computation Technology. ISSN 0974-2255 Volume 1, Number 2 (2011), pp. 69-76 International Research Publications House http://www.irphouse.com Wavelet Based Image

More information

Robust Lossless Image Watermarking in Integer Wavelet Domain using SVD

Robust Lossless Image Watermarking in Integer Wavelet Domain using SVD Robust Lossless Image Watermarking in Integer Domain using SVD 1 A. Kala 1 PG scholar, Department of CSE, Sri Venkateswara College of Engineering, Chennai 1 akala@svce.ac.in 2 K. haiyalnayaki 2 Associate

More information

A MORPHOLOGY-BASED FILTER STRUCTURE FOR EDGE-ENHANCING SMOOTHING

A MORPHOLOGY-BASED FILTER STRUCTURE FOR EDGE-ENHANCING SMOOTHING Proceedings of the 1994 IEEE International Conference on Image Processing (ICIP-94), pp. 530-534. (Austin, Texas, 13-16 November 1994.) A MORPHOLOGY-BASED FILTER STRUCTURE FOR EDGE-ENHANCING SMOOTHING

More information

Implementing a Speech Recognition System on a GPU using CUDA. Presented by Omid Talakoub Astrid Yi

Implementing a Speech Recognition System on a GPU using CUDA. Presented by Omid Talakoub Astrid Yi Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi Outline Background Motivation Speech recognition algorithm Implementation steps GPU implementation strategies

More information

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance

More information

PFAC Library: GPU-Based String Matching Algorithm

PFAC Library: GPU-Based String Matching Algorithm PFAC Library: GPU-Based String Matching Algorithm Cheng-Hung Lin Lung-Sheng Chien Chen-Hsiung Liu Shih-Chieh Chang Wing-Kai Hon National Taiwan Normal University, Taipei, Taiwan National Tsing-Hua University,

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information