SEPARABLE FIR FILTERING IN FPGA AND GPU IMPLEMENTATIONS: ENERGY, PERFORMANCE, AND ACCURACY CONSIDERATIONS

International Conference on Field Programmable Logic and Applications (FPL 2011)

SEPARABLE FIR FILTERING IN FPGA AND GPU IMPLEMENTATIONS: ENERGY, PERFORMANCE, AND ACCURACY CONSIDERATIONS

Daniel Llamocca, Cesar Carranza*, and Marios Pattichis
Electrical and Computer Engineering Department, The University of New Mexico, Albuquerque, NM, 87131, USA
*Sección Electricidad y Electrónica, Pontificia Universidad Católica del Perú, Lima 32, Perú
dllamocca@ieee.org, acarran@pucp.edu.pe, pattichis@ece.unm.edu

ABSTRACT

Digital video processing requires significant hardware resources to achieve acceptable performance. Digital video processing based on dynamic partial reconfiguration (DPR) allows designers to control resources based on energy, performance, and accuracy considerations. In this paper, we present a dynamically reconfigurable implementation of a 2D FIR filter in which the number of coefficients and the coefficient values can be varied to meet energy, performance, and precision requirements. We also present a high-performance GPU implementation to help understand the trade-offs between these two technologies. Results using a standard example of a 2D Difference of Gaussians (DoG) filter indicate that the DPR implementation can deliver real-time performance with an energy-per-frame consumption that is an order of magnitude lower than that of the GPU. On the other hand, at significantly higher energy consumption levels, the GPU implementation can deliver very high performance.

1. INTRODUCTION

Hardware implementations of digital video processing methods are of great interest because of their ubiquitous applications. In terms of performance acceleration, many image and video processing algorithms require efficient implementations of 2D FIR filters [1]. Dynamic partial reconfiguration (DPR) allows FPGA designers to explore different implementations based on energy, performance, and accuracy requirements. In addition to DPR, efficient filter implementations are based on the direct use of LUTs [2], the distributed arithmetic technique [3], and separable designs [4, 5]. On the other hand, Graphics Processing Units (GPUs) offer high-performance floating-point capabilities at significant energy consumption levels [6]. With the introduction of OpenCL and CUDA (Compute Unified Device Architecture), there has been significant growth in GPU implementations, with [7] and [8] as examples of 2D FIR implementations.

In this paper, we are interested in exploring the energy, performance, and accuracy trade-offs between DPR FPGA implementations and the corresponding GPU implementations. Some trade-offs have been explored in [8] and [9]. Our goal is to provide recommendations for different implementations based on specific energy and performance requirements.

The paper is organized as follows: Section 2 describes the embedded filter implementation on the FPGA. Section 3 details the GPU filter implementation. Section 4 explains the measurement setup for both implementations. Section 5 presents the results in terms of energy, performance, and accuracy. Finally, Section 6 summarizes the paper.

2. 2D FIR FILTER SYSTEM ON THE FPGA

We consider a 2D separable filtering implementation that is an extension of the prior work presented in [3] and [4]. Here, we extend prior research to allow DPR of the entire FIR core, its FSL (Fast Simplex Link) bus interface, and the Partial Reconfiguration Region (PRR) control interface.

2.1. System Architecture

Figure 1 depicts the block diagram of the embedded system.
The 1D FIR Filter processor core and the PowerPC (PPC) interact via the Fast Simplex Link (FSL) bus. The PRR is reconfigured via the Internal Configuration Access Port (ICAP). The Compact Flash (CF) card holds the partial bitstreams and the input data. Bus macros are no longer needed in the Xilinx ISE 12.2 Partial Reconfiguration tools.

In the context of the embedded system of Fig. 1, a 2D separable filter is realized by i) filtering the rows, ii) turning the row filter into a column filter via DPR, and iii) filtering the columns. Figure 2 depicts this scheme, in which the 2D filter can be modified at run-time by using a different pair of row and column filters. The filtered images (row- and column-wise) are stored in the DDR RAM. The row-wise filtered image is transposed in memory before it is streamed to the column filter (the separability identity behind this two-pass scheme is recalled below). The system is implemented on the ML405 Xilinx development board, which houses an XC4VFX20 Virtex-4 FPGA. The PPC is clocked at 300 MHz and the peripherals at 100 MHz.
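The reason the row-then-column flow above computes the full 2D filtering is the standard separability identity, recalled here in notation of my choosing (it is not reproduced from the paper): when the 2D kernel factors into the outer product of a column filter and a row filter, the 2D convolution splits into two 1D passes, reducing the work per output pixel from K^2 to 2K multiply-accumulates.

```latex
% Separable 2D FIR filtering: if h[m,n] = h_c[m] h_r[n] (outer product),
% the 2D convolution factors into a row pass followed by a column pass.
\[
  h[m,n] = h_c[m]\, h_r[n]
  \;\Longrightarrow\;
  y[m,n] = \sum_{k}\sum_{l} h[k,l]\, x[m-k,\, n-l]
         = \sum_{k} h_c[k] \underbrace{\Big(\sum_{l} h_r[l]\, x[m-k,\, n-l]\Big)}_{\text{row-filtered image}} .
\]
```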

Figure 1. System Block Diagram (figure labels: PLB, PPC, FSL, DPR control, FIR Filter processor with its FSL interface inside the PRR, ICAP core and ICAP port, System ACE, Ethernet MAC, DDR RAM, CF card).

2.2. 1D FIR Filter Core

This fixed-point core is based on the one presented in [3]. We allow for full reconfiguration, i.e., the entire filter is included in the PRR. Several modifications are introduced. First, a new parameter B allows the specification of the input data bit-width, which is different from the coefficient bit-width. Second, the frame size determines the input length of both the row and the column filter; the input length is a parameter of the Finite State Machine (FSM) that controls the FSL interface, and as a result the FSL interface has to be included in the PRR (see Fig. 1). Third, an interface that disables the PRR outputs during reconfiguration is required, since the PRR outputs now include FSL interface signals (shown in Fig. 1). Here, the 2D filter requires 2 bitstreams: one for the row filter and one for the column filter. The PRR must accommodate the largest filter.

3. FILTER IMPLEMENTATION ON THE GPU

We consider a parallel FIR algorithm implementation in the CUDA environment [10]. Here, parallelism is achieved by a grid that consists of blocks, with each block having a number of threads. All threads within a block run in parallel from the software perspective. The actual number of blocks that can run in parallel is bounded by the number of streaming multiprocessors (SMs); here, we can run a single block on each SM. Also, the number of threads that can run in parallel on each SM is given by the number of CUDA cores inside each SM. For the purposes of this paper, we will report energy and performance measurements on the GPU (termed the device) as opposed to the CPU (termed the host). Here, GPU memory is divided into global memory, shared memory, constant memory, and texture memory.

Figure 2. 2D separable FIR filter implementation (per-frame flow: 1. the frame is streamed through the row filter; 2. the row filter is replaced by the column filter via DPR; 3. the row-by-row processed frame is streamed through the column filter; 4. the column filter is replaced by the row filter via DPR; 5. go to Step 1 to process a new frame).

The algorithm exposes and exploits the parallelism of the 2D FIR filter in order to obtain significant speed-up gains. It is based on ideas presented in [7]. Double precision (64 bits) is utilized, and the symmetry and separability of the filter are taken advantage of. We summarize the algorithm steps below:
1. The image and the filter kernel are transferred from the host to the device (global memory).
2. The image is then divided into blocks. Each image block is row-filtered by a thread block.
3. The row-filtered image is also divided into blocks. Each image block is column-filtered by a thread block.
4. The final filtered image is transferred back to the host.
To further describe the algorithm, we let the input image be of size HxW (H rows by W columns) and the filter kernel be of size KxK (row and column filters of the same length). We refer to [7] for more details on the separable implementation. Performance is achieved based on: i) loop unrolling, ii) storing image blocks in shared memory, and iii) storing the filtering coefficients in constant memory. Each image block is processed as follows. It is first loaded into shared memory (with K/2 extra pixels on both sides for correct filtering).
Then, for row filtering, each thread inside a block performs a point-wise multiplication between the row kernel and a row portion of the image, and then adds up the products to produce an output pixel. This process continues until the filtered image block is obtained.
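To make the row-filtering pass concrete, below is a minimal, self-contained CUDA sketch of a tiled row convolution that follows the ingredients described above: coefficients in constant memory, an image tile plus a K/2-pixel apron on each side in shared memory, and one thread per output pixel. This is not the authors' kernel; the names, tile sizes, zero-padded borders, and the omission of symmetry exploitation and loop unrolling are simplifying assumptions.

```cuda
// Minimal row-convolution sketch (not the paper's implementation).
// Assumptions: zero padding at borders, odd kernel length K <= MAX_K,
// one thread per output pixel, TILE_W threads per block in x.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define MAX_K  33     // maximum supported kernel length (assumption)
#define TILE_W 64     // threads per block in x (must be >= K/2)
#define TILE_H 4      // rows processed per block

__constant__ float d_coeffs[MAX_K];   // filter coefficients in constant memory

__global__ void rowConvolve(const float* in, float* out, int W, int H, int K)
{
    // Shared tile with a K/2-pixel apron on the left and on the right.
    __shared__ float tile[TILE_H][TILE_W + MAX_K - 1];

    const int halo = K / 2;
    const int x = blockIdx.x * TILE_W + threadIdx.x;   // output column
    const int y = blockIdx.y * TILE_H + threadIdx.y;   // output row
    const bool rowValid = (y < H);

    // Cooperatively load the tile row plus its apron (zero-padded outside the image).
    for (int i = threadIdx.x; i < TILE_W + 2 * halo; i += TILE_W) {
        int gx = blockIdx.x * TILE_W + i - halo;       // global column of this load
        tile[threadIdx.y][i] = (rowValid && gx >= 0 && gx < W) ? in[y * W + gx] : 0.0f;
    }
    __syncthreads();
    if (!rowValid || x >= W) return;

    // Point-wise multiply-accumulate between the row kernel and the image row.
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += d_coeffs[k] * tile[threadIdx.y][threadIdx.x + k];
    out[y * W + x] = acc;
}

int main()
{
    const int W = 640, H = 480, K = 9;                 // toy sizes (assumptions)
    std::vector<float> h_in(W * H, 1.0f), h_out(W * H), h_k(K, 1.0f / K);

    float *d_in, *d_out;
    cudaMalloc(&d_in,  W * H * sizeof(float));
    cudaMalloc(&d_out, W * H * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), W * H * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_coeffs, h_k.data(), K * sizeof(float));

    dim3 block(TILE_W, TILE_H);
    dim3 grid((W + TILE_W - 1) / TILE_W, (H + TILE_H - 1) / TILE_H);
    rowConvolve<<<grid, block>>>(d_in, d_out, W, H, K);
    cudaMemcpy(h_out.data(), d_out, W * H * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[0] = %f\n", h_out[0]);                 // quick sanity check
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

In the implementation described in the text, the same pattern is applied a second time along the columns of the row-filtered image (with the roles of the x and y block dimensions exchanged), which is what avoids transposing the entire image.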

Figure 3. Thread block configuration for row filtering (figure labels: image of size HxW divided into image blocks; each image block, with extra pixels along the x dimension, is held in shared memory; the K/2 coefficients reside in constant memory; thread block dimensions in x and y).

Figure 3 shows the setup of a thread block for row filtering. Since all thread blocks work concurrently (from the software perspective), we are left with the row-filtered image in global memory at the end of the previous process. This image (in blocks) is loaded again into shared memory, this time to perform column filtering. A thread block does not transpose the column-ordered data, since the image block is small and it is not worth the effort. Thus, this division of the image into blocks effectively avoids transposing the entire image prior to column filtering. For row processing, the dimension of the thread block in the x direction must be greater than or equal to K/2 (the effective size of the kernel); for the dimension in the y direction, any power of 2 is suitable as long as H is a multiple of it. For column processing, the dimension of the thread block in the y direction must be greater than or equal to K/2; for the dimension in the x direction, any power of 2 is suitable as long as W is a multiple of it.

The device utilized is an NVIDIA GeForce GTX 465 with 11 streaming multiprocessors, each with 32 CUDA cores (11x32 CUDA cores in total, running at 607 MHz). 1 GB of GDDR5 memory is available, running at 1603 MHz. There are 48 KB of shared memory per block. The maximum power dissipation of the board is 200 W. The GPU is tested in a desktop environment with an Intel Xeon W3520 running at 2.67 GHz and 6 GB of DRAM. Our software configuration uses Windows 7 Ultimate (64 bits) with CUDA.

4. EXPERIMENTAL SETUP

This section details how the results were obtained. The set of filters for our tests is first described. Then we detail how performance, energy, and accuracy were measured.

4.1. Set of filters for testing

To demonstrate the system, we consider a popular bandpass filter implementation based on the Difference of Gaussians (DoG) filter with σ1 = 2, σ2 = 4 [1] (a standard closed form is given below). As the reference for comparison, we implement the DoG filter using 48 coefficients and double-precision arithmetic. The input image selected is the standard grayscale (8-bit) image Lena. Fig. 4 shows the ideal frequency response of the filter, together with the input and output images.

Figure 4. Frequency response of the ideal filter with N = 48.

We consider 6 filter implementations, each with a different number of coefficients (N = 8, 12, 16, 20, 24, 32). In addition, we consider 3 different frame sizes: 640x480 (VGA), 352x288 (CIF), and 176x144 (QCIF), derived from cropped versions of Lena (to preserve the frequencies). This results in 18 filtered images. In the case of the FPGA implementation, the bit-width of the coefficients is set at 16 bits. The row filter receives 8-bit pixels at the input and outputs 16-bit pixels. The column filter receives and outputs 16-bit pixels, taking advantage of the symmetry of the filter [3]. The system switches to a different 2D filter via DPR. This is realized by reconfiguring a different row filter at step 4 in Figure 2. Then, having streamed the image through the new row filter, we load the respective column filter at step 2. After that, we keep switching between these new row and column filters. As a result, the PRR size is that of the largest column filter.
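For reference, the textbook form of the 2D Difference of Gaussians with the parameters used here is sketched below. This is the standard definition rather than a formula reproduced from the paper; each Gaussian term factors into identical 1D Gaussians, which is what makes row/column (separable) filtering applicable to it.

```latex
% 2D Difference of Gaussians (DoG), with sigma_1 = 2 and sigma_2 = 4 as in the text:
\[
  \mathrm{DoG}(x,y)
  = \frac{1}{2\pi\sigma_1^{2}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma_1^{2}}}
  - \frac{1}{2\pi\sigma_2^{2}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma_2^{2}}}
  = g_{\sigma_1}(x)\,g_{\sigma_1}(y) - g_{\sigma_2}(x)\,g_{\sigma_2}(y),
  \qquad
  g_{\sigma}(t) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{t^{2}}{2\sigma^{2}}}.
\]
```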
In the case of the GPU implementation, the system is implemented with double floating-point numerical precision, although it could also be programmed with fixed-point arithmetic.

4.2. Energy, performance, and accuracy measurements

We measure performance in terms of frames per second (fps). In the case of the FPGA implementation, the processing time per frame includes: i) the row filtering process, ii) the column filtering process, iii) transposing the row-filtered image, and iv) the PRR reconfiguration (twice). The transposing of the row-filtered image occurs right after the filtering of the rows is completed, and two reconfigurations are needed per frame. The performance (fps) is then given by:

fps_FPGA = 1 / (t_rows + t_cols + t_transpose + 2*t_reconfig)    (1)

In the case of the GPU implementation, the processing time per frame includes: i) allocation of memory and data transfer from host to device, ii) frame processing, and iii) data transfer from device to host. We run the filters 1000 times and average each of these times:

fps_GPU = 1 / (t_alloc+transf(h->d) + t_process + t_transf(d->h))    (2)

With regard to energy measurements, we consider the energy consumption per frame.

In the FPGA case, the power drawn from the three Virtex-4 FPGA power sources (VCCINT, VCCAUX, VCCO) is obtained, which amounts to the embedded system power consumption. We use the Xilinx Power Analyzer (XPA) tool, which provides a more accurate estimate than the Xilinx Power Estimator (XPE) because it is based on simulated switching activity of the placed-and-routed circuit [11]. Our results are obtained with XPA at 25 ºC. We obtain the power drawn by both the row filter (P_row) and the column filter (P_col). Each filter variation amounts to a difference in resource usage and, in turn, a different power consumption. However, the filter core is small compared to the rest of the embedded system, so the power difference is not noticeable at the system level. As a result, it is more useful to consider the power drawn (both P_row and P_col) by the FIR Filter IP core alone.

The power consumption during reconfiguration is an important quantity, since the 2D FIR filter makes intensive use of DPR. Unfortunately, there is no tool available that can provide an estimate of this power consumption. In [12], hardware measurements determined that only the VCCAUX supply current increased during reconfiguration, by 25 mA for the XC4VFX12 device. This dynamic current does not depend on the device size, so we use the same current for our XC4VFX20 device. The reconfiguration power then results in:

P_reconfig_row = P_row + (25 mA)*V_CCAUX,   P_reconfig_col = P_col + (25 mA)*V_CCAUX    (3)

Note that P_reconfig_row is the power during reconfiguration of the row filter into a column filter; P_reconfig_col is defined in an analogous fashion. With the processing times of the row and column filters and the reconfiguration time, the energy per frame results in:

epf_FPGA = P_row*t_rows + P_col*t_cols + (P_reconfig_row + P_reconfig_col)*t_reconfig    (4)

In the case of the GPU implementation, similarly to [6], the current is measured with an ESI 687 clamp sensor on the power connectors. Both the external power of the GPU and the power provided through the PCIe bus (20 W max.) are considered. Note that we measure the power consumption of the whole board, which includes the GPU, memory, and other components. The average power during the tasks is measured, and thus the energy per frame results in:

epf_GPU = P_average_clamp * (t_alloc+transf(h->d) + t_process + t_transf(d->h))    (5)

Since the transfer and allocation times can be considered an offset that any GPU implementation has to deal with, we might also be interested in measuring the energy per frame spent only during the processing stage:

epf_GPU,proc = P_average_clamp * t_process    (6)

For accuracy measurements, we define accuracy in terms of the relative error between the FPGA- or GPU-processed frame and the result obtained using double precision with 48 coefficients. Consequently, we measure accuracy using the PSNR between the FPGA or GPU outputs and the double-precision implementation (48 coefficients). Note that the GPU implementation also uses double precision, but with a variable number of coefficients, while for the FPGA the error is due to the truncation in the number of coefficients and the use of fixed-point arithmetic (16 bits).
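As a worked illustration of the measurement model, the short host-side snippet below evaluates Equations (1)-(6). All numeric inputs are placeholders chosen only for the example (they are not measurements reported in the paper); only the formulas follow the text.

```cuda
// Sketch of the performance/energy models in Eqs. (1)-(6).
// All numbers below are placeholders, NOT values reported in the paper.
#include <cstdio>

int main()
{
    // FPGA side (times in seconds, powers in watts): placeholder values.
    double t_rows = 11.0e-3, t_cols = 11.0e-3, t_transpose = 4.2e-3, t_reconfig = 5.0e-3;
    double P_row = 0.10, P_col = 0.10, V_ccaux = 2.5 /* V, Virtex-4 VCCAUX rail */;
    double P_reconfig_row = P_row + 0.025 * V_ccaux;   // Eq. (3): +25 mA on VCCAUX
    double P_reconfig_col = P_col + 0.025 * V_ccaux;

    double fps_fpga = 1.0 / (t_rows + t_cols + t_transpose + 2.0 * t_reconfig);   // Eq. (1)
    double epf_fpga = P_row * t_rows + P_col * t_cols
                    + (P_reconfig_row + P_reconfig_col) * t_reconfig;             // Eq. (4)

    // GPU side: placeholder values.
    double t_alloc_h2d = 3.0e-3, t_process = 1.0e-3, t_d2h = 3.0e-3, P_clamp = 95.0;
    double fps_gpu      = 1.0 / (t_alloc_h2d + t_process + t_d2h);                // Eq. (2)
    double epf_gpu      = P_clamp * (t_alloc_h2d + t_process + t_d2h);            // Eq. (5)
    double epf_gpu_proc = P_clamp * t_process;                                    // Eq. (6)

    printf("FPGA: %.1f fps, %.3f mJ/frame\n", fps_fpga, epf_fpga * 1e3);
    printf("GPU : %.1f fps, %.3f mJ/frame (processing only: %.3f mJ)\n",
           fps_gpu, epf_gpu * 1e3, epf_gpu_proc * 1e3);
    return 0;
}
```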
5. RESULTS

5.1. FPGA resource usage and reconfiguration time

The PRR must accommodate the largest filter (the column filter with N = 32). Thus, the PRR occupies a tightly packed area of 2160 Virtex-4 slices, about 25% of the FPGA fabric. Table 1 shows the hardware resource usage of the embedded FIR filtering system of Figure 1, revealing the actual resource usage of the PRR and of the static region. Note that the largest column filter (N = 32) occupies 2125 slices (98% of the PRR slices). A reconfiguration speed of 3.28 MB/s is obtained with the Xilinx ICAP core, which determines the reconfiguration time for the given partial bitstream size.

Table 1. Embedded FIR filtering system resource utilization on the Virtex-4 XC4VFX20-11FF672, reporting Slice, FF, and LUT usage (in %) for the PRR (column filter), the static region, and the overall system.

5.2. Running times

In the FPGA case, t_rows and t_cols are in line with the FSL transfer speed of 226 Mbps reported in [2]. For example, for N = 32, t_rows = 10971, 3620, and 905 us for the VGA, CIF, and QCIF frame sizes, respectively. The number of coefficients plays a negligible role in the processing time because the FIR filter is a fully pipelined system in which the number of coefficients only increases the number of register levels, which in turn increases the initial latency of the pipeline (a latency that fades out for an input length larger than the number of coefficients). This effect is usually masked by the bus speed, with bus cycles longer than the register levels of the pipeline. System performance is limited by the time spent transposing the image (about 4152 us, 1453 us, and 379 us for the VGA, CIF, and QCIF frame sizes, respectively) and by the reconfiguration time.
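Before turning to the reconfiguration overhead, a rough consistency check on the FSL-limited times above (my arithmetic, not a figure from the paper): streaming a full 8-bit frame at the quoted effective rate of 226 Mbps predicts

```latex
% Predicted streaming time per frame at 226 Mbps, 8 bits per pixel:
\[
  t_{rows} \approx \frac{W \cdot H \cdot 8\ \text{bits}}{226\ \text{Mbps}}, \qquad
  t_{rows}^{\,VGA} \approx \frac{640 \cdot 480 \cdot 8}{226 \times 10^{6}}\ \text{s} \approx 10.9\ \text{ms},
\]
```

which is within about 1% of the 10971 us reported for VGA (and similarly about 3.6 ms for CIF and 0.9 ms for QCIF).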

The reconfiguration time achieved with the Xilinx ICAP controller significantly limits real-time system performance. With the use of the custom-made ICAP controller presented in [13], the reconfiguration time would be significantly reduced; for a fair comparison with the GPU, this reduced reconfiguration time is used instead. Note that the custom ICAP core has power requirements different from those of the Xilinx ICAP core; in practice, we expect the power difference to be negligible, since the custom ICAP core is a small (and low-power) circuit.

In the case of the GPU, we found that most of the time is consumed by the allocation of memory and the data transfers between host and device. Table 2 shows these times. Note that t_alloc+transf(h->d) and t_transf(d->h) are about the same for a given frame size. Also, the processing times do vary with the number of coefficients, unlike in the FPGA case.

Table 2. GPU running times (ms) for each frame size (640x480, 352x288, 176x144) and number of coefficients N; the reported quantities are t_process, t_alloc+transf(h->d), and t_transf(d->h).

5.3. Power measurements

In the case of the FPGA, the power consumption does not depend on the frame size; thus, it makes sense to report the results in terms of energy consumption per frame. Table 3 shows that the embedded system's power fluctuations due to the number of coefficients are negligible, since only the filter IP core is modified. It is therefore more meaningful to consider the power of the FIR Filter core, which does vary with N (the number of coefficients). The device static power depends exclusively on the device size and the operating temperature [11]; it is consumed by the device when it is powered up, before the user logic is programmed. For the XC4VFX20 device, it amounts to 166 mW (all 3 voltage rails) at 25 ºC. If the power results are to be meaningful across different devices, this quantity must be considered an offset that will vary from device to device.

Table 3. Embedded system power consumption (Watts) on the XC4VFX20-11FF672 Virtex-4 FPGA; the reported quantities are the mean and standard deviation of P_rows, P_cols, P_reconfig_row, and P_reconfig_col.

In the case of the GPU, we found that, on average, it consumes 96.8, 92.5, and 88 Watts for the VGA, CIF, and QCIF frame sizes, respectively. Variations for different numbers of coefficients are negligible (around 0.1 W), since the algorithm uses the maximum number of cores regardless of the number of filter coefficients. The power fluctuations across frame sizes are due to the fact that, for smaller frame sizes, the GPU spends relatively more time moving data than processing.

5.4. Energy, performance, and accuracy results

For comparing energy consumption, we only consider the energy spent by the filtering process. Thus, for the FPGA, we consider the energy consumed by the FIR filter and ICAP cores. For the GPU, we also consider the energy spent only during actual video processing (Equation 6). Figure 5 shows the energy per frame, performance (achieved frames per second), and accuracy results. Note that, in the case of performance, we report the mean fps with its standard deviation for a given frame size. We observe an energy dependence on the number of coefficients in the FPGA case, although it is more pronounced in the GPU case. In addition, the performance dependence on the number of coefficients is negligible in the FPGA case, but noticeable in the GPU case. In terms of PSNR (dB), the GPU gives better results due to its use of double precision. However, there is no significant difference at the output except for N = 32; in this case, we have very high PSNR values that exceed 80 dB.
In terms of performance, the GPU always prevails due to the massive parallelization achieved in the algorithm, coupled with its high operating frequencies. The speed-up (GPU over FPGA) is about 9X, 5X, and 3.3X for the VGA, CIF, and QCIF frame sizes, respectively; for smaller frame sizes, the time consumed in allocations and transfers is closer to the processing time. In terms of energy per frame, the FPGA implementation is much better than the GPU: the GPU implementation consumes 6, 9, and 19 times more energy than the FPGA's for the VGA, CIF, and QCIF frame sizes, respectively. Our results suggest that the FPGA implementation provides a low-energy solution at near real-time performance (here, we refer to frame rates over 30 fps as real-time performance). On the other hand, when energy consumption is not an issue, the GPU implementation is superior, delivering much higher performance at slightly better accuracy.

Figure 5. Performance, energy, and accuracy results for both FPGA and GPU, as a function of the number of coefficients N and the frame size (640x480, 352x288, 176x144). FPGA panels: energy per frame of the FIR core (mJ), mean fps with standard deviation, and PSNR (dB). GPU panels: energy per frame of the processing stage (mJ), mean fps with standard deviation, and PSNR (dB).

6. CONCLUSIONS

This work compares energy, performance (in frames per second), and accuracy for both FPGA and GPU implementations. Moreover, both implementations allow the user to modify the 2D FIR filter at run-time. The results indicate that separable 2D FIR filtering implementations can deliver excellent accuracy on both FPGAs and GPUs. However, based on energy consumption, FPGAs are preferred for low-energy applications. On the other hand, GPUs should be considered for high-performance, high-power (high-energy) applications.

7. REFERENCES

[1] A. Bovik, ed., Handbook of Image and Video Processing, Academic Press, 1st Edition.
[2] D. Llamocca, M. Pattichis, and A. Vera, "A Dynamically Reconfigurable Parallel Pixel Processing System," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL 2009), Prague, Czech Republic, Sep. 2009.
[3] D. Llamocca, M. Pattichis, and G. Alonzo Vera, "Partial Reconfigurable FIR Filtering System Using Distributed Arithmetic," International Journal of Reconfigurable Computing, vol. 2010, Article ID 357978, 14 pages, 2010.
[4] D. Llamocca and M. Pattichis, "Real-time dynamically reconfigurable 2-D filterbanks," in Proceedings of the 2010 IEEE Southwest Symposium on Image Analysis & Interpretation, Austin, TX, May 2010.
[5] R. Turney, "Two-dimensional Linear Filtering (XAPP933)," v1.1, Xilinx Inc., San Jose, CA.
[6] S. Collange, D. Defour, and A. Tisserand, "Power Consumption of GPUs from a Software Perspective," in Proceedings of the 9th International Conference on Computational Science (ICCS 2009), Springer, 2009.
[7] V. Podlozhnyuk, "Image Convolution with CUDA," NVIDIA.
[8] B. Cope, P.Y.K. Cheung, W. Luk, and S. Witt, "Have GPUs made FPGAs redundant in the field of video processing?," in Proceedings of the 2005 IEEE International Conference on Field Programmable Technology, Singapore, Dec. 2005.
[9] D.H. Jones, A. Powell, C.-S. Bouganis, and P.Y.K. Cheung, "GPU versus FPGA for High Productivity Computing," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL 2010), Milan, Italy, Sep. 2010.
[10] CUDA C Programming Guide, v3.2, NVIDIA.
[11] Power Methodology Guide (UG786), v13.1, Xilinx, San Jose, CA.
[12] G.A. Vera, "A dynamic arithmetic architecture: precision, power, and performance considerations," Ph.D. Dissertation, University of New Mexico, Albuquerque, NM, USA.
[13] C. Claus et al., "A multi-platform controller allowing for maximum dynamic partial reconfiguration throughput," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL 2008), Heidelberg, Germany, Sept. 2008.
