Bridging the Gap between FPGAs and Multi-Processor Architectures: A Video Processing Perspective


Ben Cope 1, Peter Y.K. Cheung 1 and Wayne Luk 2
1 Department of Electrical & Electronic Engineering, Imperial College London
2 Department of Computing, Imperial College London
{benjamin.cope, p.cheung, w.luk}@imperial.ac.uk

Abstract

This work explores how the graphics processing unit (GPU) pipeline model can influence future multi-core architectures which include reconfigurable logic cores. The design challenges of implementing five algorithms on two field programmable gate arrays (FPGAs) and two GPUs are explained and the performance results contrasted. Explored algorithm features include data dependence, flexible data reuse patterns and histogram generation. A customisable SystemC model is implemented which demonstrates that features of the GPU pipeline can be transferred to a general multi-core architecture. The customisations are: choice of processing unit (PU); processing pattern; and on-chip memory organisation. Example tradeoffs are: the choice of processing pattern for histogram equalisation; the choice of number of PUs; and memory sizing for motion vector estimation. It is shown that a multi-core architecture can be optimised for video processing by combining a GPU pipeline with cores that support reconfigurable datapath operations.

1. Introduction

Whilst dual-core CPUs are becoming commonplace, NVIDIA has released the 128-core GeForce 8800 GTX GPU. GPU developers overcome the design and programming issues of multi-core scalability through a constrained pipeline model. This puts the GPU in direct competition with the FPGA as an accelerator in a video processing system. This work explores how the pipeline model of the GPU can influence future multi-core video processing architectures to improve performance and reduce design effort. The core of such an architecture can be a processor or FPGA logic.

The contributions of this work are: 1) the presentation of FPGA and GPU design challenges and performance results for five algorithms; 2) a customisable multi-core SystemC model; 3) an exploration of model customisations; and 4) a summary of lessons for future multi-core architectures.

The paper is organised as follows: section 2 outlines related literature; key GPU pipeline features are described in section 3; section 4 describes FPGA and GPU design challenges for five algorithms; FPGA and GPU performance results are analysed in section 5; section 6 explores the customisable model; and conclusions are presented in section 7.

2. Related Work

FPGA architectures can be used to exploit regular data access patterns and parallelism in video processing algorithms. Sonic-on-Chip [1] is an example multi-core FPGA architecture which exploits these factors. One application of Sonic-on-Chip is 3-step non-full-search motion vector estimation (NFS-MVE); a four channel (core) implementation achieves a target throughput of 6.8 million pixels per second (MP/s) [1].

The GPU is another multi-core architecture capable of exploiting data access locality and parallelism in algorithms. The literature shows examples of one to two orders of magnitude performance improvement for the GPU over the CPU [2, 3, 4]. The work presented here combines the GPU pipeline with the flexibility of reconfigurable logic cores.

Previous work by the authors on comparing the FPGA and GPU focused on an algorithm's match to the GPU instruction set and the effect of changing arithmetic intensity [2].
The arithmetic intensity of an algorithm is the ratio of the number of arithmetic operations to the number of memory accesses. The results in section 5 additionally consider algorithms which require data dependence, memory gather and flexible data reuse. Memory gather in this work means that an entire video frame is required to produce a small number of output values, for example a cumulative histogram. These demonstrate higher level characteristics of the FPGA and GPU architectures.

3. GPU Architecture

The GPU cores of the NVIDIA GeForce 6800 GT and 7800 GTX graphics platforms are considered in this work. Key GPU architecture features, relevant to this work, are explained below. The focus is on the fragment processing stage of the GPU pipeline; for further details and background on the entire GPU pipeline refer to [6].

The GPU pipeline exploits iteration level parallelism through multiple fragment processors organised in groups of four (known as quads). The number of quads varies with the GPU model, as shown in Table 1. The GPU is scalable to multiple quads because of two constraints: no result sharing between fragment processors, and a feed-through pipeline. Without these constraints, multi-core scalability would be restricted by synchronisation requirements.

The feed-through pipeline of the GPU ensures that an external memory location can only be read from or written to, but not both, during each processing pass. This means that some algorithms require a multi-pass implementation [3]. Two examples of this are described in section 4. A further feature which simplifies the GPU programming model, and promotes scalability, is that all fragment processors execute the same program code. This code is commonly referred to as a fragment shader.

To support the above features of the GPU pipeline, an efficient memory hierarchy is required. The components of the hierarchy are off-chip memory, GPU caches and output buffering. The memory requirements of graphics rendering demand the large memory bandwidth between the GPU and off-chip memory shown in Table 1. A shared cache is used to exploit data reuse between off-chip memory and the fragment processors. The literature suggests a 4-way associative 16 KByte cache [7].

| Model (Release Year) | Fragment Processors | Memory Bandwidth | Core Clock |
|----------------------|---------------------|------------------|------------|
| GF 7800 GTX (2005)   | 24 (6 quads)        | 38.4 GBytes/sec  | 430 MHz    |
| GF 6800 GT (2004)    | 16 (4 quads)        | 32 GBytes/sec    | 350 MHz    |

Table 1. Specifications of the NVIDIA GeForce 6800 GT and 7800 GTX [5]. (The fragment processor counts follow from the quoted quad counts; the bandwidth and clock values, lost in extraction, are restored from NVIDIA's published specifications.)

4. Benchmark Algorithms

Table 2 characterises five algorithms which represent features common to video processing. Characteristics which present interesting design challenges for FPGA and GPU implementation are described below.

| Algorithm | Memory Accesses (per pixel) | Reuse Potential | Memory Access Pattern | Arithmetic Intensity |
|-----------|-----------------------------|-----------------|------------------------|----------------------|
| Bi-cubic Interpolation | 16 | low-medium | predictable | low |
| Histogram Equalisation (HE): bin calculation (of cumulative histogram) | 1 (gather) | null | predictable | medium |
| HE: application | 1 + 1 (lookup table) | null | predictable | low |
| 3-step non-full-search motion vector estimation (NFS-MVE) [1] | – | high | locally random | high |
| Primary Colour Correction (PCCR) [2] | 1 | null | predictable | high |
| 2D Convolution (size n×n) [2] | n² | high | predictable | low |

Table 2. Characterisation of a representative set of video processing algorithms

Bi-cubic interpolation can be used to resize a video frame to an arbitrary higher resolution. This flexibility is indicated by the low-to-medium reuse potential in Table 2. An FPGA architecture exploits data reuse by using on-chip buffers to store rows of a video frame; supporting arbitrary resizing therefore requires a complex heuristic to implement a flexible reuse strategy. GPU implementation of flexible reuse is straightforward because it is a feature supported by the GPU's memory hierarchy.
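To make the 16-accesses-per-pixel figure in Table 2 concrete, the following is a minimal C++ sketch of one bi-cubic output sample. The Catmull-Rom kernel, image layout and border clamping are illustrative assumptions, not details of the implementations benchmarked in this paper.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Catmull-Rom cubic kernel weight for an offset t in (-2, 2).
static float cubicWeight(float t) {
    t = std::fabs(t);
    if (t < 1.0f) return 1.5f * t * t * t - 2.5f * t * t + 1.0f;
    if (t < 2.0f) return -0.5f * t * t * t + 2.5f * t * t - 4.0f * t + 2.0f;
    return 0.0f;
}

// One output pixel at real-valued source position (x, y):
// a 4x4 neighbourhood gather, i.e. 16 memory accesses per pixel.
float bicubicSample(const std::vector<float>& img, int w, int h,
                    float x, float y) {
    int x0 = static_cast<int>(std::floor(x));
    int y0 = static_cast<int>(std::floor(y));
    float acc = 0.0f;
    for (int j = -1; j <= 2; ++j) {
        for (int i = -1; i <= 2; ++i) {
            int xi = std::clamp(x0 + i, 0, w - 1);  // clamp at borders
            int yj = std::clamp(y0 + j, 0, h - 1);
            float wgt = cubicWeight(x - float(x0 + i)) *
                        cubicWeight(y - float(y0 + j));
            acc += wgt * img[yj * w + xi];
        }
    }
    return acc;
}
```

Adjacent output pixels share most of their 4×4 input neighbourhood; the GPU cache captures this reuse automatically, whereas an FPGA design must capture it explicitly with line buffers sized according to the arbitrary resize factor.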
Histogram equalisation (HE) is a special case of memory access which requires a memory gather: HE calculation requires an entire video frame to be reduced to, in this example, a 256-bin cumulative histogram, matching the 8-bit resolution of video data. A multi-pass GPU implementation is required: four rendering passes are needed to calculate the histogram, and a final pass applies it to the video frame to equalise the intensity spectrum. The FPGA implementation requires a ROM decoder coupled with 256 parallel accumulators for HE calculation, and a lookup table for HE application.

NFS-MVE is a three-step algorithm which, on each step and for one search window, computes nine (a 3×3 grid of) sum of absolute difference (SAD) calculations with overlapping windows in a reference frame. The reference frame window locations in steps 2 and 3 depend on a comparison of the SAD results from previous steps. The access pattern for NFS-MVE is termed locally random because the memory access in steps 2 and 3 is data dependent within a spatially local area. GPU implementation of NFS-MVE also requires a multi-pass approach: each of the three steps requires three GPU passes, for a total of nine rendering passes. The FPGA implementation of NFS-MVE is taken from work by Sedcole [1]; implementation highlights were mentioned in section 2.

The primary colour correction and 2D convolution algorithms are examples of high arithmetic intensity and variable memory access requirements respectively [2]. For full implementation details and code for all the algorithms in Table 2, please contact the primary author.
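As a concrete reference for the locally random access pattern described above, here is a hedged C++ sketch of one step of the 3-step search. The 16×16 block size and frame layout are assumptions for illustration (bounds checks omitted), not details of the Sedcole implementation [1].

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Sum of absolute differences between a current-frame block and a
// candidate block in the reference frame.
uint32_t sad16x16(const std::vector<uint8_t>& cur,
                  const std::vector<uint8_t>& ref,
                  int stride, int cx, int cy, int rx, int ry) {
    uint32_t sum = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x)
            sum += std::abs(int(cur[(cy + y) * stride + (cx + x)]) -
                            int(ref[(ry + y) * stride + (rx + x)]));
    return sum;
}

// One step of the 3-step search: evaluate a 3x3 grid of candidates
// spaced `step` pixels around the current best offset. The winning
// offset determines where the next step reads, hence the data-dependent,
// "locally random" access pattern.
void searchStep(const std::vector<uint8_t>& cur,
                const std::vector<uint8_t>& ref, int stride,
                int cx, int cy, int step, int& bestDx, int& bestDy) {
    uint32_t best = UINT32_MAX;
    int baseDx = bestDx, baseDy = bestDy;
    for (int j = -1; j <= 1; ++j)
        for (int i = -1; i <= 1; ++i) {
            int dx = baseDx + i * step, dy = baseDy + j * step;
            uint32_t s = sad16x16(cur, ref, stride, cx, cy, cx + dx, cy + dy);
            if (s < best) { best = s; bestDx = dx; bestDy = dy; }
        }
}
```

The winning offset of each step becomes the centre of the next, so the addresses touched in steps 2 and 3 cannot be known until the previous step's SAD comparison completes.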

Figure 1. Throughput, in millions of pixels per second (MP/s), of the benchmark algorithms on the FPGA (Virtex 2, Virtex 4) and GPU (GeForce 6800 GT, 7800 GTX): interpolation (variants (A), (B) and (C) denote three different resize resolutions), 2D convolution (5×5, 9×9, 11×11), primary colour correction, histogram equalisation and NFS-MVE.

5. Benchmarking the FPGA and GPU

Performance results for the FPGA and GPU are shown in Figure 1. The FPGA implementations are coded in VHDL and synthesised using the Xilinx ISE 8.2i design suite; post place-and-route speeds are shown. GPU code is compiled using the NVIDIA Cg compiler version 1.5. The GLSLProgram library, created at Imperial College London, is used to create a layer of abstraction between C++ code and the OpenGL API. CPU cycle counts are used to measure GPU throughput. All benchmarks are implemented by the primary author, with the exception of NFS-MVE on the FPGA [1]. The primary colour correction and 2D convolution results are updates of work published elsewhere [2].

An interesting result in Figure 1 is that the GeForce 7800 GTX implementation of primary colour correction outperforms the Virtex 4. This is the opposite of the result shown in prior work [2] comparing the then state-of-the-art GeForce 6800 GT and Virtex 2, and demonstrates the rapid performance improvement of modern GPUs.

For bi-cubic interpolation, the GeForce 7800 GTX shows superior performance of up to three times that of the FPGA. GPUs are well suited to algorithms with arbitrary reuse patterns over a small window size (4×4 for bi-cubic interpolation) because this mimics graphics rendering. The FPGA speed is limited by the complex heuristic required for the reuse pattern control logic. This is exemplified further by comparing the throughput of the FPGA interpolation and 2D convolution (size 5×5) implementations: 2D convolution has higher throughput despite the larger window size, because of the complex data reuse control logic required to implement arbitrary bi-cubic interpolation on the FPGA.

For histogram equalisation (HE) the FPGA outperforms the GPU by over three times. This is due to the five-pass algorithm, described in section 4, required for GPU implementation. The four passes required for cumulative histogram generation take 92% of the execution time. This highlights a limitation of the GPU when the designer desires to gather data from an entire frame, for example to generate a cumulative histogram. It is a result of the restriction that fragment processors cannot share computation results, and additionally that a fragment processor cannot hold computation results between the pixels it processes. These limitations are explored further in section 6.4.

Data dependence for NFS-MVE is considered by using two video test sources provided by the project sponsor. GPU performance results are indistinguishable for any degree of video motion. This is a product of the GPU's computer graphics rendering heritage, which requires high performance for locally random memory accesses. NFS-MVE appears slow on the FPGA because the architecture is targeted at a desired throughput rate rather than maximal performance [1]. The performance of the GPU implementation of NFS-MVE is also low relative to the other algorithms, because of the large memory access requirements shown in Table 2.

A multi-pass GPU implementation is not always a problem. For HE a more intuitive implementation can be made on the FPGA.
However, for NFS-MVE the GPU approach was shown to have the highest performance despite a nine-pass implementation. The requirement to implement a multi-pass approach is an acceptable sacrifice for the design and scalability benefits of the fixed GPU pipeline. A further interesting benefit of the GPU pipeline is its support for arbitrary data reuse patterns.
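The feed-through constraint can be made concrete with a CPU-side sketch of the GPU's multi-pass idiom: each pass reads one buffer and writes another, so a frame-wide gather must be decomposed into repeated ping-pong passes. This is a simplified model of the rendering passes, not the shader code used in the benchmarks.

```cpp
#include <vector>

// Emulate a GPU reduction under the feed-through constraint: each pass
// reads only from `src` and writes only to `dst` (no in-place update,
// no result sharing between "fragments"). Each pass here halves the
// element count by summing pairs, so log2(N) passes reduce a frame to
// one value; this is why frame-wide gathers need several rendering passes.
float reduceMultiPass(std::vector<float> src) {
    std::vector<float> dst;
    while (src.size() > 1) {                 // one "rendering pass"
        dst.resize((src.size() + 1) / 2);
        for (size_t i = 0; i < dst.size(); ++i) {
            float a = src[2 * i];
            float b = (2 * i + 1 < src.size()) ? src[2 * i + 1] : 0.0f;
            dst[i] = a + b;                  // each output reads two inputs
        }
        src.swap(dst);                       // ping-pong the buffers
    }
    return src.empty() ? 0.0f : src[0];
}
```

In a real shader each pass reduces a 2D region by a larger factor per output, which is consistent with a 256-bin histogram being gathered in the four passes described in section 4.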

Figure 2. A multi-core model (PU = processing unit, MMU = memory management unit). A pattern generation module feeds n processing groups, each containing m PUs behind a local MMU; the groups share on-chip memory and a global memory I/O to off-chip memory, and write results to an output buffer.

Figure 3. Cycle estimates (×10^6) for benchmarks implemented on the multi-core model with m = 4 and n = 4, 6, 8, plotted alongside the GeForce 6800 GT, for PCCR, interpolation (variants A, B, C) and 2D convolution (5×5, 9×9, 11×11), for a fixed pixel frame size.

6. A Multi-Core Architecture

This section explores the application of the GPU pipeline to a customisable multi-core architecture. In section 5 the GPU architecture was shown to be beneficial for implementing algorithms which require arbitrary data reuse, but a limitation when generating a cumulative histogram. It is desired to include a subset of GPU pipeline features in a model of the customisable multi-core architecture. Two GPU pipeline features, identified in section 3 as key to the model's scalability, are a feed-through pipeline and no data sharing between fragment processors. The model's customisable options are: number of PUs (scalability); on-chip memory; processing pattern; and choice of processing unit (PU).

Section 6.1 describes the model. Three algorithms are used to verify the model and show scalability in section 6.2. In section 6.3 on-chip memory choice is explored through NFS-MVE. HE is used to explore the effect of a controlled processing pattern in section 6.4. Section 6.5 shows options for the choice of processing unit, and general lessons for future video processing architectures are presented in section 6.6.

6.1. The Model and Initial Setup

The IEEE SystemC class library is chosen as the implementation platform. The flexibility of SystemC allows a high-level customisable multi-core model to be created. Figure 2 shows a high-level diagram of the model, which is a modified and simplified version of the fragment processing stage of the GPU pipeline [6]. The model and its differences from the GPU pipeline are described below.

The multi-core hierarchy has m processing units (PUs) implemented in each of the n processing groups. Each processing group processes a block of m×m pixels concurrently. For m = 4 and n = 4 this is similar to the GeForce 6800 GT fragment processor set-up.

The feed-through pipeline is formed as follows. The pattern generation module supplies each processing group with the order in which to process pixels. The memory hierarchy is restricted such that data can only be fetched from one memory location and written to another. Outputs from the processing groups are combined in an output buffer.

The pixel processing order is produced by the pattern generation module and can be set to any arbitrary sequence. This is in contrast to the GPU, where the pixel processing order is determined by the output of the vertex processing and rasterisation pipeline stages.

Processing units (PUs) connect to memory through a local memory management unit (MMU). Each MMU accesses pixels from the on-chip memory on behalf of m PUs. If the pixels are not available in on-chip memory, the MMU arbitrates through the global I/O for off-chip memory access. Concurrent memory accesses by PUs are handled by the MMU; for conflicts between MMUs, concurrent accesses are handled by the global I/O. A round-robin scheme is used for arbitration, similar to the method used for the texture unit (MMU) and fragment processors (PUs) in GPUs [8].
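A minimal C++ sketch of round-robin arbitration between requesters follows; the request/grant interface is an assumption made for illustration rather than the model's actual SystemC interface.

```cpp
#include <vector>

// Round-robin arbiter: grants one of the pending requesters per cycle,
// starting the search one position after the last grant so that every
// MMU (or PU) is served fairly. Returns -1 when nothing is requested.
class RoundRobinArbiter {
public:
    explicit RoundRobinArbiter(int nPorts) : n_(nPorts), last_(nPorts - 1) {}

    int grant(const std::vector<bool>& request) {
        for (int i = 1; i <= n_; ++i) {
            int port = (last_ + i) % n_;
            if (request[port]) { last_ = port; return port; }
        }
        return -1;  // no requester this cycle
    }

private:
    int n_;     // number of ports
    int last_;  // port granted on the previous cycle
};
```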
The initial on-chip memory setup is a 4-way associative 16 KByte cache with 64-pixel cache lines. For verification purposes the PU model is a simple linear model of the GPU fragment processor, which models the execution time and memory access pattern of an algorithm. This allows high-level architecture exploration whilst keeping the data level parallelism within each PU fixed. The initial processing order is chosen to be an iterative z-pattern [9], a commonly used memory addressing pattern which exploits the spatial locality of neighbouring pixels.
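The iterative z-pattern [9] corresponds to a Morton-order traversal; assuming that interpretation, a compact C++ sketch of the index-to-coordinate mapping a pattern generator could use is given below.

```cpp
#include <cstdint>

// Extract the even-numbered bits of a Morton (z-order) index.
// De-interleaving the index bits gives the (x, y) of the i-th pixel in
// the z-pattern, which keeps successive pixels spatially close and so
// improves on-chip cache hit rates.
static uint32_t evenBits(uint32_t v) {
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

// The i-th position along the z-order curve: x from the even bits of i,
// y from the odd bits.
void zOrderToXY(uint32_t i, uint32_t& x, uint32_t& y) {
    x = evenBits(i);
    y = evenBits(i >> 1);
}
```

Successive indices map to spatially neighbouring pixels, so consecutive PU requests tend to hit the same cache lines; for a non-square frame the curve would be applied per power-of-two tile.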

6.2. Model Verification and Scalability

In this section the model described in section 6.1 is verified and its scalability demonstrated. This is achieved by implementing three algorithms from Table 2 on the model. Figure 3 shows results for m = 4, n = 4, 6, 8 and a fixed frame size, with the trend for the GeForce 6800 GT performance plotted alongside. It can be seen that the trend of the model with m, n = 4 and the GeForce 6800 GT are similar.

The model with m, n = 4 has a higher cycle count than the GeForce 6800 GT for interpolation and 2D convolution. The reason is that the GPU fragment processors are highly multi-threaded to hide memory access latency. This is a limitation of the model, which does not include multi-threading. However, the aim is to model the high-level architecture features of the GPU and not the benefits of multi-threading.

Model variations with n = 6, 8 are shown to demonstrate scalability. Primary colour correction has high arithmetic intensity, and therefore its cycle count for n = 8 is almost half that for n = 4. The cycle count ratio for interpolation is 1.8 times, due to the behaviour of the on-chip memory (cache) with increasing n: as n increases, the situation arises where one PU may evict a data item from the cache while another PU still needs to reuse it. 2D convolution also scales by less than two times for a doubling of processing groups, and this effect becomes more prominent as the convolution kernel size increases.

6.3. On-Chip Memory Flexibility

The NFS-MVE benchmark requires nine passes for implementation on the GPU. This nine-pass GPU method is used to implement NFS-MVE on the model for m = 4 and n = 1, 2, 4. The results are shown in row one of Table 3. NFS-MVE is a memory intensive algorithm, therefore on-chip memory (cache) size is key to performance. Results for two- and four-fold increases in cache size, keeping all other factors constant, are also shown. The linear processing unit model from section 6.1 is maintained for all results.

| Cache Changes (size) | Model (m=4, n=1) | Model (m=4, n=2) | Model (m=4, n=4) |
|----------------------|------------------|------------------|------------------|
| Original (16k)       | 4.8 (17k)        | 3.6 (36k)        | 2.1 (70k)        |
| ×2 (32k)             | 4.6 (12k)        | 2.5 (17k)        | 1.4 (30k)        |
| ×4 (64k)             | 4.3 (9k)         | 2.2 (15k)        | 1.1 (16k)        |

Table 3. Model performance in clock cycles ×10^6 and (off-chip memory reads) for nine-pass NFS-MVE over a video frame

Table 3 demonstrates a key feature of the shared memory structure of the architecture of Figure 2: the effect the processing units have on each other by fetching new pixels into the cache. For memory intensive algorithms, increasing the number of PUs by a factor of two will not achieve a two-fold performance improvement; in some cases the improvement can be much less. This is demonstrated by the differences in model performance with varying n above. As the memory size is increased, the scaling of performance improvement with n also increases, because each processing unit now evicts less of the data that other PUs require. The off-chip memory read counts verify these observations: they reduce in proportion to the performance improvement between cache sizes.

6.4. A Controlled Processing Pattern

The benchmark implementations above were implemented in the same manner as on the GPU. A benefit of the controllable processing pattern of the multi-core model is now shown. The performance of histogram equalisation (HE) on the GPU was shown in Figure 1 to be inferior to the FPGA, due to the GPU's inflexibility in implementing a memory gather. Improved performance can be achieved on the model of Figure 2, over the GPU, because a PU can hold computation results between processed pixels. This is possible because the processing order is controllable. Consider distributing the histogram calculation across multiple PUs in Figure 2.
For n, m = 4 this equates to 16 accumulators in each processing unit to generate the 256-bin cumulative histogram. Each PU must now access the entire frame of pixels for the full histogram to be computed, which can be achieved through the correct choice of processing order (a code sketch of this arrangement follows at the end of this subsection). The result for HE calculation is shown in Table 4, which also shows the implementation of the original four-pass GPU technique on the model for comparison. For a fair comparison, the linear PU model of section 6.1 is maintained for both model implementations; the improved method therefore performs the accumulations in series.

Table 4. Model performance in clock cycles ×10^7 for HE calculation over a frame: the original four-pass method and the improved single-pass method, on the model and on the GeForce 6800 GT. (The numeric entries of this table did not survive extraction.)

To verify the improved model implementation, a frame sixteen times larger than the target frame size is rendered on the GeForce 6800 GT. This is used to model memory access and performance by modelling the compute latency of 16 accumulators on the GPU; the result is shown in Table 4. The improved implementation on the multi-core architecture provides a two-fold improvement over the original method. This is achieved by removing the multiple passes of the original HE design, keeping all other factors constant.
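A minimal C++ sketch of the improved single-pass arrangement described above, with the 16 PUs modelled serially (matching the linear PU model); the data types and loop structure are illustrative assumptions.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Improved single-pass histogram: with 16 PUs, PU p owns the 16 bins
// [16p, 16p + 15]. Every PU walks the entire frame (the controllable
// processing pattern guarantees this) but keeps its partial counts in
// local accumulators between pixels -- the ability a GPU fragment
// processor lacks.
std::array<uint32_t, 256> histogram16PUs(const std::vector<uint8_t>& frame) {
    std::array<uint32_t, 256> bins{};        // zero-initialised
    for (int pu = 0; pu < 16; ++pu) {        // 16 PUs, modelled serially
        int lo = pu * 16;
        for (uint8_t v : frame) {            // this PU's full-frame scan
            if (v >= lo && v < lo + 16)      // pixel in this PU's bin slice?
                ++bins[v];
        }
    }
    return bins;
}
```

A prefix sum over the 256 bins then yields the cumulative histogram used as the equalisation lookup table.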

6.5. Processing Unit (PU) Choice

The above analysis uses a fixed PU model to maintain an equal level of parallelism for each PU. Different PU options are now discussed by comparing the choice of a fragment processor PU with a reconfigurable datapath PU. A fragment processor can perform at peak eight (two 4-vector) FLOPs per cycle [6]; it exploits no inter-pixel data reuse and has a high core clock rate, as shown in Table 1. For reconfigurable datapaths all of these features are variables. The resource usages of four example FPGA (reconfigurable datapath) implementations, taken from section 5, are shown in Table 5. The number of BRAMs indicates the degree of data reuse; slice usage indicates the amount of parallelism. The clock rates of the FPGA implementations are lower than that of a fragment processor (the FPGA clock rate equals the FPGA throughput in Figure 1).

The first observation is a tradeoff between the high clock rate and fixed parallelism of a fragment processor against the arbitrary data reuse and parallelism of a reconfigurable datapath. These are characteristics of a hardware-software tradeoff.

| Benchmark | Slices | BRAMs |
|-----------|--------|-------|
| Bi-cubic Interpolation | 2.7k | 24 |
| 2D Convolution (5×5) | – | 12 |
| 2D Convolution (9×9) | – | 24 |
| 2D Convolution (11×11) | – | 30 |
| Primary Colour Correction | 3.6k | 0 |
| Histogram Equalisation (HE) | 10.6k | 0 |

Table 5. Slice count and Block RAM usage for the benchmark algorithms on the FPGA. (The convolution kernel sizes are inferred from Figure 1; their slice counts did not survive extraction.)

Fragment processors are designed to be scalable, as outlined in section 3. For a reconfigurable datapath PU, scalability depends on whether and how data reuse is exploited. The implementations of primary colour correction and HE on a reconfigurable datapath exploit no data reuse, which makes them inherently scalable. For 2D convolution and interpolation the data reuse strategy must be considered: for scalability, subsequent frames (or regions of a frame) must be assigned to different PUs, which can be implemented by controlling the processing pattern.

The design effort to implement the interpolation and 2D convolution algorithms in a reconfigurable datapath must be traded against their data reuse and parallelism benefits. The requirement for a complex heuristic to implement flexible data reuse in a reconfigurable datapath makes a fragment processor PU the suitable option for interpolation. The extra parallelism available from a reconfigurable datapath makes it the most desirable option for HE. For 2D convolution a regular data reuse pattern can be exploited, which also makes the reconfigurable datapath PU the superior option.

6.6. Generalisations from Model Changes

It has been shown that a subset of GPU pipeline features can be applied to a customisable multi-core model. A feed-through pipeline and no data sharing between processing units (PUs) are features key to multi-core scalability. Customisation of the number of PUs, the memory organisation and the processing pattern are necessary variables for a multi-core architecture. The generality of the multi-core architecture means that the option of a reconfigurable datapath PU can be considered. The choice between a fragment processor and a reconfigurable datapath PU is a tradeoff between data reuse (which influences scalability) and parallelism.

7. Conclusion

Five video processing algorithms have been implemented on the FPGA and GPU. Algorithms which require a regular data reuse pattern perform well on the FPGA; for irregular reuse patterns the GPU outperforms the FPGA. Histogram equalisation (HE) and NFS-MVE are two algorithms which require multi-pass GPU implementations. Although the GPU performance is 50 MP/s, an FPGA implementation of HE outperforms the GPU by three times.

A customisable multi-core model based on the GPU pipeline has been implemented using the SystemC library. Features of the GPU pipeline and model customisations are explored through the model. A feed-through pipeline and no data sharing between PUs are two GPU pipeline features which are key to multi-core scalability.
It is concluded that a multi-core architecture can be optimised for video processing by combining a GPU pipeline with cores that support reconfigurable datapath operations. Future work will involve the exploration of other multi-core architecture features, such as those of the STI Cell Broadband Engine and the GeForce 8800 GTX.

We gratefully acknowledge the support provided by the UK Engineering and Physical Sciences Research Council (EP/C549481/1) and Sony Broadcast & Professional Europe, and thank Jay Cornwall and Lee Howes of Imperial College London for providing GPU programming advice and the GLSLProgram class library.

References

[1] P. Sedcole. Reconfigurable Platform-Based Design in FPGAs for Video Image Processing. PhD thesis, University of London, Jan. 2006.
[2] B. Cope, P.Y.K. Cheung, W. Luk, and S. Witt. Have GPUs made FPGAs redundant in the field of video processing? In Proc. Field Programmable Technology, Dec. 2005.
[3] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A.E. Lefohn, and T.J. Purcell. A survey of general-purpose computation on graphics hardware. In Eurographics 2005, State of the Art Reports, Aug. 2005.
[4] P. Colantoni, N. Boukala, and J.D. Rugna. Fast and accurate color image processing using 3D graphics cards. In Proc. Vision, Modeling and Visualization, 2003.
[5] NVIDIA Corporation.
[6] M. Pharr and R. Fernando. GPU Gems 2. Addison Wesley, 2005.
[7] V. Moya, C. Gonzalez, J. Roca, and A. Fernandez. Shader performance analysis on a modern GPU architecture. In 38th Annual IEEE/ACM International Symposium on Microarchitecture, 2005.
[8] J. Lindholm, J. Nickolls, S. Moy, and B. Coon. Register based queuing for texture requests. United States Patent No. US 7,027,062 B2, 2006.
[9] C. Priem, G. Solanki, and D. Kirk. Texture cache for a computer graphics accelerator. United States Patent No. US 7,136,068 B1, 2006.


High Speed Special Function Unit for Graphics Processing Unit High Speed Special Function Unit for Graphics Processing Unit Abd-Elrahman G. Qoutb 1, Abdullah M. El-Gunidy 1, Mohammed F. Tolba 1, and Magdy A. El-Moursy 2 1 Electrical Engineering Department, Fayoum

More information

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit

More information

Aiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR.

Aiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR. 2015; 2(2): 201-209 IJMRD 2015; 2(2): 201-209 www.allsubjectjournal.com Received: 07-01-2015 Accepted: 10-02-2015 E-ISSN: 2349-4182 P-ISSN: 2349-5979 Impact factor: 3.762 Aiyar, Mani Laxman Dept. Of ECE,

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel

More information

Graphics Processing Unit Architecture (GPU Arch)

Graphics Processing Unit Architecture (GPU Arch) Graphics Processing Unit Architecture (GPU Arch) With a focus on NVIDIA GeForce 6800 GPU 1 What is a GPU From Wikipedia : A specialized processor efficient at manipulating and displaying computer graphics

More information

Specializing Hardware for Image Processing

Specializing Hardware for Image Processing Lecture 6: Specializing Hardware for Image Processing Visual Computing Systems So far, the discussion in this class has focused on generating efficient code for multi-core processors such as CPUs and GPUs.

More information

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing) ME 290-R: General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing) Sara McMains Spring 2009 Lecture 7 Outline Last time Visibility Shading Texturing Today Texturing continued

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information