Bridging the Gap between FPGAs and Multi-Processor Architectures: A Video Processing Perspective
Ben Cope 1, Peter Y. K. Cheung 1 and Wayne Luk 2
1 Department of Electrical & Electronic Engineering, Imperial College London
2 Department of Computing, Imperial College London
{benjamin.cope, p.cheung, w.luk}@imperial.ac.uk

Abstract

This work explores how the graphics processing unit (GPU) pipeline model can influence future multi-core architectures which include reconfigurable logic cores. The design challenges of implementing five algorithms on two field programmable gate arrays (FPGAs) and two GPUs are explained and the performance results contrasted. The algorithm features explored include data dependence, flexible data reuse patterns and histogram generation. A customisable SystemC model is implemented which demonstrates that features of the GPU pipeline can be transferred to a general multi-core architecture. The customisations are: choice of processing unit (PU); processing pattern; and on-chip memory organisation. Example tradeoffs are: the choice of processing pattern for histogram equalisation; the number of PUs; and memory sizing for motion vector estimation. It is shown that a multi-core architecture can be optimised for video processing by combining a GPU pipeline with cores that support reconfigurable datapath operations.

1. Introduction

Whilst dual-core CPUs are becoming commonplace, NVIDIA has released the GeForce 8800 GTX 128-core GPU. GPU developers overcome the design and programming issues of multi-core scalability through a constrained pipeline model. This puts the GPU in direct competition with the FPGA as an accelerator in a video processing system. This work explores how the pipeline model of the GPU can influence future multi-core video processing architectures to improve performance and reduce design effort. The core of such an architecture can be a processor or FPGA logic.
The contributions of this work are: 1) the presentation of FPGA and GPU design challenges and performance results for five algorithms; 2) a customisable multi-core SystemC model; 3) an exploration of model customisations; and 4) a summary of lessons for future multi-core architectures.

The paper is organised as follows: section 2 outlines related literature; key GPU pipeline features are described in section 3; section 4 describes FPGA and GPU design challenges for five algorithms; FPGA and GPU performance results are analysed in section 5; section 6 explores the customisable model; and conclusions are detailed in section 7.

2. Related Work

FPGA architectures can be used to exploit regular data access patterns and parallelism in video processing algorithms. Sonic-on-chip [1] is an example multi-core FPGA architecture which exploits these factors. One application of sonic-on-chip is 3-step non-full-search motion vector estimation (NFS-MVE); a four-channel (core) implementation achieves a target throughput of 6.8 million pixels per second (MP/s) [1]. The GPU is another multi-core architecture capable of exploiting data access locality and parallelism in algorithms. The literature shows examples of one to two orders of magnitude performance improvement for the GPU over the CPU [2, 3, 4]. The work presented here combines the GPU pipeline with the flexibility of reconfigurable logic cores.

Previous work by the authors on comparing the FPGA and GPU focused on an algorithm's match to the GPU instruction set and the effect of changing arithmetic intensity [2]. The arithmetic intensity of an algorithm is the ratio of arithmetic operations to the number of memory accesses. The results in section 5 additionally consider algorithms which require data dependence, memory gather and flexible data reuse. Memory gather in this work requires an entire video frame to be used to produce a small number of output values, for example to produce a cumulative histogram.
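The arithmetic intensity metric reduces to a simple ratio. As a sketch, the per-pixel counts below for a 5 × 5 convolution are illustrative assumptions rather than figures measured from the benchmarks:

```python
def arithmetic_intensity(arith_ops, mem_accesses):
    """Ratio of arithmetic operations to memory accesses, per pixel."""
    if mem_accesses == 0:
        raise ValueError("memory accesses must be non-zero")
    return arith_ops / mem_accesses

# Illustrative per-pixel counts for a 5x5 convolution:
# 25 multiplies and 24 additions against 25 window reads and 1 write.
conv5 = arithmetic_intensity(arith_ops=25 + 24, mem_accesses=25 + 1)
```

By this measure the convolution sits below two operations per access, consistent with its "low" classification in Table 2, while an algorithm with a single access per pixel, such as primary colour correction, scores highly.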
These demonstrate higher-level characteristics of the FPGA and GPU architectures.

3. GPU Architecture

The GPU cores of the NVIDIA GeForce 6800 GT and 7800 GTX graphics platforms are considered in this work. Key GPU architecture features, relevant to this work, are explained below. The focus is on the fragment processing stage of the GPU pipeline. For further details and background on the entire GPU pipeline refer to [6].

The GPU pipeline exploits iteration-level parallelism through multiple fragment processors organised in groups of four (known as quads). The number of quads varies with the GPU model, as shown in Table 1.

| Model (Release Year) | Fragment Processors | Memory Bandwidth | Core Clock |
| GF 7800 GTX (2005) | 24 (6 quads) | GBytes/sec | MHz |
| GF 6800 GT (2004) | 16 (4 quads) | GBytes/sec | MHz |

Table 1. Specifications of the NVIDIA GeForce 6800 GT and 7800 GTX [5]

The GPU is scalable to multiple quads because of two constraints: no result sharing between fragment processors, and a feed-through pipeline. Without these constraints, multi-core scalability would be restricted by synchronisation requirements. The feed-through pipeline ensures that an external memory location can only be read from or written to during each processing pass. This means that some algorithms require a multi-pass implementation [3]; two examples are described in section 4. A further feature which simplifies the GPU programming model, and promotes scalability, is that all fragment processors execute the same program code, commonly referred to as a fragment shader.

To support the above features of the GPU pipeline, an efficient memory hierarchy is required. The components of the hierarchy are off-chip memory, GPU caches and output buffering. The memory requirements of graphics rendering demand the large memory bandwidth between the GPU and off-chip memory shown in Table 1. A shared cache is used to exploit data reuse between off-chip memory and the fragment processors. Literature suggests a 4-way associative 16 KByte cache [7].

4. Benchmark Algorithms

Table 2 characterises five algorithms which represent features common to video processing. Characteristics which present interesting design challenges for FPGA and GPU implementation are described below.

| Algorithm | Memory Accesses (per pixel) | Reuse Potential | Memory Access Pattern | Arithmetic Intensity |
| Bi-cubic interpolation | 16 | low-medium | predictable | low |
| Histogram equalisation (HE): calculation [of cumulative histogram] | 1 (gather) | null | predictable | medium |
| HE: application | 1 + 1 (lookup table) | null | predictable | low |
| 3-step non-full-search motion vector estimation (NFS-MVE) [1] | | high | locally random | high |
| Primary colour correction (PCCR) [2] | 1 | null | predictable | high |
| 2D convolution (size n × n) [2] | n² | high | predictable | low |

Table 2. Characterisation of a representative set of video processing algorithms

Bi-cubic interpolation can be used to resize a video frame to an arbitrarily higher resolution. This flexibility is indicated by the low-to-medium reuse potential in Table 2. An FPGA architecture exploits data reuse by using on-chip buffers to store rows of a video frame. This requires a complex heuristic to implement a flexible reuse strategy and thus perform arbitrary resizing. GPU implementation of flexible data reuse is straightforward because it is a feature supported by the GPU memory hierarchy.

Histogram equalisation (HE) is a special case of memory access which requires memory gather: HE calculation reduces an entire video frame to, in this example, a 256-bin cumulative histogram, typical of the 8-bit resolution of video data. A multi-pass GPU implementation is required: four rendering passes to calculate the histogram and a final pass to apply it to a video frame to equalise the intensity spectrum. The FPGA implementation requires a ROM decoder coupled with 256 parallel accumulators for HE calculation, and a lookup table for HE application.

NFS-MVE is a three-step algorithm which, on each step and for one search window, computes nine (a 3 × 3 grid of) sum of absolute difference (SAD) calculations with overlapping windows in a reference frame.
The reference frame window locations in steps 2 and 3 depend on a comparison of SAD results from previous steps. The access pattern for NFS-MVE is termed locally random because the memory access in steps 2 and 3 is data dependent within a spatially local area. GPU implementation of NFS-MVE also requires a multi-pass approach: each of the three steps requires three GPU passes, for a total of nine rendering passes. The FPGA implementation of NFS-MVE is taken from work by Sedcole [1]; implementation highlights were mentioned in section 2.

The primary colour correction and 2D convolution algorithms show examples of high arithmetic intensity and variable memory access requirements respectively [2]. For full implementation details and code for all the algorithms in Table 2, please contact the primary author.
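The three-step search described above can be sketched in software. The block size, step sizes (4, 2, 1) and synthetic frames below are illustrative assumptions, not details of the FPGA or GPU implementations:

```python
def sad(cur, ref, bx, by, dx, dy, bs):
    """Sum of absolute differences between the bs x bs block of the
    current frame at (bx, by) and the reference frame block offset
    by the candidate motion vector (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
               for y in range(bs) for x in range(bs))

def three_step_search(cur, ref, bx, by, bs=4):
    """Each step evaluates a 3 x 3 grid of candidate offsets around
    the current best match, then halves the step size."""
    h, w = len(ref), len(ref[0])
    best_dx = best_dy = 0
    best = sad(cur, ref, bx, by, 0, 0, bs)
    for step in (4, 2, 1):
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                cx, cy = best_dx + dx, best_dy + dy
                # Keep the candidate window inside the reference frame.
                if not (0 <= bx + cx and bx + cx + bs <= w
                        and 0 <= by + cy and by + cy + bs <= h):
                    continue
                cost = sad(cur, ref, bx, by, cx, cy, bs)
                if cost < best:
                    best, best_dx, best_dy = cost, cx, cy
    return best_dx, best_dy

# Synthetic frames: a bright 4 x 4 patch that moves by (2, 1)
# between the current frame and the reference frame.
ref = [[0] * 20 for _ in range(20)]
cur = [[0] * 20 for _ in range(20)]
for y in range(4):
    for x in range(4):
        ref[9 + y][10 + x] = 100   # patch at (10, 9) in the reference
        cur[8 + y][8 + x] = 100    # patch at (8, 8) in the current frame
mv = three_step_search(cur, ref, 8, 8)
```

The data-dependent refinement around the running best match is what makes the step-2 and step-3 accesses "locally random".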
5. Benchmarking the FPGA and GPU

[Figure 1. Throughput, in millions of pixels per second (MP/s), of the benchmarks on the Virtex 2, Virtex 4, GeForce 6800 GT and GeForce 7800 GTX: interpolation (variants A, B and C), 2D convolution (5 × 5, 9 × 9 and 11 × 11), primary colour correction, histogram equalisation and NFS-MVE.]

Performance results for the FPGA and GPU are shown in Figure 1. The FPGA implementations are coded in VHDL and synthesised using the Xilinx ISE 8.2i design suite; post place-and-route speeds are shown. GPU code is compiled using the NVIDIA Cg Compiler version 1.5. The GLSLProgram library, created at Imperial College London, is used to create a layer of abstraction between C++ code and the OpenGL API. CPU cycle counts are used to measure GPU throughput. All benchmarks are implemented by the primary author, with the exception of NFS-MVE on the FPGA [1]. The primary colour correction and 2D convolution results are updates of work published elsewhere [2].

An interesting result in Figure 1 is that the GeForce 7800 GTX implementation of primary colour correction outperforms the Virtex 4. This is the opposite of the result shown in prior work [2] comparing the then state-of-the-art GeForce 6800 GT and Virtex 2, and demonstrates the rapid performance improvement of modern GPUs.

For bi-cubic interpolation, the GeForce 7800 GTX shows superior performance of up to three times over the FPGA. GPUs are well suited to algorithms with arbitrary reuse patterns over a small window size (4 × 4 for bi-cubic interpolation) because this mimics graphics rendering. The FPGA speed is limited by the complex heuristic required for the reuse-pattern control logic. This is exemplified further by comparing the throughput of the FPGA interpolation and 2D convolution (size 5 × 5) implementations: 2D convolution has higher throughput despite the larger window size.
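The regular reuse that favours the FPGA for convolution can be sketched as a line-buffered stream, a software analogue of on-chip row buffers (not the VHDL implementation itself): only n rows are held at a time and each pixel is fetched from external memory once.

```python
from collections import deque

def convolve_stream(frame, kernel):
    """'Valid' 2D convolution over a row-streamed frame: the deque
    plays the role of n on-chip line buffers, so each output row is
    produced as soon as the nth input row arrives."""
    n = len(kernel)
    w = len(frame[0])
    rows = deque(maxlen=n)   # oldest buffered row is evicted automatically
    out = []
    for row in frame:        # pixels arrive one row at a time
        rows.append(row)
        if len(rows) == n:
            out.append([sum(rows[ky][x + kx] * kernel[ky][kx]
                            for ky in range(n) for kx in range(n))
                        for x in range(w - n + 1)])
    return out
```

For example, `convolve_stream([[1] * 4] * 4, [[1] * 3] * 3)` yields a 2 × 2 output of 9s. The fixed, predictable buffer schedule is exactly the regular reuse pattern Table 2 credits to convolution.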
The interpolation deficit arises from the complex data reuse control logic required to implement arbitrary bi-cubic interpolation on the FPGA.

For histogram equalisation (HE) the FPGA outperforms the GPU by over three times. This is due to the five-pass algorithm, described in section 4, required for GPU implementation. The four passes required for cumulative histogram generation take 92% of this execution time. This highlights a limitation of the GPU when the designer desires to gather data from an entire frame to, for example, generate a cumulative histogram. It is a result of the limitation that fragment processors cannot share computation results and, additionally, that each fragment processor cannot hold computation results between the pixels it processes. These limitations are explored further in section 6.4.

Data dependence for NFS-MVE is considered using two video test sources provided by the project sponsor. GPU performance results are indistinguishable for any degree of video motion. This is a product of the GPU's heritage of computer graphics rendering, which requires high performance for locally random memory accesses. NFS-MVE appears slow on the FPGA because the architecture is targeted at a desired throughput rate rather than maximal performance [1]. The performance of the GPU implementation of NFS-MVE is also low relative to the other algorithms, because of the large memory access requirements shown in Table 2.

A multi-pass GPU implementation is not always a problem. For HE a more intuitive implementation can be made on the FPGA; however, for NFS-MVE the GPU approach was shown to have the highest performance despite a nine-pass implementation. The requirement to implement a multi-pass approach is an acceptable sacrifice for the design and scalability benefits of the fixed GPU pipeline. A further interesting benefit of the GPU pipeline is its support for implementing arbitrary data reuse patterns.
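The gather that dominates the GPU's HE execution time can be made concrete with a single-pass behavioural sketch, the kind of computation the FPGA's 256 parallel accumulators and lookup table realise directly (a software illustration, not the hardware design):

```python
def equalise(frame, bins=256):
    """Histogram equalisation of an 8-bit frame: one gather pass
    reduces the whole frame to a cumulative histogram, a second
    pass applies it as a per-pixel lookup table."""
    total = len(frame) * len(frame[0])
    hist = [0] * bins
    for row in frame:            # gather: entire frame -> 256 bins
        for p in row:
            hist[p] += 1
    lut, cum = [], 0             # cumulative histogram scaled to 0..255
    for count in hist:
        cum += count
        lut.append(round((bins - 1) * cum / total))
    return [[lut[p] for p in row] for row in frame]
```

It is the first loop that the GPU cannot express in one pass: with no result sharing between fragment processors and no state held between pixels, the reduction must be spread over four rendering passes.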
[Figure 2. A multi-core model (PU = processing unit, MMU = memory management unit): a pattern generation module feeds n processing groups of m PUs each; per-group MMUs connect the PUs to on-chip memory and, through a global memory I/O, to off-chip memory; results are combined in an output buffer.]

6. A Multi-Core Architecture

This section explores the application of the GPU pipeline to a customisable multi-core architecture. In section 5 the GPU architecture was shown to be beneficial for implementing algorithms which require arbitrary data reuse, but a limitation when generating a cumulative histogram. It is desired to include a subset of GPU pipeline features in a model of the customisable multi-core architecture. Two GPU pipeline features, identified in section 3 as key to scalability, are a feed-through pipeline and no data sharing between fragment processors. The model's customisable options are: number of PUs (scalability); on-chip memory; processing pattern; and choice of processing unit (PU).

Section 6.1 describes the model. Three algorithms are used to verify the model and show scalability in section 6.2. In section 6.3 on-chip memory choice is explored through NFS-MVE. HE is used to explore the effect of a controlled processing pattern in section 6.4. Section 6.5 shows options for the choice of processing unit. General lessons for future video processing architectures are presented in section 6.6.

6.1. The Model and Initial Setup

The IEEE SystemC class library is chosen as the implementation platform. The flexibility of SystemC allows a high-level customisable multi-core model to be created. Figure 2 shows a high-level diagram of the model; it is a modified and simplified version of the fragment processing stage of the GPU pipeline [6]. The model and its differences from the GPU pipeline are described below. The multi-core hierarchy has m processing units (PUs) implemented in each of the n processing groups.
Each processing group processes a block of m × m pixels concurrently. For m = 4 and n = 4 this is similar to the GeForce 6800 GT fragment processor set-up. The feed-through pipeline is formed as follows: the pattern generation module supplies each processing group with the order in which to process pixels; the memory hierarchy is restricted such that data can only be fetched from one memory location and written to another; and outputs from the processing groups are combined in an output buffer.

[Figure 3. Cycle estimates (10^6) for PCCR, interpolation (variants A, B and C) and 2D convolution (5 × 5, 9 × 9 and 11 × 11) on the multi-core model for m = 4 and n = 4, 6, 8, alongside the GeForce 6800 GT trend.]

The pixel processing order is produced by the pattern generation module and can be set to any arbitrary sequence. This is in contrast to the GPU, where the pixel processing order is determined by the output of the vertex processing and rasterisation pipeline stages.

Processing units (PUs) connect to memory through a local memory management unit (MMU). Each MMU accesses pixels from on-chip memory for m PUs. If the pixels are not available in on-chip memory, the MMU arbitrates through the global I/O for off-chip memory access. Concurrent memory accesses by PUs are handled by the MMU; conflicts between MMUs are handled by the global I/O. A round-robin scheme is used for arbitration. This is similar to the method used for the texture unit (MMU) and fragment processors (PUs) in GPUs [8].

The initial on-chip memory setup is a 4-way associative 16k cache with 64-pixel cache lines. For verification purposes the PU model is a simple linear model of the GPU fragment processor, which models the execution time and memory access pattern of an algorithm. This allows high-level architecture exploration whilst keeping data-level parallelism within each PU fixed. The initial processing order is chosen to be an iterative z-pattern [9].
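The iterative z-pattern can be sketched as a Morton ordering, interleaving the bits of the x and y coordinates so that neighbouring pixels stay close in the address sequence (an illustrative reading of the pattern, not the implementation of [9]):

```python
def z_order(width, height):
    """Return all (x, y) pixel coordinates sorted into z-pattern
    (Morton) order: x bits on even positions, y bits on odd ones."""
    def morton(x, y):
        code = 0
        for bit in range(16):    # enough bits for frames up to 65536 wide
            code |= ((x >> bit) & 1) << (2 * bit)
            code |= ((y >> bit) & 1) << (2 * bit + 1)
        return code
    return sorted(((x, y) for y in range(height) for x in range(width)),
                  key=lambda c: morton(c[0], c[1]))
```

For a 2 × 2 block the order is (0,0), (1,0), (0,1), (1,1); recursing this pattern keeps each 4-pixel quad, and each quad of quads, contiguous in the address sequence, which is what makes it cache friendly.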
The z-pattern is a commonly used memory addressing pattern which exploits the spatial locality of neighbouring pixels.

6.2. Model Verification and Scalability

In this section the model described in section 6.1 is verified and its scalability demonstrated. This is achieved by implementing three algorithms from Table 2 on the model. Figure 3 shows results for m = 4 and n = 4, 6, 8 at a fixed frame size. The trend for the GeForce 6800 GT performance is plotted alongside these results. It can be seen that the trends of the model with m, n = 4 and of the GeForce 6800 GT are similar. The model with m, n = 4 has a higher cycle count than the GeForce 6800 GT for interpolation and 2D convolution. The reason is that the GPU fragment processors are highly multi-threaded to hide memory access latency; the model does not include multi-threading. However, the aim is to model high-level architecture features of the GPU and not the benefits of multi-threading.

Model variations with n = 6, 8 are shown to demonstrate scalability. Primary colour correction has high arithmetic intensity, and therefore its cycle count for n = 8 is almost half that for n = 4. The cycle count ratio for interpolation is 1.8 times, due to the behaviour of on-chip memory (cache) with increasing n: as n increases, the situation arises where one PU may remove a data item from cache when another PU needs to reuse it. 2D convolution also scales less than two times for a doubling of processing groups, an effect which becomes more prominent as convolution kernel size increases.

6.3. On-Chip Memory Flexibility

The NFS-MVE benchmark requires nine passes for implementation on the GPU. This nine-pass GPU method is used to implement NFS-MVE on the model for m = 4 and n = 1, 2, 4; the results are shown in row one of Table 3. NFS-MVE is a memory intensive algorithm, therefore on-chip memory (cache) size is key to performance. The results for two- and four-fold increases in cache size, keeping all other factors constant, are also shown. The linear processing unit model, from section 6.1, is maintained for all results.

| Cache changes (size) | Model (m=4, n=1) | Model (m=4, n=2) | Model (m=4, n=4) |
| Original (16k) | 4.8 (17k) | 3.6 (36k) | 2.1 (70k) |
| ×2 (32k) | 4.6 (12k) | 2.5 (17k) | 1.4 (30k) |
| ×4 (64k) | 4.3 (9k) | 2.2 (15k) | 1.1 (16k) |

Table 3. Model performance in clock cycles (10^6), with off-chip memory reads in parentheses, for nine-pass NFS-MVE over a video frame

Table 3 demonstrates a key feature of the shared memory structure of the architecture of Figure 2: the effect the processing units have on each other by fetching new pixels into the cache. For memory intensive algorithms, increasing the number of PUs by a factor of two will not achieve a two-fold performance improvement; in some cases the gain is much less. This is demonstrated by the differences in model performance with varying n above. As memory size is increased, the scaling of performance improvement with n also increases, because each processing unit now removes less data from cache that other PUs require. The off-chip memory read counts verify these observations, reducing in proportion to the performance improvement between changes in cache size.

6.4. A Controlled Processing Pattern

The benchmarks above have been implemented in the same manner as for the GPU. A benefit of the controllable processing pattern of the multi-core model is now shown. The performance of histogram equalisation (HE) on the GPU was shown in Figure 1 to be inferior to the FPGA, due to the GPU's inflexibility in implementing memory gather. An improved performance can be achieved on the model in Figure 2, over the GPU, because a PU can hold computation results between processed pixels. This is possible because the processing order is controllable.

Consider distributing the histogram calculation across multiple PUs in Figure 2. For n, m = 4 this equates to 16 accumulators in each processing unit to generate the 256-bin cumulative histogram. Each PU must now access the entire frame of pixels for the full histogram to be computed, which can be achieved through correct choice of processing order. The result for HE calculation is shown in Table 4.
Table 4 also shows the implementation of the original four-pass GPU technique on the model for comparison. For a fair comparison the linear PU model of section 6.2 is maintained for both model implementations; the improved method therefore performs the accumulations in series.

| | Model | GF 6800 GT |
| Original (four pass) | | |
| Improvement (single pass) | | |

Table 4. Model performance in clock cycles (10^7) for HE calculation over a frame

To verify the improved model implementation, a frame sixteen times larger than the target frame size is rendered on the GeForce 6800 GT. This is used to model memory access and performance by modelling the compute latency of 16 accumulators on the GPU; the result is shown in Table 4. The improved implementation on the multi-core architecture provides a two-fold improvement over the original method. This is achieved by removing the multiple passes of the original HE design, keeping all other factors constant.

6.5. Processing Unit (PU) Choice

The above analysis uses a fixed PU model to maintain an equal level of parallelism for each PU. Different PU options are now discussed by comparing the choice of a fragment processor PU with a reconfigurable datapath PU. A fragment processor can perform at peak eight (two 4-vector) FLOPs per cycle [6]. It exploits no inter-pixel data reuse and has a high core clock rate, as shown in Table 1. For reconfigurable datapaths all of these features are variables. The resource usage for example FPGA (reconfigurable datapath) implementations, taken from section 5, is shown in Table 5. The number of BRAMs indicates the degree of data reuse; slice usage indicates the amount of parallelism. The clock rates for the FPGA implementations are lower than for a fragment processor (clock rate equals FPGA throughput in Figure 1).

| Benchmark | Slices | BRAMs |
| Bi-cubic interpolation | 2.7k | 24 |
| 2D convolution | | 12 |
| 2D convolution | | 24 |
| 2D convolution | | 30 |
| Primary colour correction | 3.6k | 0 |
| Histogram equalisation (HE) | 10.6k | 0 |

Table 5. Slice count and block RAM usage for benchmark algorithms on the FPGA

The first observation is a tradeoff between the high clock rate and fixed parallelism of a fragment processor against the arbitrary data reuse and parallelism of a reconfigurable datapath. These are characteristics of a hardware-software tradeoff.

Fragment processors are designed to be scalable, as outlined in section 3. For a reconfigurable datapath PU, scalability depends on whether and how data reuse is exploited. The primary colour correction and HE implementations on a reconfigurable datapath exploit no data reuse, which makes them inherently scalable. For 2D convolution and interpolation the data reuse strategy must be considered: for scalability, subsequent frames (or regions of a frame) must be assigned to different PUs, which can be implemented by controlling the processing pattern.

The design effort to implement the interpolation and 2D convolution algorithms in a reconfigurable datapath must be traded against their data reuse and parallelism benefits. The requirement for a complex heuristic to implement flexible data reuse in a reconfigurable datapath makes a fragment processor PU the suitable option for interpolation. The extra parallelism available from a reconfigurable datapath makes it the most desirable for HE.
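The tradeoff between a fragment processor's clock rate and a reconfigurable datapath's parallelism can be put in first-order terms. All figures below are illustrative assumptions except the eight FLOPs per cycle quoted above from [6]; memory stalls are ignored:

```python
def throughput_mps(clock_mhz, ops_per_cycle, ops_per_pixel):
    """First-order pixel throughput in MP/s: clock rate times
    per-cycle parallelism, divided by the work each pixel needs."""
    return clock_mhz * ops_per_cycle / ops_per_pixel

# A hypothetical 49-operation kernel (e.g. a fully unrolled 5x5 filter):
# the fragment processor relies on clock rate with fixed parallelism,
# the reconfigurable datapath trades a lower clock for full unrolling.
fragment = throughput_mps(clock_mhz=400, ops_per_cycle=8, ops_per_pixel=49)
datapath = throughput_mps(clock_mhz=100, ops_per_cycle=49, ops_per_pixel=49)
```

Under these assumptions the lower-clocked datapath still wins; once the datapath also exploits data reuse, its effective accesses per pixel fall as well, which is the scalability lever discussed above.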
For 2D convolution a regular data reuse pattern can be exploited, which also makes the reconfigurable datapath PU the superior option.

6.6. Generalisations from Model Changes

It has been shown that a subset of GPU pipeline features can be applied to a customisable multi-core model. A feed-through pipeline and no data sharing between processing units (PUs) are features key to multi-core scalability. Customisation of the number of PUs, memory organisation and processing pattern are necessary variables for a multi-core architecture. The generality of the multi-core architecture means that the option of a reconfigurable datapath PU can be considered. The choice between a fragment processor and a reconfigurable datapath PU is a tradeoff between data reuse (which influences scalability) and parallelism.

7. Conclusion

Five video processing algorithms have been implemented on the FPGA and GPU. Algorithms which require a regular data reuse pattern perform well on the FPGA; for irregular reuse patterns the GPU outperforms the FPGA. Histogram equalisation (HE) and NFS-MVE are two algorithms which require multi-pass GPU implementations. Although the GPU performance is 50 MP/s, an FPGA implementation of HE outperforms the GPU by three times.

A customisable multi-core model based on the GPU pipeline has been implemented using the SystemC library, and features of the GPU pipeline and model customisations have been explored through it. A feed-through pipeline and no data sharing between PUs are two GPU pipeline features which are key to multi-core scalability. It is concluded that a multi-core architecture can be optimised for video processing by combining a GPU pipeline with cores that support reconfigurable datapath operations. Future work will involve exploration of other multi-core architecture features, such as those of the STI Cell Broadband Engine and GeForce 8800 GTX.
Acknowledgements

We gratefully acknowledge the support provided by the UK Engineering and Physical Sciences Research Council (EP/C549481/1) and Sony Broadcast & Professional Europe, and thank Jay Cornwall and Lee Howes from Imperial College London for providing GPU programming advice and the GLSLProgram class library.

References

[1] P. Sedcole. Reconfigurable platform-based design in FPGAs for video image processing. PhD thesis, University of London, 2006.
[2] B. Cope, P. Y. K. Cheung, W. Luk and S. Witt. Have GPUs made FPGAs redundant in the field of video processing? In Proc. Field Programmable Technology, Dec. 2005.
[3] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn and T. J. Purcell. A survey of general-purpose computation on graphics hardware. In Eurographics 2005, State of the Art Reports, Aug. 2005.
[4] P. Colantoni, N. Boukala and J. D. Rugna. Fast and accurate color image processing using 3D graphics cards. In Proc. Vision, Modeling and Visualization, 2003.
[5] NVIDIA Corporation.
[6] M. Pharr and R. Fernando. GPU Gems 2. Addison Wesley, 2005.
[7] V. Moya, C. Gonzalez, J. Roca and A. Fernandez. Shader performance analysis on a modern GPU architecture. In 38th Annual IEEE/ACM International Symposium on Microarchitecture, 2005.
[8] J. Lindholm, J. Nickolls, S. Moy and B. Coon. Register based queuing for texture requests. United States Patent US 7,027,062 B2, 2006.
[9] C. Priem, G. Solanki and D. Kirk. Texture cache for a computer graphics accelerator. United States Patent US 7,136,068 B1, 2006.
Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture
More informationAccelerating CFD with Graphics Hardware
Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery
More informationGPU Accelerating Speeded-Up Robust Features Timothy B. Terriberry, Lindley M. French, and John Helmsen
GPU Accelerating Speeded-Up Robust Features Timothy B. Terriberry, Lindley M. French, and John Helmsen Overview of ArgonST Manufacturer of integrated sensor hardware and sensor analysis systems 2 RF, COMINT,
More informationHiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.
HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation
More informationFrom Brook to CUDA. GPU Technology Conference
From Brook to CUDA GPU Technology Conference A 50 Second Tutorial on GPU Programming by Ian Buck Adding two vectors in C is pretty easy for (i=0; i
More informationCustomisable EPIC Processor: Architecture and Tools
Customisable EPIC Processor: Architecture and Tools W.W.S. Chu, R.G. Dimond, S. Perrott, S.P. Seng and W. Luk Department of Computing, Imperial College London 180 Queen s Gate, London SW7 2BZ, UK Abstract
More informationSupporting Multithreading in Configurable Soft Processor Cores
Supporting Multithreading in Configurable Soft Processor Cores Roger Moussali, Nabil Ghanem, and Mazen A. R. Saghir Department of Electrical and Computer Engineering American University of Beirut P.O.
More informationThreading Hardware in G80
ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &
More informationVertex Shader Design I
The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationGPUs and GPGPUs. Greg Blanton John T. Lubia
GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware
More informationOverview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size
Overview Videos are everywhere But can take up large amounts of resources Disk space Memory Network bandwidth Exploit redundancy to reduce file size Spatial Temporal General lossless compression Huffman
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationOverview of ROCCC 2.0
Overview of ROCCC 2.0 Walid Najjar and Jason Villarreal SUMMARY FPGAs have been shown to be powerful platforms for hardware code acceleration. However, their poor programmability is the main impediment
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationIs There A Tradeoff Between Programmability and Performance?
Is There A Tradeoff Between Programmability and Performance? Robert Halstead Jason Villarreal Jacquard Computing, Inc. Roger Moussalli Walid Najjar Abstract While the computational power of Field Programmable
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationCustom computing systems
Custom computing systems difference engine: Charles Babbage 1832 - compute maths tables digital orrery: MIT 1985 - special-purpose engine, found pluto motion chaotic Splash2: Supercomputing esearch Center
More informationGPU Computation Strategies & Tricks. Ian Buck NVIDIA
GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit
More informationHardware Acceleration of Edge Detection Algorithm on FPGAs
Hardware Acceleration of Edge Detection Algorithm on FPGAs Muthukumar Venkatesan and Daggu Venkateshwar Rao Department of Electrical and Computer Engineering University of Nevada Las Vegas. Las Vegas NV
More informationComparison of High-Speed Ray Casting on GPU
Comparison of High-Speed Ray Casting on GPU using CUDA and OpenGL November 8, 2008 NVIDIA 1,2, Andreas Weinlich 1, Holger Scherl 2, Markus Kowarschik 2 and Joachim Hornegger 1 1 Chair of Pattern Recognition
More informationPower Profiling and Optimization for Heterogeneous Multi-Core Systems
Power Profiling and Optimization for Heterogeneous Multi-Core Systems Kuen Hung Tsoi and Wayne Luk Department of Computing, Imperial College London {khtsoi, wl}@doc.ic.ac.uk ABSTRACT Processing speed and
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationWhat s New with GPGPU?
What s New with GPGPU? John Owens Assistant Professor, Electrical and Computer Engineering Institute for Data Analysis and Visualization University of California, Davis Microprocessor Scaling is Slowing
More informationThe future is parallel but it may not be easy
The future is parallel but it may not be easy Michael J. Flynn Maxeler and Stanford University M. J. Flynn 1 HiPC Dec 07 Outline I The big technology tradeoffs: area, time, power HPC: What s new at the
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationReal-Time Rendering Architectures
Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand
More informationEfficient Stream Reduction on the GPU
Efficient Stream Reduction on the GPU David Roger Grenoble University Email: droger@inrialpes.fr Ulf Assarsson Chalmers University of Technology Email: uffe@chalmers.se Nicolas Holzschuch Cornell University
More informationA SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization
A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization Jordi Roca Victor Moya Carlos Gonzalez Vicente Escandell Albert Murciego Agustin Fernandez, Computer Architecture
More informationScalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA
Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationImproved Integral Histogram Algorithm. for Big Sized Images in CUDA Environment
Contemporary Engineering Sciences, Vol. 7, 2014, no. 24, 1415-1423 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.49174 Improved Integral Histogram Algorithm for Big Sized Images in CUDA
More informationHammer Slide: Work- and CPU-efficient Streaming Window Aggregation
Large-Scale Data & Systems Group Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation Georgios Theodorakis, Alexandros Koliousis, Peter Pietzuch, Holger Pirk Large-Scale Data & Systems (LSDS)
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationA Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications Jeremy Fowers, Greg Brown, Patrick Cooke, Greg Stitt University of Florida Department of Electrical and
More informationXIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture
XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics
More informationProfiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency
Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu
More informationSpeed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU
Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU Ke Ma 1, and Yao Song 2 1 Department of Computer Sciences 2 Department of Electrical and Computer Engineering University of Wisconsin-Madison
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationAdvanced Deferred Rendering Techniques. NCCA, Thesis Portfolio Peter Smith
Advanced Deferred Rendering Techniques NCCA, Thesis Portfolio Peter Smith August 2011 Abstract The following paper catalogues the improvements made to a Deferred Renderer created for an earlier NCCA project.
More informationGeneral Purpose GPU Programming. Advanced Operating Systems Tutorial 9
General Purpose GPU Programming Advanced Operating Systems Tutorial 9 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous
More informationCOE 561 Digital System Design & Synthesis Introduction
1 COE 561 Digital System Design & Synthesis Introduction Dr. Aiman H. El-Maleh Computer Engineering Department King Fahd University of Petroleum & Minerals Outline Course Topics Microelectronics Design
More informationWindowing System on a 3D Pipeline. February 2005
Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April
More informationA Reconfigurable Architecture for Load-Balanced Rendering
A Reconfigurable Architecture for Load-Balanced Rendering Jiawen Chen Michael I. Gordon William Thies Matthias Zwicker Kari Pulli Frédo Durand Graphics Hardware July 31, 2005, Los Angeles, CA The Load
More informationOptimisation Myths and Facts as Seen in Statistical Physics
Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY
More informationEfficient Scan-Window Based Object Detection using GPGPU
Efficient Scan-Window Based Object Detection using GPGPU Li Zhang and Ramakant Nevatia University of Southern California Institute of Robotics and Intelligent Systems {li.zhang nevatia}@usc.edu Abstract
More informationIntroduction to FPGA Design with Vivado High-Level Synthesis. UG998 (v1.0) July 2, 2013
Introduction to FPGA Design with Vivado High-Level Synthesis Notice of Disclaimer The information disclosed to you hereunder (the Materials ) is provided solely for the selection and use of Xilinx products.
More informationReal-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis
Real-Time Reyes: Programmable Pipelines and Research Challenges Anjul Patney University of California, Davis Real-Time Reyes-Style Adaptive Surface Subdivision Anjul Patney and John D. Owens SIGGRAPH Asia
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationImaging Sensor with Integrated Feature Extraction Using Connected Component Labeling
Imaging Sensor with Integrated Feature Extraction Using Connected Component Labeling M. Klaiber, S. Ahmed, M. Najmabadi, Y.Baroud, W. Li, S.Simon Institute for Parallel and Distributed Systems, Department
More informationInterleaved Pixel Lookup for Embedded Computer Vision
Interleaved Pixel Lookup for Embedded Computer Vision Kota Yamaguchi, Yoshihiro Watanabe, Takashi Komuro, and Masatoshi Ishikawa Graduate School of Information Science and Technology, The University of
More informationFPGA BASED ACCELERATION OF THE LINPACK BENCHMARK: A HIGH LEVEL CODE TRANSFORMATION APPROACH
FPGA BASED ACCELERATION OF THE LINPACK BENCHMARK: A HIGH LEVEL CODE TRANSFORMATION APPROACH Kieron Turkington, Konstantinos Masselos, George A. Constantinides Department of Electrical and Electronic Engineering,
More informationAccelerating high-end compositing with CUDA in NUKE. Jon Wadelton NUKE Product Manager
Accelerating high-end compositing with CUDA in NUKE Jon Wadelton NUKE Product Manager 2 Overview What is NUKE? Image processing - exploiting the GPU The Foundry Approach Simple examples in NUKEX Real world
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationAlignment invariant image comparison implemented on the GPU
Alignment invariant image comparison implemented on the GPU Hans Roos Highquest, Johannesburg hans.jmroos@gmail.com Yuko Roodt Highquest, Johannesburg yuko@highquest.co.za Willem A. Clarke, MIEEE, SAIEE
More informationSurvey and future trends of efficient cryptographic function implementations on
Edith Cowan University Research Online Australian Digital Forensics Conference Security Research Institute Conferences 2008 Survey and future trends of efficient cryptographic function implementations
More informationParallel graph traversal for FPGA
LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,
More informationVirtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili
Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed
More informationFrom Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)
From Shader Code to a Teraflop: How GPU Shader Cores Work Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) 1 This talk Three major ideas that make GPU processing cores run fast Closer look at real
More informationX. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1
X. GPU Programming 320491: Advanced Graphics - Chapter X 1 X.1 GPU Architecture 320491: Advanced Graphics - Chapter X 2 GPU Graphics Processing Unit Parallelized SIMD Architecture 112 processing cores
More informationPerformance Optimization Part II: Locality, Communication, and Contention
Lecture 7: Performance Optimization Part II: Locality, Communication, and Contention Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Beth Rowley Nobody s Fault but Mine
More informationCMPE 665:Multiple Processor Systems CUDA-AWARE MPI VIGNESH GOVINDARAJULU KOTHANDAPANI RANJITH MURUGESAN
CMPE 665:Multiple Processor Systems CUDA-AWARE MPI VIGNESH GOVINDARAJULU KOTHANDAPANI RANJITH MURUGESAN Graphics Processing Unit Accelerate the creation of images in a frame buffer intended for the output
More informationArchitecture and Design of Shared Memory Multi- QueueCore Processor
University of Aizu, Graduation Thesis. March, 2011 s1150059 1 Architecture and Design of Shared Memory Multi- QueueCore Processor Shunichi Kato s1150059 Supervised by Prof. Ben Abdallah Abderazek Abstract
More informationTowards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing
Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Walter Stechele, Stephan Herrmann, Andreas Herkersdorf Technische Universität München 80290 München Germany Walter.Stechele@ei.tum.de
More informationB. Tech. Project Second Stage Report on
B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationHigh Speed Special Function Unit for Graphics Processing Unit
High Speed Special Function Unit for Graphics Processing Unit Abd-Elrahman G. Qoutb 1, Abdullah M. El-Gunidy 1, Mohammed F. Tolba 1, and Magdy A. El-Moursy 2 1 Electrical Engineering Department, Fayoum
More informationGPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA
GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit
More informationAiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR.
2015; 2(2): 201-209 IJMRD 2015; 2(2): 201-209 www.allsubjectjournal.com Received: 07-01-2015 Accepted: 10-02-2015 E-ISSN: 2349-4182 P-ISSN: 2349-5979 Impact factor: 3.762 Aiyar, Mani Laxman Dept. Of ECE,
More informationGPU-accelerated Verification of the Collatz Conjecture
GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,
More informationData parallel algorithms, algorithmic building blocks, precision vs. accuracy
Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel
More informationGraphics Processing Unit Architecture (GPU Arch)
Graphics Processing Unit Architecture (GPU Arch) With a focus on NVIDIA GeForce 6800 GPU 1 What is a GPU From Wikipedia : A specialized processor efficient at manipulating and displaying computer graphics
More informationSpecializing Hardware for Image Processing
Lecture 6: Specializing Hardware for Image Processing Visual Computing Systems So far, the discussion in this class has focused on generating efficient code for multi-core processors such as CPUs and GPUs.
More informationGeneral Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)
ME 290-R: General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing) Sara McMains Spring 2009 Lecture 7 Outline Last time Visibility Shading Texturing Today Texturing continued
More informationLarge-Scale Network Simulation Scalability and an FPGA-based Network Simulator
Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationCS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST
CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter
More informationGPU for HPC. October 2010
GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,
More information