Efficient Parallel Simulation of an Individual-Based Fish Schooling Model on a Graphics Processing Unit

Hong Li, Department of Computer Science, University of California, Santa Barbara, CA (hongli@cs.ucsb.edu)
Allison Kolpas, Department of Mathematics, University of California, Santa Barbara, CA (allie@math.ucsb.edu)
Linda Petzold, Department of Computer Science and Department of Mechanical Engineering, University of California, Santa Barbara, CA (petzold@cs.ucsb.edu)
Jeff Moehlis, Department of Mechanical Engineering, University of California, Santa Barbara, CA (moehlis@engineering.ucsb.edu)

April 8, 2008

Abstract

Due to their low cost and high-performance processing capabilities, graphics processing units (GPUs) have become an attractive alternative to clusters for some scientific computing applications. In this paper we show how stochastic simulation of an individual-based fish schooling model can be efficiently carried out on a general-purpose GPU. We describe our implementation and present computational results to illustrate the power of this new technology.

1 Introduction

Driven by the video gaming industry, the Graphics Processing Unit (GPU) has evolved into an inexpensive yet powerful high-speed computing device for scientific applications. The GPU has a highly parallel structure, high memory bandwidth, and more transistors devoted to data processing, rather than to data caching and flow control, than a CPU [8]. Problems that can be implemented with stream processing and limited memory are well suited to the GPU. Single Instruction Multiple Data (SIMD) computation, in which a large number of completely independent records are processed by the same sequence of operations simultaneously, is an ideal type of GPU application. In recent years, computation on the general-purpose GPU (GPGPU) has become an active research field, with a wide range of applications including cellular automata, particle systems, fluid dynamics, and computational geometry [3, 10, 5, 4]. Previous generations of GPUs required non-graphics applications to be recast as graphics computations through a graphics application programming interface (API). Last year, NVIDIA released the Compute Unified Device Architecture (CUDA) toolkit for its GPUs, providing general-purpose functionality with a C-like language for non-graphics applications.

Fish schooling is an important example of self-organized collective motion in animal groups. The collective behavior of the group emerges from local interactions of individuals with their neighbors, without any leader, template, or other external cue. Such groups can be composed of hundreds to millions of members, with all individuals responding rapidly to their neighbors to maintain the collective motion.

In this paper we show how long-time stochastic simulations of an individual-based fish schooling model can be efficiently performed on a CUDA-enabled GPU. In the schooling model, each organism or agent is treated individually, with rules specifying its dynamics and interactions with other agents. Noise is included to account for imperfect sensing and processing. For different values of the parameters, different schooling behaviors emerge, including a swarm state, in which individuals move incoherently about a central location, and a mobile state, in which individuals travel in an aligned, polarized group. A large number of realizations is necessary to accurately determine the statistics of the collective motion.

Simulations of the schooling model carry a high computational cost and can benefit from parallel processing. In the model, each agent updates its direction of travel based on the positions and directions of travel of all other agents. This computation can be performed in parallel across individuals within a single realization. In addition, the realizations themselves can be performed in parallel. The architecture of the GPU is well suited to both types of parallel processing.

In the following, we first describe the details of the individual-based fish schooling model. We then review the features of the GPU and show how parallel processing, both across individuals within a single realization and across realizations, can be implemented on a CUDA-enabled GPU to efficiently perform ensemble long-time simulations of the schooling model.

2 Fish Schooling Model

Many organisms move and travel together in self-organizing groups, including flocks of birds, schools of fish, and swarms of locusts [1]. Individual-based models (IBMs) are frequently used to describe the dynamics of such groups, since they can incorporate biologically realistic social interactions and behavioral responses, as well as relate individual-level behaviors to emergent population-level dynamics.

Here we consider a two-dimensional individual-based model for fish schooling. This model is similar to that considered in [2], but without an informed leader and with different weights of the orientation and attraction responses. Groups are composed of N individuals with positions p_i(t), unit directions \hat{v}_i(t), constant speed s, and maximum turning rate \theta. At every time step of size \tau, individuals simultaneously determine a new direction of travel by considering neighbors within two behavioral zones: a zone of repulsion of radius r_r about the individual, and a zone of orientation and attraction with inner radius r_r and outer radius r_p. The latter includes a blind area, defined as a circular sector with central angle (2\pi - \eta), within which neighbors are undetectable. These zones are used to define behavioral rules of motion.

First, if individual i finds agents within its zone of repulsion, it moves away from them by orienting its direction away from their average relative direction. Its desired direction of travel in the next time step is given by

    v_i(t + \tau) = -\sum_{j \neq i} \frac{p_j(t) - p_i(t)}{|p_j(t) - p_i(t)|},    (1)

and normalized as \hat{v}_i(t + \tau) = v_i(t + \tau) / |v_i(t + \tau)|, assuming v_i(t + \tau) \neq 0. If v_i(t + \tau) = 0, agent i maintains its previous direction of travel, giving \hat{v}_i(t + \tau) = \hat{v}_i(t).

If no agents are found within individual i's zone of repulsion, then it will align with (by averaging the directions of travel of itself and its neighbors) and feel an attraction towards (by orienting itself towards the average relative directions of) agents within the zone of orientation and attraction. Its desired direction of travel is given by the weighted sum of two terms:

    v_i(t + \tau) = \omega_a \sum_{j \neq i} \frac{p_j(t) - p_i(t)}{|p_j(t) - p_i(t)|} + \omega_o \sum_{j} \hat{v}_j(t),    (2)

where \omega_a and \omega_o are the weights of attraction and orientation, respectively, the attraction sum runs over the neighbors detected within the zone, and the orientation sum runs over agent i and its detected neighbors. This vector is normalized, assuming v_i(t + \tau) \neq 0. If v_i(t + \tau) = 0, then agent i maintains its previous direction of travel.

Noise effects are incorporated by rotating agent i's desired direction \hat{v}_i(t + \tau) by an angle drawn from a circularly wrapped normal distribution with mean 0 and standard deviation \sigma. Also, since individuals can only turn \theta\tau radians in one time step, if the angle between \hat{v}_i(t) and \hat{v}_i(t + \tau) is greater than \theta\tau, individuals do not achieve their desired direction and instead rotate \theta\tau towards it. Finally, each agent's position is updated simultaneously as

    p_i(t + \tau) = p_i(t) + s \hat{v}_i(t + \tau) \tau.    (3)

To begin a simulation, individuals are placed in a bounded region with random positions and directions of travel. Simulations are run for approximately 3000 time steps.

The fish schooling model is very well suited to the GPU because of its high arithmetic intensity, relatively simple data structure needs, and its complete data parallelism across ensemble simulations. One may also quite easily assess the performance and accuracy of simulations on the GPU by comparing with results from the host workstation.
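To make the update rule concrete, the following is a minimal C-style sketch of one agent's direction update described by Eqs. (1)-(3). It is illustrative only: the type and function names (Agent, update_heading), the parameter names, and the convention that the caller supplies the wrapped-normal noise angle are all assumptions, and the blind area is omitted for brevity.

```cuda
#include <math.h>

typedef struct { float px, py, vx, vy; } Agent;   /* position and unit heading */

/* Compute agent i's new heading for one time step: repulsion rule (Eq. 1),
 * orientation/attraction rule (Eq. 2), rotation by a noise angle, and the
 * maximum-turn-rate limit. Blind area omitted; names are illustrative. */
void update_heading(const Agent *a, int N, int i,
                    float rr, float rp, float wa, float wo,
                    float noise, float max_turn,      /* max_turn = theta * tau */
                    float *new_vx, float *new_vy)
{
    float dx = 0.0f, dy = 0.0f;
    int repelled = 0;

    /* Zone of repulsion: steer away from very close neighbors (Eq. 1). */
    for (int j = 0; j < N; ++j) {
        if (j == i) continue;
        float rx = a[j].px - a[i].px, ry = a[j].py - a[i].py;
        float d = sqrtf(rx * rx + ry * ry);
        if (d > 0.0f && d < rr) { dx -= rx / d; dy -= ry / d; repelled = 1; }
    }

    if (!repelled) {
        /* Zone of orientation and attraction (Eq. 2); include the agent's own
         * heading in the orientation average, as described in the text. */
        dx += wo * a[i].vx;  dy += wo * a[i].vy;
        for (int j = 0; j < N; ++j) {
            if (j == i) continue;
            float rx = a[j].px - a[i].px, ry = a[j].py - a[i].py;
            float d = sqrtf(rx * rx + ry * ry);
            if (d >= rr && d < rp) {
                dx += wa * rx / d + wo * a[j].vx;
                dy += wa * ry / d + wo * a[j].vy;
            }
        }
    }

    /* Normalize, or keep the old heading if the desired direction is zero. */
    float len = sqrtf(dx * dx + dy * dy);
    if (len == 0.0f) { dx = a[i].vx; dy = a[i].vy; }
    else             { dx /= len;    dy /= len;    }

    /* Rotate the desired heading by the noise angle. */
    float c = cosf(noise), s = sinf(noise);
    float nx = c * dx - s * dy, ny = s * dx + c * dy;

    /* Limit the turn to at most max_turn radians per step. */
    float turn = atan2f(a[i].vx * ny - a[i].vy * nx, a[i].vx * nx + a[i].vy * ny);
    if (fabsf(turn) > max_turn) {
        float t  = (turn > 0.0f) ? max_turn : -max_turn;
        float ct = cosf(t), st = sinf(t);
        nx = ct * a[i].vx - st * a[i].vy;
        ny = st * a[i].vx + ct * a[i].vy;
    }
    *new_vx = nx;  *new_vy = ny;
}
```

The position update of Eq. (3) then simply advances p_i by s * \hat{v}_i(t + \tau) * \tau once every agent's new heading has been computed.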

3 The Graphics Processing Unit: A New Data-Parallel Computing Device

The GPU was originally designed as a dedicated graphics rendering device for workstations and PCs. Driven by the economic pressure of the fast-growing interactive entertainment industry, the GPU has evolved rapidly into a very powerful general-purpose computing device (see Figure 1 [8]). The modern GPU is small enough to fit into most desktops and workstations, creating a supercomputer on the desktop, and it can be programmed with a high-level language. Although it currently supports only 32-bit floating point precision, that will change soon.

The GPU is best suited to parallel processing applications: stream-processing computations with high floating-point arithmetic intensity and a high ratio of computation to memory accesses [6]. It is especially well suited to SIMD applications, since the computation is able to hide the memory access latency. But it is still a specialized processor. It is not as efficient for applications that are memory-access intensive, require double precision, or involve many logical operations on integer data or many branches [6].

Figure 1: Observed peak GFLOPS on the GPU, compared with theoretical peak GFLOPS on the CPU [8].

The NVIDIA 8800 GTX that we use belongs to the new generation of GPUs built around stream processing, rather than the pipelining that was characteristic of previous generations. Its 480 mm^2 chip contains 681 million transistors, implementing 128 stream processors grouped into 16 multiprocessors (8 streaming processors in each multiprocessor), as shown in Figure 2 [8], and the card carries 768 MB of RAM.

The 8 processors in one multiprocessor share a 16 KB low-latency shared memory, which brings data closer to the arithmetic units. The maximum observed bandwidth between system and device memory is about 2 GB/second. The global memory can be accessed at a similar rate, but with much higher latency.

Figure 2: Hardware model: a set of SIMD multiprocessors with on-chip shared memory [8].

Until recently, researchers interested in employing GPUs for scientific computing applications had to use a graphics application programming interface (API) such as OpenGL. They would first recast their model into the graphics API and then trick the GPU into running it as a graphics code. This very uncommon programming pattern made migrating non-graphics applications onto the GPU a significant challenge. Just last year, NVIDIA introduced the CUDA Software Development Kit (SDK), which supplies an essential high-level development environment for general-purpose computation on NVIDIA GPUs. This minimizes the learning curve for accessing the low-level hardware and gives the user much more development flexibility than graphics programming languages [8].
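For readers unfamiliar with CUDA, the generic example below shows what the C-like kernel syntax and the block/thread launch configuration look like. It is not taken from our simulation code; the kernel name, array, and sizes are illustrative.

```cuda
#include <cuda_runtime.h>

/* A trivial kernel: each thread scales one element of an array in device memory. */
__global__ void scale(float *data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

int main(void)
{
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *dev;
    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<8, 128>>>(dev, 2.0f, n);   /* 8 blocks of 128 threads each */

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return 0;
}
```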

A productive way of computing on the device is to maximize the use of the fast but small shared memory and to minimize accesses to the slow global memory. To do this, we must first develop a strategy to make the data fit into the limited shared memory. In our computation, to obtain memory-level parallelism within each block, the threads first load their target agents' information from global memory into shared memory simultaneously, and then perform the computations involving the target agents using the data in shared memory. After this computation, each thread copies the data for its target agent back to global memory in parallel. A kernel is able to run on many thread blocks in parallel without any communication among the blocks, which is important because cooperation among different thread blocks is not yet supported. Within one thread block, all of the threads can synchronize through the hazard-free shared memory.

4 Parallelization

There are two ways to parallelize the simulation. One is to parallelize across realizations: independent realizations with different initial conditions are performed in parallel. Such realizations are uncoupled and therefore do not need to communicate with each other; each realization communicates with the host only to obtain its initial data and to transfer its results at the end of the simulation. The other is to parallelize within a single realization: the domain of state variables is partitioned into smaller sub-domains, and all the threads cooperate with each other while manipulating their own sub-domains in parallel to advance the simulation. We use both methods of parallelization for the fish schooling model. Within a single realization, the different sub-domains of the fish schooling model are tightly coupled.

Even for an application that is ideal for the GPU architecture, the user still needs to deal carefully with the limited shared memory and the large latency of the global memory. Efficient concurrent use of the shared memory and global memory can supply data to the arithmetic units at a reasonable rate and results in good performance. We will show how our computation for the fish schooling model can be structured to fit within the constraints of the GPU architecture.

4.1 Parallelization Across Realizations

Parallelization across realizations is a straightforward and effective way to improve performance for this application. Using multiple blocks for ensemble simulations keeps the GPU's large arithmetic capacity fully used and hides the shared and global memory access latency, achieving good performance. First, the kernel instructions are distributed among thread blocks, each of which is in charge of a single realization. The system state for the fish schooling model is defined as Px_i^j(t), Py_i^j(t), Vx_i^j(t), Vy_i^j(t), where Px_i^j(t), Py_i^j(t) are the x and y components of the position of a fish agent, Vx_i^j(t), Vy_i^j(t) are the x and y components of its velocity, i is the block id, j is the id of the fish agent, and t represents time. Each thread block with id i = c computes on the subset Px_c^j(t), Py_c^j(t), Vx_c^j(t), Vy_c^j(t) and generates its own initial state variables Px_c^j(0), Py_c^j(0), Vx_c^j(0), Vy_c^j(0) using the parallel Mersenne Twister (MT) random number generator [7, 9]. The desired results are stored in the final state vectors Px_i^j(t_f), Py_i^j(t_f), Vx_i^j(t_f), Vy_i^j(t_f) after t_f simulation steps. We use an intermediate data structure on the device to minimize the transfers between the host and the device, and we group several small transfers into one large transfer to reduce the per-transfer overhead.
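A minimal sketch of this across-realization decomposition follows, assuming one block per realization and one thread per agent. The kernel and array names are illustrative, the initial conditions are assumed to be prepared on the host rather than with the per-thread Mersenne Twister used in our implementation, and the per-agent update is left abstract.

```cuda
/* One realization per thread block: block b advances agents b*N .. b*N+N-1.
 * Px, Py, Vx, Vy hold the state of all realizations, concatenated realization
 * by realization. Illustrative sketch only; the update rules are not shown. */
__global__ void simulate_ensemble(float *Px, float *Py, float *Vx, float *Vy,
                                  int N, int nsteps)
{
    int base  = blockIdx.x * N;      /* first agent of this realization        */
    int agent = threadIdx.x;         /* the agent handled by this thread       */

    for (int step = 0; step < nsteps; ++step) {
        if (agent < N) {
            /* compute_influence stands in for the zone-based rules of Sec. 2. */
            // compute_influence(Px, Py, Vx, Vy, base, N, agent);
        }
        __syncthreads();             /* all influences computed before moving  */
        if (agent < N) {
            // apply_update(Px, Py, Vx, Vy, base, N, agent);
        }
        __syncthreads();             /* all agents advance in lock-step        */
    }
}

/* Host-side launch: R independent realizations, one block of N threads each
 * (assuming N does not exceed the maximum threads per block). */
// simulate_ensemble<<<R, N>>>(dPx, dPy, dVx, dVy, N, 3000);
```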

4.2 Parallelization Within a Single Realization

We briefly review the sequential algorithm before describing our strategy for parallelization across individuals. To begin a simulation, individual agents are placed in a bounded region with random positions and headings. The simulation is then advanced for 3000 steps to reach a steady state; by steady state we mean that the group has self-organized and arrived at a particular type of collective motion. During a time step, each agent must compute the influence of all other agents on itself. To compute the influence on an agent, we first calculate the distance between this goal agent and all other agents, and then use a few variables to record the influences coming from the different zones. We then calculate the net influence on this goal agent, including a random noise term, and save it to an influence array. In a system with N fish agents, the time complexity of the distance calculation on a CPU is O(N^2), and that of the influence calculation is O(N). After determining the influences for all of the agents in the fish school, we update the positions and directions of all agents simultaneously, based on the influence array at the current time step, as sketched in the code fragment below.

The on-chip shared memory is very limited but has much lower latency for memory instructions than the global memory: it takes about 2 clock cycles to access data in shared memory, while reads and writes to global memory incur several hundred additional clock cycles of latency [8]. To make effective use of the GPU, the on-chip shared memory must be used as much as possible. In theory it is better to have at least 128 threads per block to get the best performance [8], but we also need to take the limited shared memory size into account. Thus, there are restrictions on the total number of fish agents stored in shared memory and on the number of threads in each block.

The main method of parallelizing within a single realization is to decompose the problem domain into smaller sub-domains that are computed in parallel.
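Before turning to the parallel decomposition, the two-phase sequential step referred to above can be summarized as the loop below: all influences are computed from the current state into a separate array first, and only then are positions and headings advanced, so that every agent sees the same snapshot of the school. This is an illustrative sketch that reuses the Agent type and the update_heading function from the example in Section 2; gaussian() is a hypothetical helper standing in for the wrapped-normal noise draw.

```cuda
/* One sequential time step of the schooling model (host/CPU reference).
 * Phase 1: O(N^2) pass computing every agent's influence from the current state.
 * Phase 2: simultaneous update of all positions and headings.
 * new_vx/new_vy play the role of the "influence array". */
void step_school(Agent *a, int N, float rr, float rp, float wa, float wo,
                 float sigma, float max_turn, float speed, float tau,
                 float *new_vx, float *new_vy)
{
    for (int i = 0; i < N; ++i) {
        float noise = sigma * gaussian();   /* hypothetical noise-angle helper */
        update_heading(a, N, i, rr, rp, wa, wo, noise, max_turn,
                       &new_vx[i], &new_vy[i]);
    }
    for (int i = 0; i < N; ++i) {           /* all agents move at once */
        a[i].vx = new_vx[i];
        a[i].vy = new_vy[i];
        a[i].px += speed * a[i].vx * tau;
        a[i].py += speed * a[i].vy * tau;
    }
}
```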

Figure 3: Decomposition of the fish schooling model into CUDA threads and blocks for simulating schools of size N = 100: 100 threads per one-dimensional block, each thread responsible for one fish agent, with 160 blocks used to maximize the use of GPU resources.

In our simulation, the system state vectors Px_i^j(t), Py_i^j(t), Vx_i^j(t), Vy_i^j(t) for each block i are partitioned into subsets, each manipulated by one thread. Assume that there are N fish agents in one realization simulated with n threads; then on the order of k = N/n passes are needed to calculate the influences on all agents. To begin a simulation, all agents are initialized with random positions and directions of travel in parallel. At each time step, n agents are loaded at a time, with each thread managing one agent, repeated k times. For schools of size N = 100, we set k = 1 for best performance; when simulating larger schools, one must take k > 1. The decomposition of the schooling model for schools of size N = 100 is shown in Figure 3. The system state variables loaded into shared memory at this stage are called the goal agents. Each thread holds only one goal agent at a time and is responsible for computing the influences on that goal agent. Thus, at each time step, each thread operates on one k-element subvector.

To calculate the distance between a goal agent and all the other agents, the positions and directions of all the other agents are needed. A temporary array of size n in shared memory is used to load n agents at a time, with each thread loading one agent. If k > 1, the program keeps loading until all the required data has been loaded and used in the calculations for all of the goal agents. Each thread can then calculate the influence information for its own goal agent, and at the last step it records the results into its own influence subvector. Thus the distance calculation complexity per thread is reduced to O(N), and the influence calculation complexity is reduced to O(1). After all the influence calculations are completed, each thread updates its system state subvector from its influence subvector. The process of parallelizing across realizations and within a single realization is shown in Figure 4. Concurrent reads of the same data are supported by the hardware, so during each calculation every thread can read the full system state vector without conflicts.

Figure 4: m blocks and n threads are used to parallelize across realizations and within a single realization. First, each thread loads the information of one agent, its goal agent, from device memory into shared memory. Second, each thread loads its goal agent into an intermediate array in shared memory. Third, each thread uses the full data in the intermediate array to compute the influences on its own goal agent. The loading continues until all of the influences on the goal agent have been computed. Then, each thread saves its influence to the influence record array (reusing the previous array to save shared memory). Finally, each thread updates the information of its goal agent in the system state vector on the device.
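The kernel fragment below sketches this tiled loading for the simplest case k = 1 (one goal agent per thread, one thread per agent): each thread stages one agent into shared memory per pass, the block synchronizes, and every thread then accumulates that tile's contribution to the influence on its own goal agent. The macro, array, and variable names are illustrative, and the zone rules of Section 2 are abbreviated to a placeholder accumulation.

```cuda
#define NTHREADS 100   /* threads per block = agents per realization (k = 1) */

__global__ void step_influences(const float *Px, const float *Py,
                                const float *Vx, const float *Vy,
                                float *Ix, float *Iy, int N)
{
    /* One tile of agent state staged in low-latency shared memory. */
    __shared__ float sPx[NTHREADS], sPy[NTHREADS], sVx[NTHREADS], sVy[NTHREADS];

    int base = blockIdx.x * N;      /* this block's realization                */
    int i    = threadIdx.x;         /* goal agent handled by this thread       */
                                    /* (assumes blockDim.x == NTHREADS == N)   */

    float gx = Px[base + i], gy = Py[base + i];   /* goal agent, in registers */
    float accx = 0.0f, accy = 0.0f;

    /* Stage the school into shared memory one tile at a time. */
    for (int tile = 0; tile < N; tile += NTHREADS) {
        int j = tile + threadIdx.x;
        if (j < N) {
            sPx[threadIdx.x] = Px[base + j];
            sPy[threadIdx.x] = Py[base + j];
            sVx[threadIdx.x] = Vx[base + j];
            sVy[threadIdx.x] = Vy[base + j];
        }
        __syncthreads();                          /* tile fully loaded         */

        int count = min(NTHREADS, N - tile);
        for (int t = 0; t < count; ++t) {
            if (tile + t == i) continue;          /* skip the goal agent itself */
            float rx = sPx[t] - gx, ry = sPy[t] - gy;
            /* ... zone tests and Eqs. (1)-(2) would accumulate here ...       */
            accx += rx;  accy += ry;              /* placeholder accumulation  */
        }
        __syncthreads();                          /* done with this tile       */
    }

    /* Record the influence; positions and headings are updated in a second
     * phase, once every thread has written its influence. */
    Ix[base + i] = accx;
    Iy[base + i] = accy;
}
```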

In summary, during the entire simulation most calculations operate on data in the low-latency shared memory, and multiple blocks running in parallel hide the memory access latency and maximize the use of the arithmetic units. This results in a large performance improvement.

4.3 Random Number Generation

A huge number of high-quality pseudorandom numbers is required to generate initial conditions and to add noise to the calculations at each time step. Statistical results can be trusted only if the independence of the samples can be guaranteed. Pregenerating a large random number sequence is not a good choice, because it consumes too much time in memory accesses. To generate random numbers for our application, we use the Mersenne Twister (MT) algorithm [7, 9]. It has passed many statistical randomness tests, including the stringent Diehard tests, it generates high-quality, long-period random sequences with a high order of dimensional equidistribution, and it makes efficient use of memory. Thus, MT is well suited to our application. A modified version of Eric Mills' multi-threaded C implementation [9] of the MT algorithm was employed to generate the random numbers for each thread in parallel. To generate a large number of random numbers efficiently, a shared-memory-based implementation is used.

5 Simulation Results and Performance

Our simulations were performed on an NVIDIA GeForce 8800 GTX installed in a host workstation with an Intel Pentium 3.00 GHz CPU and 3.50 GB of RAM with Physical Address Extension. We explored the effects of varying the ratio of orientation to attraction weights, r = \omega_o / \omega_a, for schools of N = 100 members.

For each r considered, steady-state simulations of the schooling model were performed in parallel on the GPU (1120 replicates per value of r; see Figure 5). The remaining parameters were set to r_r = 1, r_p = 7, \eta = 350, s = 1, \tau = 0.2, \sigma = 0.01, and \theta = 115. For r close to zero, attraction tendencies dominate over alignment tendencies and the groups exhibit swarming behavior. As r is increased past 1 (equal orientation and attraction weights), schools become increasingly more polarized, forming highly aligned mobile groups for large r. Group polarization, defined as

    P(t) = \frac{1}{N} \left| \sum_{i=1}^{N} \hat{v}_i(t) \right|,

serves as a good measure of the mobility of a school. See Figure 5 for a plot of the average group polarization as a function of r.

The simulation performance in generating these results was extraordinary. The parallel GPU simulation is roughly two orders of magnitude faster than the corresponding sequential simulation of the same model on the CPU: the polarization curve in Figure 5 took a few minutes to generate on the GPU, while the corresponding serial computation on the CPU would have taken more than 8 hours.

6 Conclusions

We showed how stochastic steady-state simulations of an individual-based fish schooling model can be efficiently performed in parallel on a general-purpose GPU, and we observed speedups of roughly two orders of magnitude for our parallelized code on the GPU over the corresponding sequential code running on the CPU of the host workstation. With such processing capabilities at our fingertips, it is easy to compare the effects of different modeling assumptions and parameters, allowing us to refine our understanding of how behavior at the population level arises from individual-level interactions. The GPU has the power to revolutionize the way we do scientific computation by bringing the processing power of a cluster to the home desktop.

Figure 5: Group polarization as a function of r, averaged over 1120 steady-state replicate simulations run in parallel on the GPU. (a) A realization of the swarm state (r = 0.125), (b) a realization of the dynamic parallel state (r = 2), (c) a realization of the highly parallel state (r = 1000).

7 Acknowledgments

This work was supported in part by the U.S. Department of Energy under DOE award No. DE-FG02-04ER25621, by NIH Grant EB007511, by the Institute for Collaborative Biotechnologies through Grant DAAD19-03-D004 from the U.S. Army Research Office, and by a National Science Foundation grant.

References

[1] S. Camazine, J. L. Deneubourg, N. R. Franks, J. Sneyd, G. Theraulaz, and E. Bonabeau. Self-Organization in Biological Systems. Princeton University Press, Princeton, 2001.

[2] I. D. Couzin, J. Krause, N. R. Franks, and S. A. Levin. Effective leadership and decision-making in animal groups on the move. Nature, 433:513-516, 2005.

[3] GPGPU-Home. GPGPU homepage.

[4] H. Li, A. Kolpas, L. Petzold, and J. Moehlis. Parallel simulation for a fish schooling model on a general-purpose graphics processing unit. Concurrency and Computation: Practice and Experience, to appear.

[5] H. Li and L. Petzold. Stochastic simulation of biochemical systems on the graphics processing unit. Technical report, Department of Computer Science, University of California, Santa Barbara. Submitted.

[6] W.-M. Hwu. GPU computing: programming, performance, and scalability. In Proceedings of the Block Island Workshop on Cooperative Control.

[7] M. Matsumoto and T. Nishimura. Mersenne Twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation (TOMACS), 8:3-30, 1998.

[8] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 2007.

[9] NVIDIA Forums members. NVIDIA forums.

[10] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. In Eurographics 2005, State of the Art Reports, pages 21-51, August 2005.
