Efficient Parallel Simulation of an Individual-Based Fish Schooling Model on a Graphics Processing Unit
Hong Li
Department of Computer Science, University of California, Santa Barbara, CA
hongli@cs.ucsb.edu

Allison Kolpas
Department of Mathematics, University of California, Santa Barbara, CA
allie@math.ucsb.edu

Linda Petzold
Department of Computer Science and Department of Mechanical Engineering, University of California, Santa Barbara, CA
petzold@cs.ucsb.edu

Jeff Moehlis
Department of Mechanical Engineering, University of California, Santa Barbara, CA
moehlis@engineering.ucsb.edu

April 8, 2008
Abstract

Due to their low cost and high-performance processing capabilities, graphics processing units (GPUs) have become an attractive alternative to clusters for some scientific computing applications. In this paper we show how stochastic simulation of an individual-based fish schooling model can be efficiently carried out on a general-purpose GPU. We describe our implementation and present computational results to illustrate the power of this new technology.

1 Introduction

Driven by the video gaming industry, the Graphics Processing Unit (GPU) has evolved into an inexpensive yet powerful high-speed computing device for scientific applications. The GPU has a highly parallel structure, high memory bandwidth, and more transistors devoted to data processing, rather than to data caching and flow control, than a CPU [8]. Problems that can be implemented with stream processing and limited memory are well suited to the GPU. Single Instruction Multiple Data (SIMD) computation, in which a large number of completely independent records are processed by the same sequence of operations simultaneously, is an ideal type of GPU application. In recent years, computation on the general-purpose GPU (GPGPU) has become an active research field, with a wide range of applications including cellular automata, particle systems, fluid dynamics, and computational geometry [3, 10, 5, 4]. Previous generations of GPUs required non-graphics applications to be recast as graphics computations through a graphics application programming interface (API). Last year, NVIDIA released the Compute Unified Device Architecture (CUDA) toolkit for its GPUs, providing general-purpose functionality with a C-like language for non-graphics applications.
Fish schooling is an important example of self-organized collective motion in animal groups. The collective behavior of the group emerges from local interactions of individuals with their neighbors, without any leader, template, or other external cue. Such groups can be composed of hundreds to millions of members, with all individuals responding rapidly to their neighbors to maintain the collective motion. In this paper we show how long-time stochastic simulations of an individual-based fish schooling model can be efficiently performed on a CUDA-enabled GPU. In the schooling model, each organism or agent is treated individually, with rules specifying its dynamics and interactions with other agents. Noise is included to account for imperfect sensing and processing. For different values of the parameters, different schooling behaviors emerge, including a swarm state, in which individuals move incoherently about a central location, and a mobile state, in which individuals travel in an aligned, polarized group. A large number of realizations is necessary to accurately determine the statistics of the collective motion. Simulations of the schooling model carry a high computational cost and can benefit from parallel processing. In the model, each agent updates its direction of travel based on the positions and directions of travel of all other agents. This computation can be performed in parallel across individuals within a single realization. In addition, the realizations can be performed in parallel. The architecture of the GPU is quite suitable for both types of parallel processing. In the following, we first describe the details of the individual-based fish schooling model. We then review the features of the GPU and show how parallel processing, both across individuals within a single realization and across realizations, can be implemented on a CUDA-enabled GPU to efficiently perform ensemble long-time simulations of the schooling model.
2 Fish Schooling Model

Many organisms move and travel together in self-organizing groups, including flocks of birds, schools of fish, and swarms of locusts [1]. Individual-based models (IBMs) are frequently used to describe the dynamics of such groups, since they can incorporate biologically realistic social interactions and behavioral responses, as well as relate individual-level behaviors to emergent population-level dynamics. Here we consider a two-dimensional individual-based model for fish schooling. This model is similar to that considered in [2], but without an informed leader, and with different weights of orientation and attraction response. Groups are composed of N individuals with positions p_i(t), unit directions \hat{v}_i(t), constant speed s, and maximum turning rate \theta. At every time step of size \tau, individuals simultaneously determine a new direction of travel by considering neighbors within two behavioral zones: a zone of repulsion of radius r_r about the individual, and a zone of orientation and attraction with inner radius r_r and outer radius r_p. The latter includes a blind area, defined as a circular sector with central angle (2\pi - \eta), within which neighbors are undetectable. These zones are used to define behavioral rules of motion. First, if individual i finds agents within its zone of repulsion, it repels away from them by orienting its direction away from their average relative directions. Its desired direction of travel in the next time step is given by

    v_i(t + \tau) = -\sum_{j \neq i} \frac{p_j(t) - p_i(t)}{\| p_j(t) - p_i(t) \|},    (1)

and normalized as \hat{v}_i(t+\tau) = v_i(t+\tau) / \|v_i(t+\tau)\|, assuming v_i(t+\tau) \neq 0. If v_i(t+\tau) = 0, agent i maintains its previous direction of travel, giving \hat{v}_i(t+\tau) = \hat{v}_i(t). If agents are not found within individual i's zone of repulsion, then it will align with (by averaging the directions of travel of itself and its neighbors) and feel
an attraction towards (by orienting itself towards the average relative directions of) agents within the zone of orientation and attraction. Its desired direction of travel is given by the weighted sum of two terms:

    v_i(t + \tau) = \omega_a \sum_{j \neq i} \frac{p_j(t) - p_i(t)}{\| p_j(t) - p_i(t) \|} + \omega_o \sum_{j} \hat{v}_j(t),    (2)

where \omega_a and \omega_o are the weights of attraction and orientation, respectively. This vector is normalized, assuming v_i(t+\tau) \neq 0. If v_i(t+\tau) = 0, then agent i maintains its previous direction of travel. Noise effects are incorporated by rotating agent i's desired direction \hat{v}_i(t+\tau) by an angle drawn from a circularly wrapped normal distribution with mean 0 and standard deviation \sigma. Also, since individuals can only turn \theta\tau radians in one time step, if the angle between \hat{v}_i(t) and \hat{v}_i(t+\tau) is greater than \theta\tau, individuals do not achieve their desired direction, and instead rotate \theta\tau towards it. Finally, each agent's position is updated simultaneously as

    p_i(t + \tau) = p_i(t) + s \hat{v}_i(t + \tau) \tau.    (3)

To begin a simulation, individuals are placed in a bounded region with random positions and directions of travel. Simulations are run for approximately 3000 time steps. The fish schooling model is very well suited to the GPU because of its high arithmetic intensity, relatively simple data structure needs, and its complete data parallelism for ensemble simulations. One may also quite easily assess the performance and accuracy of simulations on the GPU by comparing with results from the host workstation.
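As an illustration, the per-agent direction update of Eqs. (1)-(2) can be sketched in NumPy as follows. This is a minimal sketch, not the authors' code: the blind area, noise rotation, and turning-rate limit are omitted, and all function and parameter names are our own.

```python
import numpy as np

def desired_direction(i, pos, vel, rr=1.0, rp=7.0, wa=1.0, wo=1.0):
    """Desired direction of agent i for the next time step (Eqs. 1-2).

    pos, vel: (N, 2) arrays of positions and unit headings.
    rr, rp: radii of the repulsion and orientation/attraction zones.
    wa, wo: weights of attraction and orientation.
    """
    diff = pos - pos[i]                       # vectors from agent i to all agents
    dist = np.linalg.norm(diff, axis=1)
    dist[i] = np.inf                          # exclude self from neighbor tests
    repel = dist < rr
    if repel.any():
        # Eq. (1): orient away from the average relative direction of
        # agents inside the zone of repulsion
        v = -np.sum(diff[repel] / dist[repel][:, None], axis=0)
    else:
        near = dist < rp
        # Eq. (2): weighted attraction plus orientation; the orientation
        # sum includes agent i's own heading
        attract = np.sum(diff[near] / dist[near][:, None], axis=0)
        orient = vel[near].sum(axis=0) + vel[i]
        v = wa * attract + wo * orient
    n = np.linalg.norm(v)
    # if the desired direction vanishes, keep the previous heading
    return v / n if n > 0 else vel[i].copy()
```

For example, an agent with a single neighbor 0.5 units to its right (inside the repulsion zone) heads directly away from it, while an agent with only distant neighbors blends attraction and orientation into a unit heading.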
3 The Graphics Processing Unit: A New Data-Parallel Computing Device

The GPU was originally designed as a dedicated graphics rendering device for workstations and PCs. Driven by the economic pressure of the fast-growing interactive entertainment industry, the GPU has evolved rapidly into a very powerful general-purpose computing device (see Figure 1 [8]). The modern GPU is small enough to fit into most desktops and workstations, creating a supercomputer on the desktop, and it allows users to program it in a high-level language. Although it currently supports only 32-bit floating-point precision, that will change soon. The GPU is best suited for parallel processing applications and computations with high stream-processing floating-point arithmetic intensity and a high ratio of computation to memory access [6]. It is especially well suited to SIMD applications, since the calculation is able to hide the memory access latency. But it is still a specialized processor: it is not efficient for applications which are memory-access intensive, require double precision, or involve many logical operations on integer data or many branches [6].

Figure 1: Observed peak GFLOPS on the GPU, compared with theoretical peak GFLOPS on the CPU [8].

The NVIDIA 8800 GTX that we use belongs to the new generation with stream processing, rather than the pipelining that was characteristic of previous generations of GPUs. The 480 mm^2 surface area of the NVIDIA 8800 GTX chip contains 768 MB of RAM and 681 million transistors constructing 128 stream processors,
grouped into 16 multiprocessors (8 streaming processors in each multiprocessor), as shown in Figure 2 [8]. The 8 processors in one multiprocessor share a 16 KB low-latency shared memory, which brings data closer to the arithmetic units. The maximum observed bandwidth between system and device memory is about 2 GB/second. The global memory can be accessed at the same speed, but with much higher latency.

Figure 2: Hardware model: a set of SIMD multiprocessors with on-chip shared memory [8].

Until recently, researchers interested in employing GPUs for scientific computing applications had to use a graphics application programming interface (API) such as OpenGL. They would first recast their model into a graphics API, and then trick the GPU into running it as a graphics code. This is a very uncommon programming pattern, and it made migrating non-graphics applications onto the GPU a significant challenge. Just last year, NVIDIA introduced the CUDA Software Development Kit (SDK), which supplies an essential high-level development environment for general-purpose computation on NVIDIA GPUs. This minimizes the learning curve for beginners to access the low-level hardware and gives the user
much more development flexibility than graphics programming languages [8]. A productive way of computing on the device is to maximize the use of the fast but small shared memory and to minimize accesses to the slow global memory. To do this, we must first develop a strategy to make the data fit into the limited shared memory. In our computation, to obtain memory-level parallelism in each block, the threads first load their target agents' information from global memory to shared memory simultaneously, and then perform the computation involving the target agents entirely on data in shared memory. After this computation, each thread copies the data of its target agent back to global memory in parallel. A kernel is able to run on many thread blocks in parallel without any communication among the blocks, which is important because cooperation among different thread blocks is not yet supported. Within one thread block, all of the threads can synchronize through their hazard-free shared memory.

4 Parallelization

There are two ways to parallelize the simulation. One is to parallelize across realizations: independent realizations with different initial conditions are performed in parallel. Such realizations are uncoupled and therefore do not need to communicate with each other. Each realization communicates with the host only to get the initial data and to transfer the results at the end of the simulation. The other is to parallelize within a single realization: the domain of state variables is partitioned into smaller sub-domains, and all the threads cooperate with each other, each manipulating its own sub-domain in parallel to advance the simulation. We use both methods of parallelization for the fish schooling model. In parallelizing within a single realization, the different sub-domains of the fish schooling model are tightly coupled. Even for an application which is
ideal for the GPU architecture, the user still needs to deal carefully with the limited shared memory and the large latency of the global memory. Efficient concurrent use of the shared and global memory can supply data to the arithmetic units at a reasonable rate, resulting in good performance. We will show how our computation for the fish schooling model can be structured to fit within the GPU architecture constraints.

4.1 Parallelization Across Realizations

Parallelization across realizations is a straightforward and effective way to improve performance for this application. Using multiple blocks for ensemble simulations keeps the GPU's large arithmetic capacity fully used and hides the shared and global memory access latency, achieving good performance. First, the kernel instructions are distributed among thread blocks, each of which is in charge of a single realization. The system state for the fish schooling model is defined as (Px_i^j(t), Py_i^j(t), Vx_i^j(t), Vy_i^j(t)), where Px_i^j(t) and Py_i^j(t) represent the x and y components of the position of a fish agent, Vx_i^j(t) and Vy_i^j(t) represent the x and y components of its velocity, i is the block id, j is the id of the fish agent, and t represents time. Each thread block with id i = c computes on the subset (Px_c^j(t), Py_c^j(t), Vx_c^j(t), Vy_c^j(t)) and generates its own initial state variables (Px_c^j(0), Py_c^j(0), Vx_c^j(0), Vy_c^j(0)) using the parallel Mersenne Twister (MT) [7, 9] random number generator. The desired results are stored in the final state vectors (Px_i^j(t_f), Py_i^j(t_f), Vx_i^j(t_f), Vy_i^j(t_f)) after t_f simulation steps. We use an intermediate data structure on the device to minimize transfers between the host and the device, and group several small transfers into one big transfer to reduce the per-transfer overhead.
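The across-realizations scheme can be sketched on the CPU as follows: each worker plays the role of one thread block, generating its own initial state from an independent random stream and returning its final state to the host. This is a hedged illustration only; the names, the use of NumPy's random generator in place of the parallel Mersenne Twister, and the thread pool are our stand-ins, and the body of the time loop is elided.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def run_realization(seed, n_agents=100, steps=3000):
    """One independent realization with its own random stream.

    Stand-in for one thread block on the GPU: random initial positions
    and unit headings, then `steps` updates of the schooling dynamics
    (the update itself is elided in this sketch).
    """
    rng = np.random.default_rng(seed)                 # independent stream
    pos = rng.uniform(-1.0, 1.0, size=(n_agents, 2))  # random positions
    ang = rng.uniform(0.0, 2.0 * np.pi, size=n_agents)
    vel = np.column_stack([np.cos(ang), np.sin(ang)]) # random unit headings
    for _ in range(steps):
        pass  # advance the schooling dynamics here
    return pos, vel

# Realizations are uncoupled, so they run concurrently with no
# communication, mirroring one thread block per realization.
with ThreadPoolExecutor() as pool:
    finals = list(pool.map(run_realization, range(8)))
```

Only the seeds go in and only the final states come out, which mirrors the paper's point that each realization exchanges data with the host solely at the start and end of the simulation.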
4.2 Parallelization Within a Single Realization

We briefly review the sequential algorithm before demonstrating our strategy for parallelization across individuals. To begin a simulation, individual agents are placed in a bounded region with random positions and headings. Then, the simulation is advanced for 3000 steps to reach a steady state. By steady state we mean that the group has self-organized and arrived at a particular type of collective motion. During a time step, each agent has to compute the influence of all other agents on itself. To compute the influence on an agent, we first calculate the distance between this goal agent and all other agents, and then use a few variables to record the influence coming from the different zones. Then we calculate the net influence for this goal agent, including a random noise term, and save it to an influence array. In a system with N fish agents, the time complexity for the distance calculation on a CPU is O(N^2), and for the influence calculation it is O(N). After determining the influences for all of the agents in the fish school, we update the positions and directions of the agents simultaneously, based on the influence array at the current time step. The on-chip shared memory is very limited, but it has much lower latency for memory instructions than the global memory: it takes about 2 clock cycles to access data in shared memory, while reads and writes to global memory incur hundreds of additional clock cycles of latency [8]. To make effective use of the GPU, the on-chip shared memory must be used as much as possible. In theory it is better to have at least 128 threads to get the best performance improvement [8], but we also need to take the limited shared memory size into account. Thus, there are restrictions on the total number of fish agents stored in the shared memory and the number of threads in each block.
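The sequential structure of one time step can be sketched as follows: all N^2 pairwise distances are computed first, then all influences, and only then are all agents moved at once. This is an illustrative NumPy version of the scheme described above, not the authors' code; the influence rule is passed in as a user-supplied function.

```python
import numpy as np

def distance_matrix(pos):
    """All pairwise distances for one time step: the O(N^2) part of the
    sequential algorithm."""
    diff = pos[:, None, :] - pos[None, :, :]  # (N, N, 2) relative positions
    return np.linalg.norm(diff, axis=2)       # (N, N) distance matrix

def advance(pos, vel, influence, s=1.0, tau=0.2):
    """One synchronous step: every influence is computed from the current
    state, and only afterwards are all positions updated (Eq. 3)."""
    dist = distance_matrix(pos)               # O(N^2) distance calculation
    new_vel = np.array([influence(i, pos, vel, dist)
                        for i in range(len(pos))])  # O(N) per agent
    return pos + s * new_vel * tau, new_vel   # simultaneous update
```

Computing the full influence array before touching any position is what makes the update simultaneous, matching the model's synchronous dynamics.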
The main method of parallelizing within a single realization is to decompose the problem domain into smaller sub-domains to compute in parallel. In our
simulation, the system state vectors (Px_i^j(t), Py_i^j(t), Vx_i^j(t), Vy_i^j(t)) for each block i are partitioned into subsets, each manipulated by one thread. Assume that there are N fish agents in one realization, simulated with n threads. Then k = N/n passes are needed to calculate the influences on all agents. To begin a simulation, all agents are initialized with random positions and directions of travel in parallel. At each time step, agents are loaded n at a time, with each thread managing one agent, k times in total. For schools of size N = 100, we set k = 1 for best performance. When simulating larger schools, one must take k > 1. The decomposition of the schooling model for schools of size N = 100 is shown in Figure 3.

Figure 3: Decomposition of the fish schooling model into CUDA threads and blocks for simulating schools of size N = 100: 100 threads per one-dimensional block, each thread responsible for one individual fish agent, with 160 blocks used to maximize the use of GPU resources.

The system state variables loaded into shared memory at this stage are called the goal agents. Each thread holds only one goal agent at a time and is responsible for computing the influences on this goal agent. Thus, at one time step, each thread operates on one k-element subvector. To calculate
the distance between the goal agent and all the other agents, the positions and directions of all the other agents are needed. A temporary array of size n in shared memory is used to load n agents at a time, with each thread loading one agent. If k > 1, the program keeps loading until all the required data has been used in the calculations for all the goal agents. Then, each thread calculates the influence information for its own goal agent. At the last pass, each thread records the results into its own influence subvector. Thus the distance calculation complexity per thread is reduced to O(N), and the influence calculation complexity to O(1). After all the influence calculations are completed, each thread updates its system state subvector using its influence subvector. The process of parallelizing across realizations and within a single realization is shown in Figure 4. Concurrent reads of the same data are supported by the hardware; thus, during each calculation, each thread can read the full system state vector without conflicts.

Figure 4: m blocks and n threads are used to parallelize across realizations and within a single realization. First, each thread loads the information of one agent, its goal agent, from device memory to shared memory. Second, each thread loads its goal agent into an intermediate array in shared memory. Third, each thread uses the full data in the intermediate array to compute the influences on its own goal agent. The loading continues until all of the influences on the goal agent have been computed.
Then, each thread saves its influence to the influence record array (reusing the previous array to save shared memory). Finally, each thread writes its goal agent's updated information back to the system state vector on the device.
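The staged, tile-by-tile traversal described above can be mimicked on the CPU: the other agents are streamed through in tiles of n, with each tile playing the role of one shared-memory load. This is a NumPy sketch with our own names; it accumulates the summed relative-position vectors of the kind used by the attraction terms, which is enough to show that the tiled traversal reproduces a single full pass.

```python
import numpy as np

def attraction_tiled(pos, tile=100):
    """Accumulate sum_j (p_j - p_i) for every goal agent i by streaming
    the other agents through in tiles of size `tile` (the analogue of a
    thread block's n-agent shared-memory load)."""
    acc = np.zeros_like(pos)
    for start in range(0, len(pos), tile):
        staged = pos[start:start + tile]      # one "shared memory" load
        # every goal agent i sees this tile: (N, tile, 2) relative vectors
        acc += (staged[None, :, :] - pos[:, None, :]).sum(axis=1)
    return acc
```

Because the per-tile partial sums are simply accumulated, the result is independent of the tile size; on the GPU this is what lets a block with n threads handle schools larger than n by taking k = N/n passes.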
In summary, during the entire simulation most calculations operate on data in the low-latency shared memory, and multiple blocks running in parallel hide the memory access latency and maximize the use of the arithmetic units. This results in a large performance improvement.

4.3 Random Number Generation

A huge number of high-quality pseudorandom numbers is required to generate initial conditions and to add noise to our calculations at each time step. Statistical results can only be trusted if the independence of the samples can be guaranteed. Pregenerating a large random number sequence is not a good choice, because it costs too much time in memory accesses. To generate random numbers for our application, we use the Mersenne Twister (MT) algorithm [7, 9]. It has passed many statistical randomness tests, including the stringent Diehard tests. It is able to generate high-quality, long-period random sequences with a high order of dimensional equidistribution, and it makes efficient use of memory. Thus, MT is well suited to our application. A modified version of Eric Mills' multi-threaded C implementation [9] of the MT algorithm was employed to generate the random numbers for each thread in parallel. To generate a large number of random numbers efficiently, a shared-memory-based implementation is employed.

5 Simulation Results and Performance

Our simulations were performed on an NVIDIA GeForce 8800 GTX installed in a host workstation with an Intel Pentium 3.00 GHz CPU and 3.50 GB of RAM with Physical Address Extension. We explored the effects of varying the ratio of orientation to attraction tendencies, r = \omega_o / \omega_a, for N = 100 member schools. For each r considered,
steady-state simulations of the schooling model were performed on the GPU. The remaining parameters were set to r_r = 1, r_p = 7, \eta = 350, s = 1, \tau = 0.2, \sigma = 0.01, and \theta = 115. For r close to zero, attraction tendencies dominate over alignment tendencies and groups exhibit swarming behavior. As r is increased past 1 (equal orientation and attraction weights), schools become increasingly more polarized, forming highly aligned mobile groups for large r. Group polarization, defined as

    P(t) = \frac{1}{N} \left\| \sum_{i=1}^{N} \hat{v}_i(t) \right\|,

serves as a good measure of the mobility of a school. See Figure 5 for a plot of the average group polarization as a function of r. The simulation performance in generating these results was extraordinary: the parallel GPU simulation runs many times faster than the corresponding sequential simulation of the same model on the CPU. The polarization curve in Figure 5 took a few minutes to generate on the GPU, while the corresponding serial version on the CPU would have taken more than 8 hours.

6 Conclusions

We showed how stochastic steady-state simulations of an individual-based fish schooling model can be efficiently performed in parallel on a general-purpose GPU, and observed large speedups for our parallelized code on the GPU over the corresponding sequential code running on the CPU of the host workstation. With such processing capabilities at our fingertips, it is easy to compare the effects of different modeling assumptions and parameters, allowing us to refine our understanding of how population-level behavior arises from individual-level interactions. The GPU has the power to revolutionize the
way we do scientific computation by bringing the processing power of a cluster to the home desktop.

Figure 5: Group polarization as a function of r, averaged over 1120 steady-state replicate simulations run in parallel on the GPU. (a) A realization of the swarm state (r = 0.125); (b) a realization of the dynamic parallel state (r = 2); (c) a realization of the highly parallel state (r = 1000).

7 Acknowledgments

This work was supported in part by the U.S. Department of Energy under DOE award No. DE-FG02-04ER25621, by NIH Grant EB007511, by the Institute for Collaborative Biotechnologies through Grant DAAD19-03-D004 from the U.S. Army Research Office, and by a National Science Foundation grant.

References

[1] S. Camazine, J. L. Deneubourg, N. R. Franks, J. Sneyd, G. Theraulaz, and E. Bonabeau. Self-Organization in Biological Systems. Princeton University Press, Princeton.
[2] I. D. Couzin, J. Krause, N. R. Franks, and S. A. Levin. Effective leadership and decision making in animal groups on the move. Nature, 433.

[3] GPGPU-Home. GPGPU homepage.

[4] H. Li, A. Kolpas, L. Petzold, and J. Moehlis. Parallel simulation for a fish schooling model on a general-purpose graphics processing unit. Concurrency and Computation: Practice and Experience, to appear.

[5] H. Li and L. Petzold. Stochastic simulation of biochemical systems on the graphics processing unit. Technical report, Department of Computer Science, University of California, Santa Barbara. Submitted.

[6] W.-m. Hwu. GPU computing: programming, performance, and scalability. In Proceedings of the Block Island Workshop on Cooperative Control.

[7] M. Matsumoto and T. Nishimura. Mersenne Twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation (TOMACS), 8:3-30.

[8] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide.

[9] NVIDIA Forums members. NVIDIA forums.

[10] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. In Eurographics 2005, State of the Art Reports, pages 21-51, Aug.
1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution
More informationGraphics Processor Acceleration and YOU
Graphics Processor Acceleration and YOU James Phillips Research/gpu/ Goals of Lecture After this talk the audience will: Understand how GPUs differ from CPUs Understand the limits of GPU acceleration Have
More informationGeneral-purpose computing on graphics processing units (GPGPU)
General-purpose computing on graphics processing units (GPGPU) Thomas Ægidiussen Jensen Henrik Anker Rasmussen François Rosé November 1, 2010 Table of Contents Introduction CUDA CUDA Programming Kernels
More informationG P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G
Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty
More informationWHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
More informationCurrent Trends in Computer Graphics Hardware
Current Trends in Computer Graphics Hardware Dirk Reiners University of Louisiana Lafayette, LA Quick Introduction Assistant Professor in Computer Science at University of Louisiana, Lafayette (since 2006)
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationCUDA Performance Optimization. Patrick Legresley
CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationCSE 160 Lecture 24. Graphical Processing Units
CSE 160 Lecture 24 Graphical Processing Units Announcements Next week we meet in 1202 on Monday 3/11 only On Weds 3/13 we have a 2 hour session Usual class time at the Rady school final exam review SDSC
More informationLecture 1: Gentle Introduction to GPUs
CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationVariants of Mersenne Twister Suitable for Graphic Processors
Variants of Mersenne Twister Suitable for Graphic Processors Mutsuo Saito 1, Makoto Matsumoto 2 1 Hiroshima University, 2 University of Tokyo August 16, 2010 This study is granted in part by JSPS Grant-In-Aid
More informationN-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo
N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational
More informationGPU Implementation of a Multiobjective Search Algorithm
Department Informatik Technical Reports / ISSN 29-58 Steffen Limmer, Dietmar Fey, Johannes Jahn GPU Implementation of a Multiobjective Search Algorithm Technical Report CS-2-3 April 2 Please cite as: Steffen
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationGPGPU. Peter Laurens 1st-year PhD Student, NSC
GPGPU Peter Laurens 1st-year PhD Student, NSC Presentation Overview 1. What is it? 2. What can it do for me? 3. How can I get it to do that? 4. What s the catch? 5. What s the future? What is it? Introducing
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationAdaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics
Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics
More informationGPU for HPC. October 2010
GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationREDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS
BeBeC-2014-08 REDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS Steffen Schmidt GFaI ev Volmerstraße 3, 12489, Berlin, Germany ABSTRACT Beamforming algorithms make high demands on the
More informationSEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi
SEASHORE SARUMAN Summary 1 / 24 SEASHORE / SARUMAN Short Read Matching using GPU Programming Tobias Jakobi Center for Biotechnology (CeBiTec) Bioinformatics Resource Facility (BRF) Bielefeld University
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationGPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum
GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute
More informationAnalysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth
Analysis Report v3 Duration 932.612 µs Grid Size [ 1024,1,1 ] Block Size [ 1024,1,1 ] Registers/Thread 32 Shared Memory/Block 28 KiB Shared Memory Requested 64 KiB Shared Memory Executed 64 KiB Shared
More informationGPU Computation Strategies & Tricks. Ian Buck NVIDIA
GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit
More informationWhy GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)
Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009 www.gpgpu.org/ppam2009
More informationPowerVR Hardware. Architecture Overview for Developers
Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.
More informationComplexity and Advanced Algorithms. Introduction to Parallel Algorithms
Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical
More informationAutomatic Scaling Iterative Computations. Aug. 7 th, 2012
Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics
More informationHeight field ambient occlusion using CUDA
Height field ambient occlusion using CUDA 3.6.2009 Outline 1 2 3 4 Theory Kernel 5 Height fields Self occlusion Current methods Marching several directions from each fragment Sampling several times along
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationAccelerating Molecular Modeling Applications with Graphics Processors
Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More informationBy: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationDistributed Virtual Reality Computation
Jeff Russell 4/15/05 Distributed Virtual Reality Computation Introduction Virtual Reality is generally understood today to mean the combination of digitally generated graphics, sound, and input. The goal
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationAdvances in Metaheuristics on GPU
Advances in Metaheuristics on GPU 1 Thé Van Luong, El-Ghazali Talbi and Nouredine Melab DOLPHIN Project Team May 2011 Interests in optimization methods 2 Exact Algorithms Heuristics Branch and X Dynamic
More informationGPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS
GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS Agenda Forming a GPGPU WG 1 st meeting Future meetings Activities Forming a GPGPU WG To raise needs and enhance information sharing A platform for knowledge
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationGPU Programming Using NVIDIA CUDA
GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics
More informationProgrammable Graphics Hardware (GPU) A Primer
Programmable Graphics Hardware (GPU) A Primer Klaus Mueller Stony Brook University Computer Science Department Parallel Computing Explained video Parallel Computing Explained Any questions? Parallelism
More informationCenter for Computational Science
Center for Computational Science Toward GPU-accelerated meshfree fluids simulation using the fast multipole method Lorena A Barba Boston University Department of Mechanical Engineering with: Felipe Cruz,
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationIntegrating GPUs as fast co-processors into the existing parallel FE package FEAST
Integrating GPUs as fast co-processors into the existing parallel FE package FEAST Dipl.-Inform. Dominik Göddeke (dominik.goeddeke@math.uni-dortmund.de) Mathematics III: Applied Mathematics and Numerics
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationAnalysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs
AlgoPARC Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs 32nd ACM International Conference on Supercomputing June 17, 2018 Ben Karsin 1 karsin@hawaii.edu Volker Weichert 2 weichert@cs.uni-frankfurt.de
More informationParallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs
Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung
More informationVery fast simulation of nonlinear water waves in very large numerical wave tanks on affordable graphics cards
Very fast simulation of nonlinear water waves in very large numerical wave tanks on affordable graphics cards By Allan P. Engsig-Karup, Morten Gorm Madsen and Stefan L. Glimberg DTU Informatics Workshop
More informationCUDA (Compute Unified Device Architecture)
CUDA (Compute Unified Device Architecture) Mike Bailey History of GPU Performance vs. CPU Performance GFLOPS Source: NVIDIA G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX G70 = GeForce 7800 GTX NV40 = GeForce
More informationAdvanced CUDA Optimization 1. Introduction
Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines
More informationUsing Graphics Chips for General Purpose Computation
White Paper Using Graphics Chips for General Purpose Computation Document Version 0.1 May 12, 2010 442 Northlake Blvd. Altamonte Springs, FL 32701 (407) 262-7100 TABLE OF CONTENTS 1. INTRODUCTION....1
More informationCourse II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan
Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationAlgorithm Engineering with PRAM Algorithms
Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and
More informationParallelism and Concurrency. COS 326 David Walker Princeton University
Parallelism and Concurrency COS 326 David Walker Princeton University Parallelism What is it? Today's technology trends. How can we take advantage of it? Why is it so much harder to program? Some preliminary
More informationGPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA
GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationGraphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics
Why GPU? Chapter 1 Graphics Hardware Graphics Processing Unit (GPU) is a Subsidiary hardware With massively multi-threaded many-core Dedicated to 2D and 3D graphics Special purpose low functionality, high
More informationGeneral Purpose Computing on Graphical Processing Units (GPGPU(
General Purpose Computing on Graphical Processing Units (GPGPU( / GPGP /GP 2 ) By Simon J.K. Pedersen Aalborg University, Oct 2008 VGIS, Readings Course Presentation no. 7 Presentation Outline Part 1:
More informationGPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP
GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationBenchmark 1.a Investigate and Understand Designated Lab Techniques The student will investigate and understand designated lab techniques.
I. Course Title Parallel Computing 2 II. Course Description Students study parallel programming and visualization in a variety of contexts with an emphasis on underlying and experimental technologies.
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More information