CS267 Report: Particle Simulation on a GPU with PyCUDA


Min Ragan-Kelley
minrk@berkeley.edu
May 12

1 Introduction

This report is on a small test problem within the context of a larger, long-term research project. GPUs are increasingly popular for particle methods, due to the readily apparent parallelism inherent to N-body problems. Particle-in-Cell is a popular scheme for exploring systems in plasma physics. We hope to explore a small sample problem in order to gauge the viability of using PyCUDA to run interactive-scale particle simulations on a GPU for use in our research.

1.1 Project Context

The Plasma Theory and Simulation Group (PTSG) at UC Berkeley maintains a large Object-Oriented Particle-In-Cell (OOPIC) code, largely built in the mid-1990s [1]. It is primarily a 2D system, and it is a serial code. Applications of OOPIC range from small interactive simulations, running many timesteps per second on a laptop, to large offline simulations taking days or weeks to run. The goal of the object-oriented design is to facilitate the continued addition of new physics behaviors to simulations over the lifetime of the code. This design has been quite successful, and OOPIC continues to enjoy active use and development.

OOPIC's shortcomings in the present landscape are twofold. First, the User Interface (UI) is mouse-only, button-based input. This makes programmatic interaction with the simulation impossible. The available diagnostics are also static: if a user would like to add a derivative diagnostic (for example, given diagnostics J and B, add J × B), their only option is to add it to the C++ source code and recompile the program. The second major shortcoming is performance. OOPIC is a fast serial code, but parallelism was not considered in the original design. Basic parallel functionality has been implemented with MPI, but it does not scale beyond a few processors.

The goal of this project is to build a system with extensibility similar to that of existing OOPIC, while solving its parallel performance and interface problems. Our plan to accomplish this is to write the application logic in Python, with the performance components as native C or CUDA kernels. Adding new physics capabilities should be at least as well facilitated as in current OOPIC by

a modular design. With the interface itself in Python, we get a very powerful programmatic interface and user environment. Having NumPy arrays as the native data structure for diagnostics allows us to use existing plotting libraries such as matplotlib and Chaco, without having to write our own plotting system [5] [7] [8]. NumPy arrays are also the native data structure of most data analysis tools in Python, so our users gain access to all the powerful facilities of tools such as SciPy for free, simply by our decision to present diagnostics as NumPy arrays [6].

Hierarchical parallelism for multi-device simulations will be part of the design from the beginning. Spatial decomposition occurs at the device level, and each spatial region communicates with the others via MPI or shared memory in host code. Each device has its own Python process, issuing control and diagnostic commands; the Python processes communicate via the IPython parallel computing kernel, and the native code backends are also able to communicate with each other directly via MPI. A parent Controller process is the user's interface to the whole simulation, and it merges the diagnostic information from the various components, as seen in Figure 1. Our ultimate goal is to be able to utilize a system with ≥ 1 machines on a network, where each machine has ≥ 1 CPU core and ≥ 1 compute-capable GPU. Our short-term plan, however, is just to explore the use of a single GPU.

Figure 1: Schematic of the connections in the system. Python processes communicate control commands with each other and the native backends, while the backends communicate directly with each other using high-performance systems, such as MPI.

1.2 PyCUDA

PyCUDA is a module for Python that exposes the CUDA driver API to Python via the Boost library [4]. It has a variety of sophisticated functionalities. At its most basic, it works as a translation layer, allowing Python code to launch CUDA kernels and transfer data to and from CUDA-enabled devices. In our case, we wrote a kernel fully in C+CUDA, set up the simulation with Python, and launched the CUDA kernel with the Python-defined data via PyCUDA.
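As a concrete illustration of this workflow, the following is a minimal sketch, not the project's actual code; the kernel body and all names are invented for the example. It shows the basic PyCUDA pattern used here: compile a C+CUDA kernel from Python, move NumPy data to the device, and launch.

    import numpy as np
    import pycuda.autoinit              # create a context on the first GPU
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    # A trivial stand-in kernel; the real interaction kernel is written in C+CUDA.
    mod = SourceModule("""
    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }
    """)
    scale = mod.get_function("scale")

    n = 1024                                    # divisible by the block size below
    x = np.random.randn(n).astype(np.float32)   # simulation state lives in NumPy
    x_gpu = cuda.mem_alloc(x.nbytes)
    cuda.memcpy_htod(x_gpu, x)                  # host -> device

    # launch the CUDA kernel from Python
    scale(x_gpu, np.float32(2.0), np.int32(n),
          block=(256, 1, 1), grid=(n // 256, 1))

    cuda.memcpy_dtoh(x, x_gpu)                  # device -> host, back into NumPy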

1.3 Test Problem

The scope of this project is a small sample simulation, to test the feasibility of using PyCUDA as a backend for particle simulations. The model to be simulated is a combination of that explored in CS267 HW2 and the N-body simulation found in NVIDIA's GPU Gems 3 and the CUDA SDK [2] [3]. The Gems simulation is a 3D all-pairs N-body gravitational (inverse-square attractive) simulation with free boundary conditions. We adapted this to be more like the simulation in HW2. The simulation is a 2D inverse-square repulsive (Coulomb) simulation, where interactions beyond a certain distance are approximated as zero (the short-range force approximation). The boundary conditions are specular (elastic) reflection off the x boundaries, and periodic boundary conditions in y.

Particles are given a random (uniformly distributed) initial velocity in the +y direction, and a random (uniformly distributed) velocity in ±x, one order of magnitude smaller than the +y velocity:

    $v_o = \frac{v}{10}\,\mathrm{rand}(-1,1)\,\hat{x} + v\,\mathrm{rand}(0,1)\,\hat{y}$    (1)

Particles are randomly distributed in the region, centered in x, filling half the width of the container and the full height:

    $r_o = \frac{X}{4}\,\mathrm{rand}(1,3)\,\hat{x} + Y\,\mathrm{rand}(0,1)\,\hat{y}$    (2)

where X ranges from Y/10 to Y/2.

Figure 2: Schematic of the simulation. As particles leave the top, they wrap around the bottom.

The resulting simulation approximates a sheet beam of electrons under free expansion, infinite in y and z; see Figure 2.
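As a hedged NumPy sketch of these initial and boundary conditions (the function names and the [0, X] × [0, Y] domain are assumptions for illustration, not taken from the report):

    import numpy as np

    def initial_conditions(n, v, X, Y, seed=None):
        """Sample positions and velocities per equations (1) and (2)."""
        rng = np.random.default_rng(seed)
        # eq. (1): v_o = (v/10) rand(-1,1) x + v rand(0,1) y
        vx = (v / 10.0) * rng.uniform(-1.0, 1.0, n)
        vy = v * rng.uniform(0.0, 1.0, n)
        # eq. (2): r_o = (X/4) rand(1,3) x + Y rand(0,1) y
        rx = (X / 4.0) * rng.uniform(1.0, 3.0, n)
        ry = Y * rng.uniform(0.0, 1.0, n)
        return np.stack([rx, ry], axis=1), np.stack([vx, vy], axis=1)

    def apply_boundaries(r, v, X, Y):
        """Specular reflection off the x walls, periodic wrap in y."""
        low = r[:, 0] < 0.0
        r[low, 0] *= -1.0            # mirror position back into the domain
        v[low, 0] *= -1.0            # reverse x velocity
        high = r[:, 0] > X
        r[high, 0] = 2.0 * X - r[high, 0]
        v[high, 0] *= -1.0
        r[:, 1] %= Y                 # periodic in y
        return r, v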

2 Implementation

Starting with the Gems N-body kernel, our major adaptations were to account for the short-range approximation, reduce the dimension to 2D, and apply boundary conditions. The short-range interaction means that most interactions are zero. To deal with this, the model, as in HW2, was to break up the simulation into subregions. All interactions within a subregion are computed. Each subregion also contains ghost particles from neighboring subregions within a cutoff distance. In 2D, it takes 15 FLOPS to compute a non-zero interaction, and 6 FLOPS to compute a zero interaction, as seen in the interaction routine below (bi, bj, r, and ai come from the enclosing per-pair routine):

    // r_ij  [2 FLOPS]
    r.x = bj.x - bi.x;
    r.y = bj.y - bi.y;
    // distSqr = dot(r_ij, r_ij) + epsilon^2  [4 FLOPS]
    T distSqr = r.x * r.x + r.y * r.y;
    distSqr += getSofteningSquared<T>();
    // branch: non-zero only within the cutoff (cutoff holds the squared cutoff distance)
    if (distSqr < cutoff) {
        // invDistCube = 1 / distSqr^(3/2)  [4 FLOPS (2 mul, 1 sqrt, 1 inv)]
        T invDist = rsqrt_T(distSqr);
        T invDistCube = invDist * invDist * invDist;
        // s = m_j * invDistCube  [1 FLOP]
        T s = bj.w * invDistCube;
        // a_i = a_i + s * r_ij  [4 FLOPS]
        ai.x += r.x * s;
        ai.y += r.y * s;
    }

2.1 Subregions

For simplicity, we only implemented 1D subdivisions, so each subregion fills the width of the system for a slice in y. The model computes a nonzero interaction (15 FLOPS) for every particle within the cutoff range, and a zero interaction (6 FLOPS) for every particle in the subdomain that is not within the cutoff range. By comparing the area of the cutoff circle to the area of the subdomain, we can see approximately how much work we waste:

    $A_{\mathrm{sub}} = X \left( \frac{Y}{n_{\mathrm{subs}}} + 2 r_c \right)$    (3)

    $A_{\mathrm{nonzero}} = \pi r_c^2$    (4)

    $\text{real work} \propto 15\, A_{\mathrm{nonzero}}$    (5)

    $\text{wasted effort} \propto 6\, (A_{\mathrm{sub}} - A_{\mathrm{nonzero}})$    (6)

It would make sense to have square subdomains only slightly larger than the cutoff radius; this would cut down on the wasted effort a great deal. However, the bookkeeping involved in such a scheme requires a great many if-tests and memory accesses, which are much more expensive relative to computation on a GPU than on a CPU. So we decided to waste FLOPS rather than spend extra time and memory on bookkeeping, as is common practice when programming GPUs. We intended to quantify the performance difference, but did not succeed in implementing the latter method in time.
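To make the estimate of equations (3)-(6) concrete, here is a small sketch (function and variable names are assumed for illustration) that evaluates the real and wasted per-particle work for the subdivision counts plotted in Figure 4:

    import math

    def work_estimate(X, Y, r_c, n_subs, flops_nonzero=15, flops_zero=6):
        """Relative per-particle work, following equations (3)-(6)."""
        A_sub = X * (Y / n_subs + 2.0 * r_c)        # searched area, eq. (3)
        A_nonzero = math.pi * r_c ** 2              # non-zero interaction area, eq. (4)
        real = flops_nonzero * A_nonzero            # eq. (5)
        wasted = flops_zero * (A_sub - A_nonzero)   # eq. (6)
        return real, wasted

    # the parameters used in the performance plots: X = Y/10, r_c = Y/256
    Y = 1.0
    for n_subs in (4, 16, 64, 256):
        print(n_subs, work_estimate(Y / 10.0, Y, Y / 256.0, n_subs))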

Figure 3: Diagram showing the cutoff radius of the red particle near the edge of a subdomain, including the ghost particles. The ghost region is the dotted line; particles outside the dotted line reside in a different subregion and are not considered for interactions, since they are guaranteed to be outside r_c.

Figure 4: Total work computing interactions (including zeros) versus the size of the cutoff distance relative to the simulated system (Y/r_c), for 4, 16, 64, and 256 subregions. The dotted line is the amount of work spent calculating non-zero interactions.

The scheme can be summarized as follows:

    for each subdomain S:                                    # parallel (rows in grid)
        for each particle p in S excluding ghost particles:  # parallel (threads / thread blocks)
            for each particle q in S including ghost particles:
                interact(p, q)
            advance(p)
            if p will leave:
                add p to send queues
        send / recv new particles
        send / recv ghost particles

2.2 Threads and Blocks

The kernel is run with a number of CUDA thread blocks that is evenly divisible by the number of subdivisions, so that each subdivision gets the same number of thread blocks. It is a single-particle kernel, and the number of blocks is arranged such that there is a maximum number of threads (and thus particles) in a given subdomain. Given the relative uniformity of this simulation, we simply use a safety factor of 1.5; that is, we allow for 1.5× the number of particles expected from the average density. This scheme of thread allocation is not appropriate for nonuniform workloads, but is adequate for the test problem; a sketch of the allocation rule follows.
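As a hedged sketch of this block-allocation rule (the names and the exact rounding are assumptions, not the project's code):

    import math

    def launch_config(n_particles, n_subs, tpb, safety=1.5):
        """Blocks sized for safety * the average particles per subregion,
        with the total evenly divisible by the number of subregions."""
        capacity = safety * n_particles / n_subs      # particles allowed per subregion
        blocks_per_sub = math.ceil(capacity / tpb)    # threads come in whole blocks
        total_blocks = blocks_per_sub * n_subs        # evenly divisible by n_subs
        return total_blocks, (tpb, 1, 1)

    # e.g., 100000 particles over 64 subregions at 64 threads per block
    grid_blocks, block_dim = launch_config(100000, 64, 64)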

3 Performance

We ran our simulation in single precision on NVIDIA GT200 GPUs (a GTX 260 and a Tesla C1060). We found that wasted computation is minimized by choosing the subregion height to be about 2× the cutoff distance; our plots all use a cutoff distance r_c = Y/256 and X = Y/10.

Figure 5: Performance (GFLOPS) versus N for various arrangements of threads per block (tpb) on the C1060 and GTX 260. Note that 32 tpb is slower than larger numbers, but on average 64 tpb and 256 tpb are very similar.

Figure 6: Runtime (normalized) versus N for various configurations. Note the step function for 256 threads per block.

3.1 Periodicity

In the plot of performance in Figure 5, there is an interesting periodic structure. Looking at a closeup of time versus N in Figures 6 and 7, we see that the runtime is close to a step function for larger thread blocks. The length of the steps appears to be proportional to (and slightly greater than) the number of threads per block times the number of SMs on the chip. This makes sense, because if the GPU is fully utilized, it is running one thread block on each SM at a time. If the number of blocks is evenly divisible by the number of SMs, then the GPU will be near full load for the whole simulation. However, if N_blocks % N_SMs = 1, then the GPU will spend an entire round with only one SM occupied. Further, blocks added after that can run concurrently with the first new block without significantly affecting the time for the pass, as illustrated in Figure 8.

Figure 7: Closeup of the step function in time for 256 threads per block. Note that the period is 10% shorter for the GTX 260, corresponding to having 10% fewer SMs.

This behavior yields a timing function describing the run time of the simulation:

    $t = s\, t_0\, ((B + SM - 1) / SM)$    (7)

where t_0 is the typical time it takes to evaluate a block, s is the number of steps to evolve, B is the number of thread blocks, SM is the number of SMs on the GPU, and the / represents integer division. Further, B = 1.5N/tpb, rounded up for even divisibility among the subregions. Noting that the wasted time is a function of the size of a thread block, there is an incentive to keep the number of threads per block low. However, the overhead of switching puts a lower limit on that number. Running with 64 threads per block gets approximately the same average performance as 256, but with a much smoother line.
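A minimal sketch of this timing model, assuming the B = 1.5N/tpb rule above (all names here are illustrative only):

    import math

    def predicted_time(n_particles, tpb, n_subs, n_sm, steps, t0=1.0):
        """Run-time estimate per eq. (7): t = s * t0 * ((B + SM - 1) // SM)."""
        blocks_per_sub = math.ceil(1.5 * n_particles / n_subs / tpb)
        B = blocks_per_sub * n_subs          # rounded up for divisibility among subregions
        rounds = (B + n_sm - 1) // n_sm      # integer division, as in eq. (7)
        return steps * t0 * rounds

    # SM = 27 on the GTX 260 and SM = 30 on the Tesla board (see Figure 8)
    t_gtx260 = predicted_time(100000, tpb=256, n_subs=64, n_sm=27, steps=1000)
    t_c1060 = predicted_time(100000, tpb=256, n_subs=64, n_sm=30, steps=1000)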

Figure 8: Illustration of timing as a function of the number of thread blocks. B is the number of thread blocks, SM is the number of SMs on the GPU; SM = 27 on the GTX 260, and SM = 30 on the Tesla board.

3.2 Overhead

There is overhead associated with reducing the number of nonzero interactions. Each timestep, particles must be communicated between subregions, and it takes significant work to identify and move those particles. However, since this amounts simply to moving a particle in device memory, the penalty is small. If the workload were not so evenly distributed, there would potentially be a large penalty for having a significant number of idle threads.

3.3 Divergence

The if statement in our kernel causes some threads to execute the zero-interaction path while others evaluate the full interaction. In CUDA, branch divergence causes a serious performance hit. Branch divergence is per warp (a set of 32 threads), so as long as every thread within a given warp takes the same branch, there should be no penalty. However, we were unable to fully avoid branch divergence, which cost us some performance. To illustrate this, we set the cutoff to be larger than a subregion, as in Figure 9, such that the if statement is always true; we see a performance increase of approximately 5% in Figure 10. Of course, this simulation may be faster,

but it no longer makes sense, because the model requires that all particles within the cutoff range be included in the force calculation, and only particles within the subregion are actually used. This results in an error where the force calculated on particles closer to the subregion edge than the cutoff is more strongly directed towards the outside of the subregion than if the model were properly calculated.

Figure 9: Diagram showing the cutoff radius larger than the subregion, used to eliminate divergence. Particles outside the box should be included in the model, but are outside the search space, which includes only the subdomain.

Figure 10: Performance (GFLOPS) versus N for the divergent and (invalid) convergent codes at 64 and 256 tpb. The divergence penalty appears to be about 5%. The divergence also appears to introduce some variability into the performance, as the case without divergence seems smoother.

4 Conclusions and Future Plans

We managed to utilize 23% of the theoretical 1 TFLOP/s performance of the Tesla GPU.¹ By comparison, the original GPU Gems implementation, which does a complete all-pairs simulation, gets closer to roughly 50% of peak. Clearly there is room for improvement, but since our scheme computes only N^2/128 interactions, a factor-of-two performance shortfall versus a full N^2 model is quite acceptable.

In our implementation, we made the assumption that the overhead of a more thorough bookkeeping method would overwhelm the wasted computation of evaluating many zero interactions. It would be both interesting and prudent to explore such a scheme in order to examine the validity of this assumption. Even keeping the same 1D scheme we have, we could make the slices across x instead of across y. This would dramatically reduce the amount of particle communication, since particles travel predominantly in the y direction, but it could negatively impact the load balancing, since the particles are not uniformly distributed in x.

PyCUDA was not integral to the performance of the code, nor did it hinder it noticeably. The kernel was handwritten in C+CUDA, adapted from the Gems source available from NVIDIA, and PyCUDA was simply used to set up the simulation and launch the kernel. In this way, PyCUDA allowed the simple part of the code to be trivial, by letting it be written in Python, and the hard code to be fast, by running the raw CUDA kernel.

Our original goal was to compare this simple method to one or two more sophisticated methods, but this implementation took longer than anticipated. We hope to explore various schemes further, particularly ones more suitable to plasma simulations [9] [10]. Over the coming months (and longer), we plan to build a more complete PIC and tree-code simulation on the GPU with PyCUDA to form the starting backend for our broader Python PIC project.

There were also plenty of iterations of this code not discussed here which, running afoul of coalesced and aligned memory access, defaulting double-precision constants, and improper synchronization, managed to be far slower than a naive serial N^2 implementation. The general conclusion of the experience is that it is manageable to make a problem reasonably fast on a GPU, but it is very easy to make it very slow.

¹ In case you remember a different number from my poster: I had made a favorable miscalculation there, giving closer to 30% of peak.

References

[1] J.P. Verboncoeur, A.B. Langdon, and N.T. Gladd, "An Object-Oriented Electromagnetic PIC Code", Comp. Phys. Comm. 87.

[2] L. Nyland, M. Harris, and J. Prins, "Fast N-Body Simulation with CUDA", GPU Gems 3, Chapter 31, ch31.html

[3] CUDA SDK, http:// get.html

[4] A. Klöckner, PyCUDA.

[5] NumPy.

[6] SciPy.

[7] matplotlib.

[8] Chaco.

[9] J. Barnes and P. Hut, "A Hierarchical O(N log N) Force Calculation Algorithm", Nature 324.

[10] L. Greengard et al., "A New Version of the Fast Multipole Method for Screened Coulomb Interactions in Three Dimensions", JCP 180.
