CS267 Report Particle Simulation on a GPU with PyCUDA
Min Ragan-Kelley (minrk@berkeley.edu), May 12

1 Introduction

This report covers a small test problem within the context of a larger long-term research project. GPUs are increasingly popular for particle methods, due to the readily apparent parallelism inherent in N-body problems. Particle-In-Cell (PIC) is a popular scheme for exploring systems in plasma physics. We hope to explore a small sample problem in order to gauge the viability of using PyCUDA to run interactive-scale particle simulations on a GPU for use in our research.

1.1 Project Context

The Plasma Theory and Simulation Group (PTSG) at UC Berkeley maintains a large Object-Oriented Particle-In-Cell (OOPIC) code, largely built in the mid-1990s [1]. It is primarily a 2D system, and it is a serial code. Applications of OOPIC range from small interactive simulations, running many timesteps per second on a laptop, to large offline simulations taking days or weeks to run. The goal of the object-oriented design is to facilitate continued development of new physics behaviors over the lifetime of the code. This design has been quite successful, and OOPIC continues to enjoy active use and development.

OOPIC's shortcomings in the present landscape are twofold. First, the User Interface (UI) is mouse-only, button-based input. This makes programmatic interaction with the simulation impossible. The available diagnostics are also static: if a user would like to add a derivative diagnostic (for example, given diagnostics J and B, add J × B), their only option is to add it to the C++ source code and recompile the program. The second major shortcoming is performance. OOPIC is a fast serial code, but parallelism was not considered in the original design. Basic parallel functionality has been implemented with MPI, but it does not scale beyond a few processors.
The goal of this project is to build a system with extensibility similar to the existing OOPIC, but one that solves its parallel performance and interface problems. Our plan is to write the application logic in Python, with the performance components as native C or CUDA kernels. Adding new physics capabilities should be at least as well facilitated as in current OOPIC by
a modular design. Because the interface itself is Python, we get a very powerful programmatic interface and user environment. Having NumPy arrays as the native data structure for diagnostics allows us to use existing plotting libraries such as matplotlib and Chaco, without having to write our own plotting system [5] [7] [8]. NumPy arrays are also the native data structure of most data analysis tools in Python, so our users gain access to all the powerful facilities of tools such as SciPy for free, just by our decision to present diagnostics as NumPy arrays [6].

Hierarchical parallelism for multi-device simulations will be part of the design from the beginning. Spatial decomposition occurs at the device level, and each spatial region communicates with the others via MPI or shared memory in host code. Each device has its own Python process, issuing control and diagnostic commands; the Python processes communicate via the IPython parallel computing kernel, and the native code backends will also be able to communicate with each other directly, via MPI. A parent Controller process is the user's interface to the whole simulation, merging the diagnostic information from the various components, as seen in Figure 1. Our ultimate goal is to be able to utilize a system of one or more machines on a network, where each machine has one or more CPU cores and one or more CUDA-capable GPUs. Our short-term plan, however, is just to explore the use of a single GPU.

Figure 1: Schematic of the connections in the system (IPython Controller, IPython Kernel, and C simulation backends). Python processes communicate control commands with each other and the native backends, while the backends communicate directly with each other using high-performance systems, such as MPI.

1.2 PyCUDA

PyCUDA is a module for Python that exposes the CUDA drivers to Python via the Boost library [4]. It has a variety of sophisticated functionalities.
At its most basic, it works as a translation layer, allowing Python code to launch CUDA kernels and transfer data to and from CUDA-enabled devices. In our case, we wrote a kernel fully in C+CUDA, set up the simulation with Python, and launched the CUDA kernel on the Python-defined data via PyCUDA.
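As a minimal sketch of that workflow (the kernel below is a hypothetical drift kernel for illustration, not the report's interaction kernel; a CUDA-capable GPU and PyCUDA are assumed at call time):

```python
# Hypothetical kernel for illustration only: each thread advances one
# particle's 2D position by vel * dt.
KERNEL_SRC = r"""
__global__ void advance(float2 *pos, const float2 *vel, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
    }
}
"""

def launch_advance(pos, vel, dt, tpb=256):
    """Compile KERNEL_SRC with PyCUDA and launch it on NumPy-defined data.
    pos and vel are contiguous float32 arrays of shape (n, 2)."""
    import numpy as np
    import pycuda.autoinit          # noqa: F401 -- creates a CUDA context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    n = len(pos)
    advance = SourceModule(KERNEL_SRC).get_function("advance")
    grid = ((n + tpb - 1) // tpb, 1)
    # cuda.InOut / cuda.In copy the arrays to the device and back for us.
    advance(cuda.InOut(pos), cuda.In(vel), np.float32(dt), np.int32(n),
            block=(tpb, 1, 1), grid=grid)
    return pos
```

Setting up initial conditions and diagnostics stays in ordinary NumPy on the host; only the inner loop lives in CUDA.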
1.3 Test Problem

The scope of this project is a small sample simulation, to test the feasibility of using PyCUDA as a backend for particle simulations. The model to be simulated is a combination of the one explored in CS267 HW2 and the N-body simulation found in NVIDIA's GPU Gems 3 and the CUDA SDK [2] [3]. The Gems simulation is a 3D all-pairs N-body gravitational (inverse-square attractive) simulation with free boundary conditions. We adapted this to be more like the simulation in HW2. Our simulation is a 2D inverse-square repulsive (Coulomb) simulation, where interactions beyond a certain distance are approximated as zero (short-range force approximation). The boundary conditions are specular (elastic) reflection off the x boundaries, and periodic boundary conditions in y. Particles are given a random (uniformly distributed) initial velocity in the +y direction, and a random (uniformly distributed) velocity in ±x, one order of magnitude smaller than the +y velocity:

    v_0 = (v/10 · rand(−1, 1)) x̂ + (v · rand(0, 1)) ŷ    (1)

Particles are randomly distributed in the region, centered in x, filling half the width of the container and the full height:

    r_0 = (X/4 · rand(1, 3)) x̂ + (Y · rand(0, 1)) ŷ    (2)

where X ranges from Y/10 to Y/2.

Figure 2: Schematic of the simulation. As particles leave the top, they wrap around the bottom.

The resulting simulation approximates a sheet beam of electrons under free expansion, infinite in y and z; see Figure 2.

2 Implementation

Starting with the Gems N-body kernel, our major adaptations were to account for the short-range approximation, reduce the dimension to 2D, and apply boundary conditions. The short-range interaction means that most interactions are zero. To deal with this, the model, as in HW2, was to
break up the simulation into subregions. All interactions within a subregion are computed. Each subregion also contains ghost particles from neighboring subregions within the cutoff distance. In 2D, it takes 15 FLOPS to compute a non-zero interaction, and 6 FLOPS to compute a zero interaction, as seen in the interaction routine below:

    // r_ij  [2 FLOPS]
    r.x = bj.x - bi.x;
    r.y = bj.y - bi.y;

    // distSqr = dot(r_ij, r_ij) + epsilon^2  [4 FLOPS]
    T distSqr = r.x * r.x + r.y * r.y;
    distSqr += getSofteningSquared<T>();

    T s = 0;
    if (distSqr < cutoffSqr) {  // branch
        // invDistCube = 1 / distSqr^(3/2)  [4 FLOPS (2 mul, 1 sqrt, 1 inv)]
        T invDist = rsqrt_T(distSqr);
        T invDistCube = invDist * invDist * invDist;

        // s = m_j * invDistCube  [1 FLOP]
        s = bj.w * invDistCube;
    }

    // a_i = a_i + s * r_ij  [4 FLOPS]
    ai.x += r.x * s;
    ai.y += r.y * s;

2.1 Subregions

For simplicity, we only implemented 1D subdivisions, so each subregion fills the width of the system for a slice in y. The model computes a nonzero interaction (15 FLOPS) for every particle within the cutoff range, and a zero interaction (6 FLOPS) for every particle in the subdomain that is not within the cutoff range. By comparing the area within the cutoff radius to the area of the subdomain, we can see approximately how much work we waste:

    A_sub = X (Y/n_subs + 2 r_c)    (3)
    A_nonzero = π r_c²    (4)
    real work ∝ 15 A_nonzero    (5)
    wasted effort ∝ 6 (A_sub − A_nonzero)    (6)

It would make sense to have square subdomains only slightly larger than the cutoff radius; this would cut down on the wasted effort a great deal. However, the bookkeeping involved in such a scheme requires a great many if-tests and memory accesses, which are much more expensive relative to computation on a GPU than on a CPU. So we decided to waste FLOPS rather than spend extra time and memory on bookkeeping, as is common practice when programming GPUs.
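Equations (3)-(6) are easy to evaluate numerically. A small sketch (the function name and sample values are ours, using Y = 1, X = Y/10, and r_c = Y/256 as in the plots of Section 3):

```python
import math

def work_estimate(X, Y, r_c, n_subs):
    """Approximate useful vs. wasted work per particle for a 1D slice
    decomposition in y with ghost regions of width r_c (Eqs. 3-6)."""
    A_sub = X * (Y / n_subs + 2 * r_c)    # area searched per particle   (3)
    A_nonzero = math.pi * r_c ** 2        # area with a nonzero force    (4)
    real = 15 * A_nonzero                 # 15 FLOPS per nonzero pair    (5)
    wasted = 6 * (A_sub - A_nonzero)      # 6 FLOPS per zero pair        (6)
    return real, wasted

# More subregions shrink the searched area, so less effort is wasted,
# while the useful work is unchanged.
for n_subs in (4, 16, 64, 256):
    real, wasted = work_estimate(0.1, 1.0, 1.0 / 256, n_subs)
    print(n_subs, wasted / real)
```

The useful work depends only on r_c, so the ratio of wasted to real work falls as the slices get thinner, down to the floor set by the 2 r_c ghost band.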
We intended to quantify the performance difference, but did not succeed in implementing the latter method in time.

Figure 3: Diagram showing the cutoff radius of the red particle near the edge of a subdomain, including the ghost particles. The ghost region is bounded by the dotted line; particles outside the dotted line reside in a different subregion and are not considered for interactions, since they are guaranteed to be outside r_c.

Figure 4: Total work computing interactions (including zeros) for 4, 16, 64, and 256 subregions, versus the size of the cutoff distance relative to the simulated system (Y/r_c). The dotted line is the amount of work spent calculating non-zero interactions.
The scheme can be summarized as follows:

    for each subdomain S:                                   # parallel (rows in grid)
        for each particle p in S excluding ghost particles: # parallel (threads/thread blocks)
            for each particle q in S including ghost particles:
                interact(p, q)
            advance(p)
            if p will leave:
                add p to send queues
    send/recv new particles
    send/recv ghost particles

2.2 Threads and Blocks

The kernel is run with a number of CUDA thread blocks that is evenly divisible by the number of subdivisions, so that each subdivision gets the same number of thread blocks. It is a single-particle kernel, and the number of blocks is arranged such that there is a maximum number of threads (and thus particles) in a given subdomain. Given the relative uniformity of this simulation, we simply use a safety factor of 1.5, allowing for 1.5x the number of particles expected from the average density. This scheme of thread allocation is not appropriate for nonuniform workloads, but is adequate for the test problem.

3 Performance

We ran our simulation in single precision on NVIDIA GT200 GPUs (GTX 260 and Tesla C1060). We found that choosing the subregion height to be twice the cutoff distance minimizes wasted computation; our plots all use a cutoff distance r_c = Y/256 and X = Y/10.
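The block-allocation arithmetic above can be sketched as follows (the function name is ours):

```python
import math

def num_thread_blocks(n_particles, n_subs, tpb, safety=1.5):
    """Reserve safety * the average particle count per subdomain, round up
    to whole blocks, and give every subdomain the same number of blocks so
    the total is evenly divisible by the number of subdivisions."""
    per_sub = safety * n_particles / n_subs
    blocks_per_sub = max(1, math.ceil(per_sub / tpb))
    return blocks_per_sub * n_subs

# e.g. 100,000 particles in 128 subregions at 256 threads per block:
# 1.5 * 100000 / 128 = 1171.875 slots -> 5 blocks per subregion -> 640 blocks.
```

Because every subregion gets the same block count, a subregion that temporarily exceeds the 1.5x allowance would overflow its threads; the uniform test problem stays safely below that bound.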
Figure 5: Performance (GFLOPS versus N) for various arrangements of threads per block (tpb) on the C1060 and GTX 260. Note that 32 tpb is slower than larger numbers, but on average 64 tpb and 256 tpb are very similar.

Figure 6: Runtime (normalized) versus N for the same configurations. Note the step function for 256 threads per block.

3.1 Periodicity

In the plot of performance in Figure 5, there is an interesting periodic structure. Looking at a closeup of time versus N in Figures 6 and 7, we see that the runtime is close to a step function for larger thread blocks. The length of the steps appears to be proportional to (and slightly greater than) the number of threads per block times the number of SMs on the chip. This makes sense: if the GPU is fully utilized, it is running one thread block on each SM at a time. If the number of blocks is evenly divisible by the number of SMs, then the GPU will be near full load for the whole simulation. However, if N_blocks % N_SMs = 1, then the GPU will spend an entire round with only one SM occupied. Further, blocks added after that will be able to run concurrently with
the first new block, and will not significantly affect the time until the round is again full, as illustrated in Figure 8. This leads to a timing function describing the run time of the simulation:

    t = s · t_0 · ((B + SM − 1) / SM)    (7)

where t_0 is the typical time it takes to evaluate a block, s is the number of steps to evolve, B is the number of thread blocks, SM is the number of SMs on the GPU, and the / represents integer division. Further, B = 1.5 N / tpb, rounded up for even divisibility among subregions. Noting that the wasted time is a function of the size of a thread block, there is an incentive to keep the number of threads per block low. However, the overhead of switching puts a lower limit on that number. Running with 64 threads per block gets approximately the same average performance as 256, but with a much smoother curve.

Figure 7: Closeup of the step function in time for 256 threads per block on the C1060 and GTX 260. Note that the period is 10% shorter for the 260, corresponding to its having 10% fewer SMs.
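Equation (7) can be written out directly as code, showing where the steps in Figures 6 and 7 come from (helper name ours, t_0 normalized to 1):

```python
def model_time(n, tpb, n_sms, steps=1, t0=1.0):
    """Eq. (7): runtime is set by the number of rounds of thread blocks,
    one block per SM per round, so partial rounds cost as much as full ones."""
    blocks = -(-int(1.5 * n) // tpb)        # B = 1.5 N / tpb, rounded up
    rounds = (blocks + n_sms - 1) // n_sms  # integer ceiling of B / SM
    return steps * t0 * rounds

# On a 30-SM C1060 at 256 tpb, a few extra particles past a step boundary
# add a whole round: N = 5120 fills 30 blocks exactly, N = 5200 needs 31.
```

The model omits the divisibility rounding among subregions, but it reproduces the step length of roughly tpb × SM / 1.5 particles.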
Figure 8: Illustration of timing (t) for various numbers of thread blocks. B is the number of thread blocks, SM is the number of SMs on the GPU. SM = 27 on the GTX 260, and SM = 30 on the Tesla board.

3.2 Overhead

There is overhead associated with reducing the number of nonzero interactions. Each timestep, particles must be communicated between subregions, and it takes significant work to identify and move those particles. However, since this amounts simply to moving a particle in device memory, the penalty is small. If the workload were not so evenly distributed, there would potentially be a large penalty for having a significant number of idle threads.

3.3 Divergence

The if statement in our kernel causes some threads to execute the zero-interaction path, and some to evaluate the full interaction. In CUDA, branch divergence causes a serious performance hit. Branch divergence is per-warp (a set of 32 threads), so as long as every thread within a given warp takes the same branch, there should be no penalty. However, we were unable to fully avoid branch divergence, which cost a great deal of performance. To illustrate this, we set the cutoff to be larger than a subregion, as in Figure 9, such that the if statement is always true. We see a performance increase of approximately 5% in Figure 10. Of course, this simulation may be faster,
but it no longer makes sense, because the model requires that all particles within the cutoff range be included in the force calculation, while only particles within the subregion are actually used. This results in an error where the force calculated on particles closer to the subregion edge than the cutoff is more strongly directed towards the outside of the subregion than if the model were properly calculated.

Figure 9: Diagram showing the cutoff radius larger than the subregion, used to eliminate divergence. Particles outside the box should be included in the model, but are outside the search space, which includes only the subdomain.

Figure 10: Performance (GFLOPS versus N) for the divergent and the invalid convergent codes, at 64 and 256 tpb. The divergence penalty appears to be about 5%. The divergence also appears to introduce some variability into the performance, as the case without divergence seems smoother.
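The per-warp granularity described in Section 3.3 can be modeled in a few lines (our own illustration, not the report's measurement): a warp only pays the divergence penalty when its 32 threads disagree on the branch.

```python
def divergent_warps(in_cutoff, warp_size=32):
    """Count warps containing both branch outcomes; uniform warps run the
    if statement without serializing the two paths."""
    warps = [in_cutoff[i:i + warp_size]
             for i in range(0, len(in_cutoff), warp_size)]
    return sum(1 for w in warps if any(w) and not all(w))

# 128 threads whose in-cutoff flags are spatially sorted diverge in only
# the warp spanning the boundary; the same flags interleaved diverge everywhere.
sorted_flags = [True] * 40 + [False] * 88
mixed_flags = [bool(i % 2) for i in range(128)]
print(divergent_warps(sorted_flags), divergent_warps(mixed_flags))
```

This is why sorting particles so that warp-mates see similar neighborhoods would reduce the penalty, at the cost of exactly the kind of bookkeeping we chose to avoid.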
4 Conclusions and Future Plans

We managed to utilize 23% of the theoretical 1 TFLOP performance of the Tesla GPU.¹ The original GPU Gems implementation, which does a complete all-pairs simulation, gets closer to 50% of peak. Clearly there is room for improvement, but since our scheme only computes N²/128 interactions, a factor-of-two performance shortfall versus a full N² model is quite acceptable.

In our implementation, we made the assumption that the overhead of a more thorough bookkeeping method would overwhelm the wasted computation of evaluating many zero interactions. It would be both interesting and prudent to explore such a scheme in order to examine the validity of this assumption. Even keeping the same 1D scheme, we could make the slices across x instead of across y. This would dramatically reduce the amount of particle communication, since particles travel predominantly in the y direction, but it could negatively impact load balancing, since the particles are not uniformly distributed in x.

PyCUDA was not integral to the performance of the code, nor did it hinder it noticeably. The kernel was handwritten in C+CUDA, adapted from the Gems source available from NVIDIA, and PyCUDA was simply used to set up the simulation and launch the kernel. In this way, PyCUDA allowed the simple part of the code to be trivial, by letting it be written in Python, and the hard code to be fast, by running the raw CUDA kernel. Our original goal was to compare this simple method to one or two more sophisticated methods, but this implementation took longer than anticipated. We hope to explore various schemes further, particularly ones more suitable to plasma simulations [9] [10]. Over the coming months (and longer), we plan to build a more complete PIC and tree-code simulation on the GPU with PyCUDA, to form the starting backend for our broader Python PIC project.
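The x-versus-y slicing tradeoff can be illustrated with a quick count (our own sketch; the particle distribution follows Eq. (2), with illustrative X = Y = 1):

```python
import random

def slice_counts(positions, n_slices, extent, axis):
    """Count particles per 1D slice along the given axis (0 = x, 1 = y)."""
    counts = [0] * n_slices
    for p in positions:
        i = min(int(p[axis] / extent * n_slices), n_slices - 1)
        counts[i] += 1
    return counts

# Particles span only x in [X/4, 3X/4] but all of y, so x slices leave the
# outer quarters idle while y slices stay roughly balanced.
rng = random.Random(0)
X = Y = 1.0
pts = [((X / 4) * rng.uniform(1, 3), Y * rng.uniform(0, 1))
       for _ in range(4000)]
x_counts = slice_counts(pts, 4, X, axis=0)
y_counts = slice_counts(pts, 4, Y, axis=1)
print(x_counts, y_counts)
```

With four x slices, the outer two receive no particles at all, so half the thread blocks would sit idle; the y slices share the load evenly but must exchange particles every step.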
There were also plenty of iterations of this code, not discussed here, which, by running afoul of coalesced and aligned memory access, constants defaulting to double precision, and improper synchronization, managed to be far slower than a naive serial N² implementation. The general conclusion of the experience is that it is manageable to make a problem reasonably fast on a GPU, but it is very easy to make it very slow.

¹ In case you remember a different number from my poster: I had made a favorable miscalculation there, giving closer to 30% of peak.

References

[1] J.P. Verboncoeur, A.B. Langdon, and N.T. Gladd, An Object-Oriented Electromagnetic PIC Code, Comp. Phys. Comm. 87.
[2] L. Nyland, M. Harris, and J. Prins, Fast N-Body Simulation with CUDA, GPU Gems 3, Chapter 31.
[3] NVIDIA CUDA SDK.
[4] A. Klöckner, PyCUDA.
[5] NumPy.
[6] SciPy.
[7] matplotlib.
[8] Chaco.
[9] J. Barnes and P. Hut, A Hierarchical O(N log N) Force Calculation Algorithm, Nature 324.
[10] L. Greengard et al., A New Version of the Fast Multipole Method for Screened Coulomb Interactions in Three Dimensions, JCP 180.
More informationMolecular Dynamics Simulations with Julia
Emily Crabb 6.338/18.337 Final Project Molecular Dynamics Simulations with Julia I. Project Overview This project consists of one serial and several parallel versions of a molecular dynamics simulation
More informationIntroduction to CUDA
Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationInterval arithmetic on graphics processing units
Interval arithmetic on graphics processing units Sylvain Collange*, Jorge Flórez** and David Defour* RNC'8 July 7 9, 2008 * ELIAUS, Université de Perpignan Via Domitia ** GILab, Universitat de Girona How
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationAsynchronous OpenCL/MPI numerical simulations of conservation laws
Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation
More informationOverview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size
Overview Videos are everywhere But can take up large amounts of resources Disk space Memory Network bandwidth Exploit redundancy to reduce file size Spatial Temporal General lossless compression Huffman
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationarxiv: v1 [cs.dc] 24 Feb 2010
Deterministic Sample Sort For GPUs arxiv:1002.4464v1 [cs.dc] 24 Feb 2010 Frank Dehne School of Computer Science Carleton University Ottawa, Canada K1S 5B6 frank@dehne.net http://www.dehne.net February
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationHighly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs
Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs Kaixi Hou, Hao Wang, Wu chun Feng {kaixihou, hwang121, wfeng}@vt.edu Jeffrey S. Vetter, Seyong Lee vetter@computer.org, lees2@ornl.gov
More informationCS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose
CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose Synchronization Ideal case for parallelism: no resources shared between threads no communication between threads Many
More informationPerformance impact of dynamic parallelism on different clustering algorithms
Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu
More informationDynamic Load Balancing on Single- and Multi-GPU Systems
Dynamic Load Balancing on Single- and Multi-GPU Systems Long Chen, Oreste Villa, Sriram Krishnamoorthy, Guang R. Gao Department of Electrical & Computer Engineering High Performance Computing University
More informationCUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. Stephen Jones, GTC 2017
CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES Stephen Jones, GTC 2017 The art of doing more with less 2 Performance RULE #1: DON T TRY TOO HARD Peak Performance Time 3 Unrealistic Effort/Reward Performance
More informationBulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model
Bulk Synchronous and SPMD Programming The Bulk Synchronous Model CS315B Lecture 2 Prof. Aiken CS 315B Lecture 2 1 Prof. Aiken CS 315B Lecture 2 2 Bulk Synchronous Model The Machine A model An idealized
More informationCUDA Particles. Simon Green
CUDA Particles Simon Green sdkfeedback@nvidia.com Document Change History Version Date Responsible Reason for Change 1.0 Sept 19 2007 Simon Green Initial draft 1.1 Nov 3 2007 Simon Green Fixed some mistakes,
More informationDisk Scheduling COMPSCI 386
Disk Scheduling COMPSCI 386 Topics Disk Structure (9.1 9.2) Disk Scheduling (9.4) Allocation Methods (11.4) Free Space Management (11.5) Hard Disk Platter diameter ranges from 1.8 to 3.5 inches. Both sides
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationModelling Multi-GPU Systems 1
Modelling Multi-GPU Systems 1 Daniele G. SPAMPINATO a, Anne C. ELSTER a and Thorvald NATVIG a a Norwegian University of Science and Technology (NTNU), Trondheim, Norway Abstract. Due to the power and frequency
More informationOperating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings
Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationEXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March
EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY Stephen Abbott, March 26 2018 ACKNOWLEDGEMENTS Collaborators: Oak Ridge Nation Laboratory- Ed D Azevedo NVIDIA - Peng
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationOptimizing CUDA for GPU Architecture. CSInParallel Project
Optimizing CUDA for GPU Architecture CSInParallel Project August 13, 2014 CONTENTS 1 CUDA Architecture 2 1.1 Physical Architecture........................................... 2 1.2 Virtual Architecture...........................................
More informationPerformance Optimization Part II: Locality, Communication, and Contention
Lecture 7: Performance Optimization Part II: Locality, Communication, and Contention Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Beth Rowley Nobody s Fault but Mine
More informationBack-Projection on GPU: Improving the Performance
UNIVERSITY OF MICHIGAN Back-Projection on GPU: Improving the Performance EECS 499 Independent Study Wenlay Esther Wei 4/29/2010 The purpose of this project is to accelerate the processing speed of the
More informationN-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo
N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational
More informationPeak Performance for an Application in CUDA
Peak Performance for an Application in CUDA Nicolás Wolovick 1 Fa.M.A.F., Universidad Nacional de Córdoba, Argentina May 4, 2010 Fa.M.A.F. - U.N.C. Outline The Numeric Problem Serial Codes Our aim CUDA
More informationParallelising Pipelined Wavefront Computations on the GPU
Parallelising Pipelined Wavefront Computations on the GPU S.J. Pennycook G.R. Mudalige, S.D. Hammond, and S.A. Jarvis. High Performance Systems Group Department of Computer Science University of Warwick
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationCUDA Performance Optimization. Patrick Legresley
CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations
More informationCUDA Memory Model. Monday, 21 February Some material David Kirk, NVIDIA and Wen-mei W. Hwu, (used with permission)
CUDA Memory Model Some material David Kirk, NVIDIA and Wen-mei W. Hwu, 2007-2009 (used with permission) 1 G80 Implementation of CUDA Memories Each thread can: Grid Read/write per-thread registers Read/write
More informationCUDA Experiences: Over-Optimization and Future HPC
CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign
More informationRT 3D FDTD Simulation of LF and MF Room Acoustics
RT 3D FDTD Simulation of LF and MF Room Acoustics ANDREA EMANUELE GRECO Id. 749612 andreaemanuele.greco@mail.polimi.it ADVANCED COMPUTER ARCHITECTURES (A.A. 2010/11) Prof.Ing. Cristina Silvano Dr.Ing.
More informationA Parallel Implementation of the 3D NUFFT on Distributed-Memory Systems
A Parallel Implementation of the 3D NUFFT on Distributed-Memory Systems Yuanxun Bill Bao May 31, 2015 1 Introduction The non-uniform fast Fourier transform (NUFFT) algorithm was originally introduced by
More informationAccelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors
Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte
More information