An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos



2 Mapping Regular Code to GPUs
Regular codes:
- Operate on array- and matrix-based data structures
- Exhibit mostly strided memory access patterns
- Have relatively predictable control flow (largely determined by the input size)
Many regular codes are easy to port to GPUs, e.g., matrix codes executing many ops/word:
- Dense matrix operations (level 2 and 3 BLAS)
- Stencil codes (PDE solvers)

3 Mapping Irregular Code to GPUs
Irregular codes:
- Build, traverse, and update dynamic data structures (e.g., trees, graphs, linked lists, and priority queues)
- Exhibit data-dependent memory access patterns
- Have complex control flow (depends on input values and changes dynamically)
Many important scientific programs are irregular, e.g., data clustering, SAT solving, social networks, and meshing.
Case studies are needed to learn how to best map irregular codes.

4 Example: N-Body Simulation
Irregular Barnes Hut algorithm:
- Repeatedly builds an unbalanced tree & performs complex traversals on it
Our implementation:
- Designed for GPUs (not just a port of CPU code)
- First GPU implementation of the entire BH algorithm
Results:
- GPU is 21 times faster than a CPU (6 cores) on this code
- GPU has the better architecture for this irregular algorithm
- NVIDIA GPU Gems book chapter (2011)

5 Outline
- Introduction
- Barnes Hut algorithm
- CUDA implementation
- Experimental results
- Conclusions

6 N-Body Simulation
Time evolution of a physical system:
- The system consists of bodies; n is the number of bodies
- Bodies interact via pair-wise forces
Many systems can be modeled in this way:
- Star/galaxy clusters (gravitational force)
- Particles (electric force, magnetic force)

7 Simple O(n^2) n-Body Algorithm
Initialize body masses, positions, and velocities
Iterate over time steps {
    Accumulate the forces acting on each body
    Update body positions and velocities based on the forces
}
Output the result
More sophisticated n-body algorithms exist:
- Barnes Hut algorithm
- Fast Multipole Method (FMM)
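As a concrete illustration, one time step of the simple O(n^2) algorithm can be sketched in C++ as follows. The Body type, the step function, and the softening constant are illustrative names, not from the talk; the gravitational constant is folded into the mass units.

```cpp
#include <cmath>
#include <vector>

struct Body {
    float mass;
    float px, py, pz;   // position
    float vx, vy, vz;   // velocity
    float ax, ay, az;   // accumulated acceleration
};

// One time step: accumulate pair-wise forces, then integrate.
void step(std::vector<Body>& bodies, float dt, float softening = 1e-9f) {
    for (auto& b : bodies) { b.ax = b.ay = b.az = 0.0f; }
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = 0; j < bodies.size(); ++j) {   // O(n^2) body pairs
            if (i == j) continue;
            float dx = bodies[j].px - bodies[i].px;
            float dy = bodies[j].py - bodies[i].py;
            float dz = bodies[j].pz - bodies[i].pz;
            float d2 = dx * dx + dy * dy + dz * dz + softening;
            float inv = 1.0f / std::sqrt(d2);
            float f = bodies[j].mass * inv * inv * inv;     // m_j / d^3
            bodies[i].ax += f * dx;
            bodies[i].ay += f * dy;
            bodies[i].az += f * dz;
        }
    }
    for (auto& b : bodies) {   // update velocities and positions
        b.vx += b.ax * dt; b.vy += b.ay * dt; b.vz += b.az * dt;
        b.px += b.vx * dt; b.py += b.vy * dt; b.pz += b.vz * dt;
    }
}
```

The inner double loop is what Barnes Hut replaces with a tree traversal.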

8 Barnes Hut Idea
Precise force calculation:
- Requires O(n^2) operations (O(n^2) body pairs)
- Computationally intractable for large n
Barnes and Hut (1986):
- An algorithm to approximately compute the forces; the bodies' initial positions and velocities are approximate anyway
- Requires only O(n log n) operations
- The idea is to combine far-away bodies; the error should be small because the force is proportional to 1/distance^2

9 Barnes Hut Algorithm
Set the bodies' initial positions and velocities
Iterate over time steps:
1. Compute the bounding box around the bodies
2. Subdivide space until there is at most one body per cell; record this spatial hierarchy in an octree
3. Compute the mass and center of mass of each cell
4. Compute the force on the bodies by traversing the octree; stop a traversal path when encountering a leaf (body) or an internal node (cell) that is far enough away
5. Update each body's position and velocity
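The "far enough away" test in step 4 is conventionally the opening-angle criterion: a cell of side length s whose center of mass lies at distance d from the body may stand in for all bodies it contains when s/d is below a threshold. A minimal C++ sketch; the function name and the default theta = 0.5 are illustrative assumptions, not values from the talk.

```cpp
#include <cmath>

// Acceptance test: can this cell be used as a single pseudo-body?
// (dx, dy, dz) is the vector from the body to the cell's center of mass.
bool far_enough(float cell_size, float dx, float dy, float dz,
                float theta = 0.5f) {
    float d = std::sqrt(dx * dx + dy * dy + dz * dz);
    return cell_size < theta * d;   // i.e., s / d < theta
}
```

Smaller theta means more cells are opened, trading speed for accuracy.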

10 Build Tree (Level 1): compute the bounding box around all bodies; it forms the tree root

11 Build Tree (Level 2): subdivide space until there is at most one body per cell

12 Build Tree (Level 3): subdivide space until there is at most one body per cell

13 Build Tree (Level 4): subdivide space until there is at most one body per cell

14 Build Tree (Level 5): subdivide space until there is at most one body per cell

15 Compute Cells' Center of Mass
For each internal cell, compute the sum of the masses and the mass-weighted average position of all bodies in its subtree; the example shows only two cells

16 Compute Forces: compute the force acting upon, for example, the green body

17 Compute Force (short distance): scan the tree depth first from left to right; the green portion is already completed

18 Compute Force (down one level): the red center of mass is too close, so the traversal must go down one level

19 Compute Force (long distance): the blue center of mass is far enough away

20 Compute Force (skip subtree): therefore, the entire subtree rooted in the blue cell can be skipped

21 Pseudocode
bodyset = ...
foreach timestep do {
    bounding_box = new Bounding_Box();
    foreach Body b in bodyset { bounding_box.include(b); }
    octree = new Octree(bounding_box);
    foreach Body b in bodyset { octree.insert(b); }
    celllist = octree.cellsByLevel();
    foreach Cell c in celllist { c.summarize(); }
    foreach Body b in bodyset { b.computeForce(octree); }
    foreach Body b in bodyset { b.advance(); }
}

22 Complexity and Parallelism
bodyset = ...
foreach timestep do {              // O(n log n) + fully ordered sequential
    bounding_box = new Bounding_Box();
    foreach Body b in bodyset {    // O(n) parallel reduction
        bounding_box.include(b);
    }
    octree = new Octree(bounding_box);
    foreach Body b in bodyset {    // O(n log n) top-down tree building
        octree.insert(b);
    }
    celllist = octree.cellsByLevel();
    foreach Cell c in celllist {   // O(n) + partially ordered bottom-up traversal
        c.summarize();
    }
    foreach Body b in bodyset {    // O(n log n) fully parallel
        b.computeForce(octree);
    }
    foreach Body b in bodyset {    // O(n) fully parallel
        b.advance();
    }
}

23 Outline
- Introduction
- Barnes Hut algorithm
- CUDA implementation
- Experimental results
- Conclusions

24 Efficient GPU Code
- Coalesced main memory accesses
- Little thread divergence
- Enough threads per block
- Not too many registers per thread
- Not too much shared memory usage
- Enough (independent) blocks
- Little synchronization between blocks
- Little CPU/GPU data transfer
- Efficient use of shared memory

25 Main BH Implementation Challenges
- Uses an irregular tree-based data structure: load imbalance, little coalescing
- Complex recursive traversals: recursion not allowed, lots of thread divergence
- Memory-bound pointer-chasing operations: not enough computation to hide the latency

26 Six GPU Kernels
Read the initial data and transfer it to the GPU
for each timestep do {
1. Compute the bounding box around the bodies
2. Build the hierarchical decomposition, i.e., the octree
3. Summarize body information in the internal octree nodes
4. Approximately sort the bodies by spatial location (optional)
5. Compute the forces acting on each body with the help of the octree
6. Update the body positions and velocities
}
Transfer the result from the GPU and output it

27 Global Optimizations
- Make the code iterative (recursion not supported)
- Keep data on the GPU between kernel calls
- Use array elements instead of heap nodes
- Use one aligned array per field for coalesced accesses
(figure: objects on the heap vs. objects in an array vs. fields in separate arrays)
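The "one aligned array per field" point is the classic array-of-structures vs. structure-of-arrays trade-off. A sketch of the two layouts; the type and field names are illustrative, not the talk's actual declarations.

```cpp
#include <vector>

// Array-of-structures (poor coalescing): when thread i reads nodes[i].mass,
// adjacent threads touch addresses sizeof(NodeAoS) bytes apart.
struct NodeAoS { float mass, posx, posy, posz; int child[8]; };

// Structure-of-arrays (coalesced): when thread i reads mass[i], adjacent
// threads touch adjacent floats, so a warp's loads combine into few
// memory transactions.
struct NodesSoA {
    std::vector<float> mass, posx, posy, posz;
    std::vector<int>   child;   // 8 entries per node: index = node * 8 + octant

    explicit NodesSoA(std::size_t n)
        : mass(n), posx(n), posy(n), posz(n), child(8 * n, -1) {}  // -1 = null child
};
```

On the GPU the arrays would live in device memory, but the layout argument is the same.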

28 Global Optimizations (cont.)
- Maximize the thread count (round down to the warp size)
- Maximize the resident block count (all SMs filled)
- Pass kernel parameters through constant memory
- Use a special allocation order
- Alias arrays (56 B/node)
- Use index arithmetic
- Persistent blocks & threads
- Unroll loops over the children
(figure: bodies b0..ba occupy fixed positions at the front of the shared array; cells c5..c0 are allocated from the back)

29 Kernel 1: Bounding Box (Regular)
A parallel reduction over the bodies (figure: warps combine values in shared memory, with barriers between rounds)
Optimizations:
- Fully coalesced, fully cached
- No bank conflicts
- Minimal divergence
- Built-in min and max
- 2 red/mem, 6 red/bar
- Bodies load balanced
- 512×3 threads per SM
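The reduction in kernel 1 follows the standard halving pattern: each round combines element i with element i + stride until one value remains. A sequential C++ sketch of that pattern for a max reduction; the function name is illustrative, and on the GPU the same rounds run per block in shared memory with a barrier between rounds.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tree-shaped reduction: halve the active range each round, combining
// the two halves element-wise, until one value remains.
float reduce_max(std::vector<float> v) {
    std::size_t n = v.size();
    while (n > 1) {
        std::size_t half = (n + 1) / 2;          // round up for odd sizes
        for (std::size_t i = 0; i + half < n; ++i)
            v[i] = std::max(v[i], v[i + half]);  // combine the two halves
        n = half;                                // active range shrinks
    }
    return v[0];
}
```

The bounding-box kernel runs six such reductions at once (min and max of x, y, and z).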

30 Kernel 2: Build Octree (Irregular)
Top-down tree building
Optimizations:
- Only lock leaf pointers
- Lock-free fast path
- Light-weight lock release on the slow path
- No re-traversal after a lock acquire failure
- Combined memory fence per block
- Re-compute positions during the traversal
- Separate init kernels for coalescing
- 512×3 threads per SM

31 Kernel 2: Build Octree (cont.)
// initialize
cell = find_insertion_point(body);  // no locks, cache cell
child = get_insertion_index(cell, body);
if (child != locked) {  // skip atomic if already locked
    if (child == null) {  // fast path (frequent)
        if (null == atomicCAS(&cell[child], null, body)) {  // lock-free insertion
            // move on to next body
        }
    } else {  // slow path (first part)
        if (child == atomicCAS(&cell[child], child, lock)) {  // acquire lock
            // build subtree with new and existing body
            flag = true;
        }
    }
}
__syncthreads();  // barrier
if (threadIdx.x == 0) __threadfence();  // push data out in case of L1 caching
__syncthreads();  // barrier
if (flag) {  // slow path (second part)
    cell[child] = new_subtree;  // insert subtree and release lock
    // move on to next body
}

32 Kernel 3: Summarize Subtrees (Irregular)
Bottom-up tree traversal (figure: allocation direction vs. scan direction)
Optimizations:
- The scan order avoids deadlock
- Use the mass as a ready flag + fence; no locks, no atomics
- Use a wait-free first pass; cache the ready information
- Piggyback on the traversal: count the bodies in the subtrees
- No parent pointers
- 128×6 threads per SM

33 Kernel 4: Sort Bodies (Irregular)
Top-down tree traversal (figure: allocation direction vs. scan direction)
Optimizations (similar to kernel 3):
- The scan order avoids deadlock
- Use a data field as the flag; no locks, no atomics
- Use the counts from kernel 3
- Piggyback on the traversal: move nulls to the back
- Throttle flag-polling requests with an optional barrier
- 64×6 threads per SM

34 Kernel 5: Force Calculation (Irregular)
Multiple prefix traversals
Optimizations:
- Group similar work together: uses the sorting to minimize the size of the prefix union in each warp
- Early out (nulls are in the back)
- Traverse the whole union to avoid divergence (warp voting)
- Lane 0 controls the iteration stack for the entire warp (fits in shared memory)
- Avoid unneeded volatile accesses
- Cache tree-level-based data
- 256×5 threads per SM

35 Architectural Support
- Coalesced memory accesses & lockstep execution: all threads in a warp read the same tree node at the same time, so only one memory access per warp instead of 32
- Warp-based execution: enables data sharing within warps without synchronization
- RSQRTF instruction: quickly computes a good approximation of 1/sqrtf(x)
- Warp voting instructions: quickly perform reduction operations within a warp

36 Kernel 6: Advance Bodies (Regular)
Straightforward streaming (figure: warps stream from main memory to main memory)
Optimizations:
- Fully coalesced, no divergence
- Load balanced

37 Outline
- Introduction
- Barnes Hut algorithm
- CUDA implementation
- Experimental results
- Conclusions

38 Evaluation Methodology
Implementations:
- CUDA: irregular Barnes Hut & regular O(n^2) algorithm
- OpenMP: Barnes Hut algorithm (derived from the CUDA code)
- Pthreads: Barnes Hut algorithm (from the SPLASH-2 suite)
Systems and compilers:
- nvcc 4.0 (-O3 -arch=sm_20 -ftz=true); GeForce GTX 480, 1.4 GHz, 15 SMs, 32 cores per SM
- gcc (-O3 -fopenmp -ffast-math); Xeon X5690, 3.46 GHz, 6 cores, 2 threads per core
Inputs and metric:
- 5k, 50k, 500k, and 5M star clusters (Plummer model)
- Best runtime of three experiments, excluding I/O

39 Nodes Touched per Activity (5M Input)
Kernel activities: K1 pair reduction, K2 tree insertion, K3 bottom-up step, K4 top-down step, K5 prefix traversal, K6 integration step
(table: min/avg/max neighborhood size per kernel; the kernel 5 maximum is 6,315)
- Max tree depth: 22
- Cells have 3.1 children on average
- Prefix of 6,315 nodes (about 0.1% of 7.4M)
The BH algorithm & sorting to minimize the size of the prefix union work well.

40 Available Amorphous Data Parallelism
(chart: available parallelism [independent activities] per kernel iteration, for 5,000, 50,000, 500,000, and 5,000,000 bodies, across bounding box, tree building, summarization, sorting, force calculation, and integration)

41 Runtime Comparison
(chart: runtime per timestep [ms] for GPU CUDA, GPU O(n^2), CPU OpenMP, and CPU Pthreads at 5k, 50k, 500k, and 5M bodies)
- GPU BH inefficiency: the 5k input is too small for 5,760 to 23,040 threads
- BH vs. O(n^2) algorithm: the regular O(n^2) code is faster with fewer than about 15,000 bodies
- GPU vs. CPU (5M input): 21.1x faster than OpenMP, 23.2x faster than Pthreads

42 Kernel Performance for 5M Input
The $400 GPU delivers 228 Gflops/s on this irregular code.
(table: Gflops/s, GB/s, and runtime [ms] for kernels 1-6 of BarnesHut and for the O(n^2) code)
The GPU chip is 2.7 to 23.5 times faster than the CPU chip.
(table: per-kernel runtimes on the X5690 CPU and the GTX 480 GPU for the non-compliant fast single-precision version and the IEEE 754-compliant double-precision version)
GPU hardware is better suited for running BH than CPU hardware is, but more difficult and time consuming to program.

43 Kernel Speedups
Optimizations that are generally applicable:

bodies      avoid volatile  rsqrtf instr.  recalc. data  thread voting  full multithreading
50,000      —               1.43x          0.99x         2.04x          20.80x
500,000     —               1.47x          1.32x         2.49x          27.99x
5,000,000   —               1.46x          1.69x         2.47x          28.85x

Optimizations for irregular kernels:

bodies      throttling barrier  wait-free pre-pass  combined mem fence  sorting of bodies  sync'ed execution
50,000      —                   1.02x               1.54x               3.60x              6.23x
500,000     —                   1.21x               1.57x               6.28x              8.04x
5,000,000   —                   1.31x               1.50x               8.21x              8.60x

44 Outline
- Introduction
- Barnes Hut algorithm
- CUDA implementation
- Experimental results
- Conclusions

45 Optimization Summary
- Minimize thread divergence: group similar work together, force synchronicity
- Reduce main memory accesses: share data within the warp, combine memory fences & traversals, re-compute data, avoid volatile accesses
- Implement the entire algorithm on and for the GPU: avoid data transfers & data structure inefficiencies, wait-free pre-pass, scan the entire prefix union

46 Optimization Summary (cont.)
- Exploit hardware features: fast synchronization and thread startup, special instructions, coalesced memory accesses, even lockstep execution
- Use light-weight locking and synchronization: minimize locks, reuse fields, use fence + store operations
- Maximize parallelism: parallelize every step within and across SMs

47 Conclusions
- Irregularity does not necessarily prevent high performance on GPUs
- The entire Barnes Hut algorithm is implemented on the GPU; it builds and traverses an unbalanced octree
- The GPU is 22.5 times (float) and 6.2 times (double) faster than a high-end hyperthreaded 6-core Xeon
- Code directly for the GPU, do not merely adjust CPU code: this requires different data and code structures and benefits from different algorithmic modifications

48 Acknowledgments
Collaborators:
- Ricardo Alves (Universidade do Minho, Portugal)
- Molly O'Neil (Texas State University-San Marcos)
- Keshav Pingali (University of Texas at Austin)
Hardware and funding: NVIDIA Corporation


More information

Parallelizing Adaptive Triangular Grids with Refinement Trees and Space Filling Curves

Parallelizing Adaptive Triangular Grids with Refinement Trees and Space Filling Curves Parallelizing Adaptive Triangular Grids with Refinement Trees and Space Filling Curves Daniel Butnaru butnaru@in.tum.de Advisor: Michael Bader bader@in.tum.de JASS 08 Computational Science and Engineering

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Persistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL

Persistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL (stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s

More information

MD-CUDA. Presented by Wes Toland Syed Nabeel

MD-CUDA. Presented by Wes Toland Syed Nabeel MD-CUDA Presented by Wes Toland Syed Nabeel 1 Outline Objectives Project Organization CPU GPU GPGPU CUDA N-body problem MD on CUDA Evaluation Future Work 2 Objectives Understand molecular dynamics (MD)

More information

L20: Putting it together: Tree Search (Ch. 6)!

L20: Putting it together: Tree Search (Ch. 6)! Administrative L20: Putting it together: Tree Search (Ch. 6)! November 29, 2011! Next homework, CUDA, MPI (Ch. 3) and Apps (Ch. 6) - Goal is to prepare you for final - We ll discuss it in class on Thursday

More information

Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011

Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for

More information

Intersection Acceleration

Intersection Acceleration Advanced Computer Graphics Intersection Acceleration Matthias Teschner Computer Science Department University of Freiburg Outline introduction bounding volume hierarchies uniform grids kd-trees octrees

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

Fast BVH Construction on GPUs

Fast BVH Construction on GPUs Fast BVH Construction on GPUs Published in EUROGRAGHICS, (2009) C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, D. Manocha University of North Carolina at Chapel Hill NVIDIA University of California

More information

CS 179: GPU Programming. Lecture 7

CS 179: GPU Programming. Lecture 7 CS 179: GPU Programming Lecture 7 Week 3 Goals: More involved GPU-accelerable algorithms Relevant hardware quirks CUDA libraries Outline GPU-accelerated: Reduction Prefix sum Stream compaction Sorting(quicksort)

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road

More information

QR Decomposition on GPUs

QR Decomposition on GPUs QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of

More information

Morph Algorithms on GPUs

Morph Algorithms on GPUs Morph Algorithms on GPUs Rupesh Nasre 1 Martin Burtscher 2 Keshav Pingali 1,3 1 Inst. for Computational Engineering and Sciences, University of Texas at Austin, USA 2 Dept. of Computer Science, Texas State

More information

Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs

Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs AlgoPARC Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs 32nd ACM International Conference on Supercomputing June 17, 2018 Ben Karsin 1 karsin@hawaii.edu Volker Weichert 2 weichert@cs.uni-frankfurt.de

More information

CSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller. Computer Science Department Stony Brook University CSE 591: GPU Programming Using CUDA in Practice Klaus Mueller Computer Science Department Stony Brook University Code examples from Shane Cook CUDA Programming Related to: score boarding load and store

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Lecture 11: GPU programming

Lecture 11: GPU programming Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh

GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

More information

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60 1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC MIKE GOWANLOCK NORTHERN ARIZONA UNIVERSITY SCHOOL OF INFORMATICS, COMPUTING & CYBER SYSTEMS BEN KARSIN UNIVERSITY OF HAWAII AT MANOA DEPARTMENT

More information

Exploiting graphical processing units for data-parallel scientific applications

Exploiting graphical processing units for data-parallel scientific applications CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2009; 21:2400 2437 Published online 20 July 2009 in Wiley InterScience (www.interscience.wiley.com)..1462 Exploiting

More information

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit

More information

Optimization of Tele-Immersion Codes

Optimization of Tele-Immersion Codes Optimization of Tele-Immersion Codes Albert Sidelnik, I-Jui Sung, Wanmin Wu, María Garzarán, Wen-mei Hwu, Klara Nahrstedt, David Padua, Sanjay Patel University of Illinois at Urbana-Champaign 1 Agenda

More information

Scan Primitives for GPU Computing

Scan Primitives for GPU Computing Scan Primitives for GPU Computing Agenda What is scan A naïve parallel scan algorithm A work-efficient parallel scan algorithm Parallel segmented scan Applications of scan Implementation on CUDA 2 Prefix

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

CSE 160 Lecture 24. Graphical Processing Units

CSE 160 Lecture 24. Graphical Processing Units CSE 160 Lecture 24 Graphical Processing Units Announcements Next week we meet in 1202 on Monday 3/11 only On Weds 3/13 we have a 2 hour session Usual class time at the Rady school final exam review SDSC

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

Advanced CUDA Optimizations

Advanced CUDA Optimizations Advanced CUDA Optimizations General Audience Assumptions General working knowledge of CUDA Want kernels to perform better Profiling Before optimizing, make sure you are spending effort in correct location

More information

General Transformations for GPU Execution of Tree Traversals

General Transformations for GPU Execution of Tree Traversals General Transformations for GPU Execution of Tree Traversals Michael Goldfarb, Youngjoon Jo and Milind Kulkarni School of Electrical and Computer Engineering Purdue University West Lafayette, IN {mgoldfar,

More information

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU

More information

A Detailed GPU Cache Model Based on Reuse Distance Theory

A Detailed GPU Cache Model Based on Reuse Distance Theory A Detailed GPU Cache Model Based on Reuse Distance Theory Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal Eindhoven University of Technology (Netherlands) Henri Bal Vrije Universiteit Amsterdam

More information

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam Convolution Soup: A case study in CUDA optimization The Fairmont San Jose Joe Stam Optimization GPUs are very fast BUT Poor programming can lead to disappointing performance Squeaking out the most speed

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Performance Optimization Part II: Locality, Communication, and Contention

Performance Optimization Part II: Locality, Communication, and Contention Lecture 7: Performance Optimization Part II: Locality, Communication, and Contention Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Beth Rowley Nobody s Fault but Mine

More information

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA 2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance

More information

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018 S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific

More information

Advanced CUDA Optimization 1. Introduction

Advanced CUDA Optimization 1. Introduction Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines

More information

CMSC 858M/AMSC 698R. Fast Multipole Methods. Nail A. Gumerov & Ramani Duraiswami. Lecture 20. Outline

CMSC 858M/AMSC 698R. Fast Multipole Methods. Nail A. Gumerov & Ramani Duraiswami. Lecture 20. Outline CMSC 858M/AMSC 698R Fast Multipole Methods Nail A. Gumerov & Ramani Duraiswami Lecture 20 Outline Two parts of the FMM Data Structures FMM Cost/Optimization on CPU Fine Grain Parallelization for Multicore

More information

Improving Performance of Machine Learning Workloads

Improving Performance of Machine Learning Workloads Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,

More information

NVIDIA GPU CODING & COMPUTING

NVIDIA GPU CODING & COMPUTING NVIDIA GPU CODING & COMPUTING WHY GPU S? ARCHITECTURE & PROGRAM MODEL CPU v. GPU Multiprocessor Model Memory Model Memory Model: Thread Level Programing Model: Logical Mapping of Threads Programing Model:

More information

Universiteit Leiden Opleiding Informatica

Universiteit Leiden Opleiding Informatica Universiteit Leiden Opleiding Informatica GPU acceleration of arbitrary precision N-body code Name: Terry Zabel Date: January 9, 2018 1st supervisor: 2nd supervisor: Prof. Dr. Aske Plaat Prof. Dr. Simon

More information

GPU & Reproductibility, predictability, Repetability, Determinism.

GPU & Reproductibility, predictability, Repetability, Determinism. GPU & Reproductibility, predictability, Repetability, Determinism. David Defour DALI/LIRMM, UPVD Phd Thesis proposal Subject Study the {numerical} predictability of GPUs Why?... fuzzy definition but useful

More information

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

GPU Programming. Parallel Patterns. Miaoqing Huang University of Arkansas 1 / 102

GPU Programming. Parallel Patterns. Miaoqing Huang University of Arkansas 1 / 102 1 / 102 GPU Programming Parallel Patterns Miaoqing Huang University of Arkansas 2 / 102 Outline Introduction Reduction All-Prefix-Sums Applications Avoiding Bank Conflicts Segmented Scan Sorting 3 / 102

More information

Treelogy: A Benchmark Suite for Tree Traversals

Treelogy: A Benchmark Suite for Tree Traversals Purdue University Programming Languages Group Treelogy: A Benchmark Suite for Tree Traversals Nikhil Hegde, Jianqiao Liu, Kirshanthan Sundararajah, and Milind Kulkarni School of Electrical and Computer

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Supporting Data Parallelism in Matcloud: Final Report

Supporting Data Parallelism in Matcloud: Final Report Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by

More information