An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm
Martin Burtscher, Department of Computer Science, Texas State University-San Marcos


Mapping Regular Code to GPUs

Regular codes:
- Operate on array- and matrix-based data structures
- Exhibit mostly strided memory access patterns
- Have relatively predictable control flow (control-flow behavior is largely determined by the input size)

Many regular codes are easy to port to GPUs, e.g., matrix codes executing many ops per word:
- Dense matrix operations (level 2 and 3 BLAS)
- Stencil codes (PDE solvers)

Mapping Irregular Code to GPUs

Irregular codes:
- Build, traverse, and update dynamic data structures (e.g., trees, graphs, linked lists, and priority queues)
- Exhibit data-dependent memory access patterns
- Have complex control flow (control-flow behavior depends on input values and changes dynamically)

Many important scientific programs are irregular, e.g., data clustering, SAT solving, social networks, and meshing. We need case studies to learn how to best map irregular codes to GPUs.

Example: N-Body Simulation

Irregular Barnes Hut algorithm:
- Repeatedly builds an unbalanced tree and performs complex traversals on it

Our implementation:
- Designed for GPUs (not just a port of CPU code)
- First GPU implementation of the entire BH algorithm

Results:
- GPU is 21 times faster than the CPU (6 cores) on this code
- GPU has the better architecture for this irregular algorithm
- NVIDIA GPU Gems book chapter (2011)

Outline
- Introduction
- Barnes Hut algorithm
- CUDA implementation
- Experimental results
- Conclusions

N-Body Simulation

Time evolution of a physical system:
- The system consists of bodies; n is the number of bodies
- Bodies interact via pair-wise forces

Many systems can be modeled in this way:
- Star/galaxy clusters (gravitational force)
- Particles (electric force, magnetic force)

Simple O(n²) N-Body Algorithm

Algorithm:
- Initialize body masses, positions, and velocities
- Iterate over time steps {
    Accumulate forces acting on each body
    Update body positions and velocities based on the forces
  }
- Output result

More sophisticated n-body algorithms exist:
- Barnes Hut algorithm
- Fast Multipole Method (FMM)
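The following is a minimal sketch of the all-pairs force accumulation described above, not the author's code: the kernel name, the float4 body layout (x, y, z, mass), and the softening constant EPS are illustrative assumptions.

__global__ void allPairsForces(const float4* __restrict__ bodies,
                               float3* __restrict__ accel, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float EPS = 0.05f;                 // softening avoids division by zero
    float4 bi = bodies[i];
    float ax = 0.0f, ay = 0.0f, az = 0.0f;

    for (int j = 0; j < n; j++) {            // O(n) work per body -> O(n^2) total
        float4 bj = bodies[j];
        float dx = bj.x - bi.x, dy = bj.y - bi.y, dz = bj.z - bi.z;
        float distSqr = dx * dx + dy * dy + dz * dz + EPS * EPS;
        float dist = sqrtf(distSqr);
        float s = bj.w / (distSqr * dist);   // m_j / r^3 (G folded into the units)
        ax += dx * s;  ay += dy * s;  az += dz * s;
    }
    accel[i] = make_float3(ax, ay, az);
}

Each body's loop is independent of all the others, which is why this regular version maps so easily onto one thread per body.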

Barnes Hut Idea

Precise force calculation:
- Requires O(n²) operations (O(n²) body pairs)
- Computationally intractable for large n

Barnes and Hut (1986):
- Algorithm to approximately compute forces (the bodies' initial positions and velocities are approximate anyway)
- Requires only O(n log n) operations
- Idea is to combine far-away bodies; the error should be small because the force is proportional to 1/distance²

Barnes Hut Algorithm

Set the bodies' initial positions and velocities, then iterate over time steps:
1. Compute the bounding box around all bodies
2. Subdivide space until there is at most one body per cell; record this spatial hierarchy in an octree
3. Compute the mass and center of mass of each cell
4. Compute the force on each body by traversing the octree; stop each traversal path when encountering a leaf (body) or an internal node (cell) that is far enough away
5. Update each body's position and velocity

Build Tree (Level 1): Compute the bounding box around all bodies; this box becomes the tree root.

Build Tree (Levels 2-5): Repeatedly subdivide space until there is at most one body per cell.

Compute Cells' Centers of Mass: For each internal cell, compute the sum of the masses and the mass-weighted average position of all bodies in its subtree (the example shows only two cells).

Compute Forces: Compute the force acting on, for example, the green body.

Compute Force (short distance): Scan the tree depth first from left to right; the green portion is already completed.

Compute Force (down one level): The red center of mass is too close, so the traversal must go down one level.

Compute Force (long distance): The blue center of mass is far enough away.

Compute Force (skip subtree): Therefore, the entire subtree rooted in the blue cell can be skipped.

Pseudocode

bodyset = ...
foreach timestep do {
  bounding_box = new Bounding_Box();
  foreach Body b in bodyset { bounding_box.include(b); }
  octree = new Octree(bounding_box);
  foreach Body b in bodyset { octree.insert(b); }
  celllist = octree.cellsByLevel();
  foreach Cell c in celllist { c.summarize(); }
  foreach Body b in bodyset { b.computeForce(octree); }
  foreach Body b in bodyset { b.advance(); }
}

Complexity and Parallelism

bodyset = ...
foreach timestep do {                // O(n log n) + fully ordered, sequential
  bounding_box = new Bounding_Box();
  foreach Body b in bodyset {        // O(n), parallel reduction
    bounding_box.include(b);
  }
  octree = new Octree(bounding_box);
  foreach Body b in bodyset {        // O(n log n), top-down tree building
    octree.insert(b);
  }
  celllist = octree.cellsByLevel();
  foreach Cell c in celllist {       // O(n) + partially ordered bottom-up traversal
    c.summarize();
  }
  foreach Body b in bodyset {        // O(n log n), fully parallel
    b.computeForce(octree);
  }
  foreach Body b in bodyset {        // O(n), fully parallel
    b.advance();
  }
}

Outline: Introduction, Barnes Hut algorithm, CUDA implementation, Experimental results, Conclusions

Efficient GPU Code

- Coalesced main memory accesses
- Little thread divergence
- Enough threads per block
- Not too many registers per thread
- Not too much shared memory usage
- Enough (independent) blocks
- Little synchronization between blocks
- Little CPU/GPU data transfer
- Efficient use of shared memory

Main BH Implementation Challenges

- Uses an irregular tree-based data structure: load imbalance, little coalescing
- Complex recursive traversals: recursion not allowed, lots of thread divergence
- Memory-bound pointer-chasing operations: not enough computation to hide latency

Six GPU Kernels

Read initial data and transfer to GPU
for each timestep do {
  1. Compute bounding box around bodies
  2. Build hierarchical decomposition, i.e., octree
  3. Summarize body information in internal octree nodes
  4. Approximately sort bodies by spatial location (optional)
  5. Compute forces acting on each body with help of octree
  6. Update body positions and velocities
}
Transfer result from GPU and output

Global Optimizations

- Make the code iterative (recursion not supported)
- Keep data on the GPU between kernel calls
- Use array elements instead of heap nodes, with one aligned array per field for coalesced accesses
(Figure: objects on the heap vs. objects in an array vs. one array per field.)
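Below is a minimal sketch (assumed field and function names, not the author's code) of the "one aligned array per field" layout: nodes live in a structure-of-arrays pool indexed by integers instead of pointer-linked heap objects, so a warp reading posx[i..i+31] issues one coalesced access.

#include <cuda_runtime.h>

struct NodePool {                 // bodies occupy one end of the arrays, cells the other
    float *posx, *posy, *posz;    // one array per field
    float *mass;
    int   *child;                 // 8 child indices per cell, -1 means null
};

void allocPool(NodePool &p, int nbodies, int ncells)
{
    size_t nodes = (size_t)nbodies + (size_t)ncells;
    cudaMalloc((void**)&p.posx,  nodes * sizeof(float));
    cudaMalloc((void**)&p.posy,  nodes * sizeof(float));
    cudaMalloc((void**)&p.posz,  nodes * sizeof(float));
    cudaMalloc((void**)&p.mass,  nodes * sizeof(float));
    cudaMalloc((void**)&p.child, (size_t)ncells * 8 * sizeof(int));
}

Because the pool stays allocated on the device, all six kernels can operate on it directly and nothing needs to be copied back to the host between kernel calls.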

Global Optimizations (cont.)

- Maximize thread count (round down to warp size)
- Maximize resident block count (fill all SMs)
- Pass kernel parameters through constant memory
- Use a special allocation order
- Alias arrays (56 B/node)
- Use index arithmetic
- Persistent blocks & threads
- Unroll loops over children
(Figure: bodies b0..ba occupy fixed positions at the front of the array; cells c0..c5 are allocated in the opposite direction from the back.)

Kernel 1: Bounding Box (Regular)

(Figure: hierarchical reduction - warps reduce in shared memory, separated by barriers, then combine their results in main memory.)

Optimizations:
- Fully coalesced, fully cached, no bank conflicts
- Minimal divergence
- Built-in min and max
- 2 red/mem, 6 red/bar
- Bodies load balanced
- 512 × 3 threads per SM
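A minimal sketch of the per-block min/max reduction behind kernel 1 (one dimension shown); the kernel name, the THREADS1 block size, and writing one partial result per block are illustrative assumptions rather than the author's exact code.

#include <cfloat>

#define THREADS1 512

__global__ void boundingBoxKernel(const float* __restrict__ posx,
                                  float* __restrict__ minx,
                                  float* __restrict__ maxx, int n)
{
    __shared__ float smin[THREADS1], smax[THREADS1];
    float lo = FLT_MAX, hi = -FLT_MAX;

    // grid-stride loop: each thread folds many bodies before the tree reduction
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        lo = fminf(lo, posx[i]);
        hi = fmaxf(hi, posx[i]);
    }
    smin[threadIdx.x] = lo;
    smax[threadIdx.x] = hi;
    __syncthreads();

    // tree reduction in shared memory (no bank conflicts, minimal divergence)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            smin[threadIdx.x] = fminf(smin[threadIdx.x], smin[threadIdx.x + s]);
            smax[threadIdx.x] = fmaxf(smax[threadIdx.x], smax[threadIdx.x + s]);
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {                  // one partial bounding box per block
        minx[blockIdx.x] = smin[0];
        maxx[blockIdx.x] = smax[0];
    }
}

The real kernel handles all three dimensions and merges the per-block results into a single box; that final step is omitted here.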

Kernel 2: Build Octree (Irregular)

Top-down tree building.

Optimizations:
- Only lock leaf pointers
- Lock-free fast path
- Light-weight lock release on the slow path
- No re-traversal after a failed lock acquisition
- Combined memory fence per block
- Re-compute the position during traversal
- Separate init kernels for coalescing
- 512 × 3 threads per SM

Kernel 2: Build Octree (cont.) // initialize cell = find_insertion_point(body); // no locks, cache cell child = get_insertion_index(cell, body); if (child!= locked) { // skip atomic if already locked if (child == null) { // fast path (frequent) if (null == atomiccas(&cell[child], null, body)) { // lock-free insertion // move on to next body } } else { // slow path (first part) if (child == atomiccas(&cell[child], child, lock)) { // acquire lock // build subtree with new and existing body flag = true; } } } syncthreads(); // barrier if (threadidx == 0) threadfence(); // push data out if L1 cache syncthreads(); // barrier if (flag) { // slow path (second part) cell[child] = new_subtree; // insert subtree and release lock // move on to next body } An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm 31

Kernel 3: Summarize Subtrees (Irregular)

Bottom-up tree traversal. (Figure: cells are scanned in the reverse of the allocation direction, so children are visited before their parents.)

Optimizations:
- Scan order avoids deadlock
- Use mass as flag + fence; no locks, no atomics
- Use a wait-free first pass
- Cache the ready information
- Piggyback on the traversal: count bodies in subtrees
- No parent pointers
- 128 × 6 threads per SM
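A minimal sketch of the "use mass as flag + fence" idea: a cell's mass starts out negative and is written last, so a parent can tell whether a child has already been summarized without locks or atomics. The array layout, the names, and the polling loop are illustrative assumptions; the real kernel uses a wait-free first pass and retries unready cells later instead of spinning.

__device__ void summarizeCell(int c,
                              const int* child,        // 8 child indices per cell, -1 = null
                              volatile float* mass,    // negative value = not ready yet
                              volatile float* posx)    // x shown; y and z are analogous
{
    float m = 0.0f, px = 0.0f;
    for (int j = 0; j < 8; j++) {
        int ch = child[c * 8 + j];
        if (ch >= 0) {
            float cm;
            while ((cm = mass[ch]) < 0.0f) { }         // child not summarized yet: poll
            m += cm;
            px += cm * posx[ch];
        }
    }
    posx[c] = px / m;          // center of mass
    __threadfence();           // make the position visible to other blocks first
    mass[c] = m;               // writing the real mass marks this cell as ready
}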

Kernel 4: Sort Bodies (Irregular)

Top-down tree traversal.

Optimizations (similar to kernel 3):
- Scan order avoids deadlock
- Use a data field as flag; no locks, no atomics
- Use the counts from kernel 3
- Piggyback on the traversal: move nulls to the back
- Throttle flag-polling requests with an optional barrier
- 64 × 6 threads per SM

Kernel 5: Force Calculation (Irregular)

Multiple prefix traversals.

Optimizations:
- Group similar work together: uses sorting to minimize the size of the prefix union in each warp
- Early out (nulls are at the back)
- Traverse the whole union to avoid divergence (warp voting)
- Lane 0 controls the iteration stack for the entire warp (fits in shared memory)
- Avoid unneeded volatile accesses
- Cache tree-level-based data
- 256 × 5 threads per SM
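A minimal sketch of the warp-voting decision: the whole warp descends into a cell if any lane still considers it too close, so all 32 lanes take the same branch and the warp never diverges. The function and parameter names are illustrative assumptions, the opening criterion (including the accuracy parameter) is assumed to be folded into cellSizeSq, and __any_sync is the modern form of the warp vote used on the original hardware.

__device__ bool warpMustDescend(float dx, float dy, float dz, float cellSizeSq)
{
    float dsq = dx * dx + dy * dy + dz * dz;  // squared distance to the cell's center of mass
    // one vote per lane; the result is identical across the warp
    return __any_sync(0xffffffff, dsq < cellSizeSq);
}

If the vote fails, every lane approximates the entire subtree by the cell's center of mass; otherwise lane 0 pushes the cell's children onto the shared per-warp stack.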

Architectural Support

- Coalesced memory accesses & lockstep execution: all threads in a warp read the same tree node at the same time, so only one memory access per warp is needed instead of 32
- Warp-based execution: enables data sharing within a warp without synchronization
- RSQRTF instruction: quickly computes a good approximation of 1/sqrtf(x)
- Warp voting instructions: quickly perform reduction operations within a warp
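A minimal sketch of where rsqrtf() pays off in the force computation: one reciprocal-square-root approximation replaces a square root plus a division. The softening constant EPS and the helper's name are illustrative assumptions.

__device__ float3 addAcceleration(float3 acc, float dx, float dy, float dz,
                                  float nodeMass)
{
    const float EPS = 0.05f;
    float dsq  = dx * dx + dy * dy + dz * dz + EPS * EPS;
    float rinv = rsqrtf(dsq);                    // fast approximation of 1/sqrt(dsq)
    float s    = nodeMass * rinv * rinv * rinv;  // m / r^3
    acc.x += dx * s;  acc.y += dy * s;  acc.z += dz * s;
    return acc;
}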

Kernel 6: Advance Bodies (Regular)

Optimizations:
- Fully coalesced, no divergence
- Load balanced, 1024 × 1 threads per SM
- Straightforward streaming
(Figure: warps stream bodies from main memory and write the updated bodies back.)
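A minimal sketch of the streaming integration kernel; the simple velocity-then-position update, the dt parameter, and the kernel name are illustrative assumptions rather than the author's exact scheme (one dimension shown).

__global__ void advanceKernel(int n, float dt,
                              float* __restrict__ posx,
                              float* __restrict__ velx,
                              const float* __restrict__ accx)
{
    // one body per thread with contiguous indices: fully coalesced, no divergence
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        float v = velx[i] + accx[i] * dt;   // update velocity
        posx[i] += v * dt;                  // update position with the new velocity
        velx[i] = v;
    }
}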

Outline: Introduction, Barnes Hut algorithm, CUDA implementation, Experimental results, Conclusions

Evaluation Methodology

Implementations:
- CUDA: irregular Barnes Hut & regular O(n²) algorithm
- OpenMP: Barnes Hut algorithm (derived from the CUDA code)
- Pthreads: Barnes Hut algorithm (from the SPLASH-2 suite)

Systems and compilers:
- nvcc 4.0 (-O3 -arch=sm_20 -ftz=true); GeForce GTX 480, 1.4 GHz, 15 SMs, 32 cores per SM
- gcc 4.1.2 (-O3 -fopenmp -ffast-math); Xeon X5690, 3.46 GHz, 6 cores, 2 threads per core

Inputs and metric:
- 5k, 50k, 500k, and 5M star clusters (Plummer model)
- Best runtime of three experiments, excluding I/O

Nodes Touched per Activity (5M Input)

Kernel activities: K1 pair reduction, K2 tree insertion, K3 bottom-up step, K4 top-down step, K5 prefix traversal, K6 integration step.

Neighborhood size (nodes touched per activity):
            min       avg       max
kernel 1      1       2.0         2
kernel 2      2      13.2        22
kernel 3      2       4.1         9
kernel 4      2       4.1         9
kernel 5    818   4,117.0     6,315
kernel 6      1       1.0         1

Max tree depth is 22; cells have 3.1 children on average. The largest prefix is 6,315 nodes (under 0.1% of the 7.4M nodes), so the BH algorithm and the sorting that minimizes the size of each warp's prefix union work well together.

Available Amorphous Data Parallelism (figure): available parallelism [independent activities] per kernel iteration for the bounding box, tree building, summarization, sorting, force calculation, and integration phases, shown for 5,000, 50,000, 500,000, and 5,000,000 bodies.

Runtime Comparison

(Figure: runtime per timestep [ms] versus number of bodies (5k, 50k, 500k, 5M) for GPU CUDA, GPU O(n²), CPU OpenMP, and CPU Pthreads.)

- GPU BH inefficiency: the 5k input is too small for 5,760 to 23,040 threads
- BH vs. O(n²) algorithm: the regular O(n²) code is faster with fewer than about 15,000 bodies
- GPU vs. CPU (5M input): 21.1x faster than OpenMP, 23.2x faster than Pthreads

Kernel Performance for 5M Input

A $400 GPU delivers 228 GFlops/s on this irregular code:

              kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6  BarnesHut      O(n²)
GFlops/s          71.6       5.8       2.5       n/a     240.6      33.5      228.4      897.0
GB/s             142.9      26.8      10.6      12.8       8.0     133.9        8.8        2.8
runtime [ms]       0.4      44.6      28.0      14.2   1,641.2       2.2    1,730.6  557,421.5

The GPU chip is 2.7 to 23.5 times faster than the CPU chip (runtimes in ms):

Non-compliant fast single-precision version:
              kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6
X5690 CPU          5.5     185.7      75.8      52.1  38,540.3      16.4
GTX 480 GPU        0.4      44.6      28.0      14.2   1,641.2       2.2
CPU/GPU           13.1       4.2       2.7       3.7      23.5       7.3

IEEE 754-compliant double-precision version:
              kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6
X5690 CPU         10.3     193.1     101.0      51.6  47,706.4      33.1
GTX 480 GPU        0.8      46.7      31.0      14.2   7,714.6       4.2
CPU/GPU           12.7       4.1       3.3       3.6       6.2       7.9

GPU hardware is better suited for running BH than CPU hardware is, but it is more difficult and time consuming to program.

Kernel Speedups

Optimizations that are generally applicable (speedup by input size):
             avoid      rsqrtf    recalc.   thread    full multi-
             volatile   instr.    data      voting    threading
50,000        1.14x      1.43x     0.99x     2.04x      20.80x
500,000       1.19x      1.47x     1.32x     2.49x      27.99x
5,000,000     1.18x      1.46x     1.69x     2.47x      28.85x

Optimizations for irregular kernels (speedup by input size):
             throttling  wait-free  combined   sorting of  sync'ed
             barrier     pre-pass   mem fence  bodies      execution
50,000         0.97x       1.02x      1.54x      3.60x       6.23x
500,000        1.03x       1.21x      1.57x      6.28x       8.04x
5,000,000      1.04x       1.31x      1.50x      8.21x       8.60x

Outline: Introduction, Barnes Hut algorithm, CUDA implementation, Experimental results, Conclusions

Optimization Summary

- Minimize thread divergence: group similar work together, force synchronicity
- Reduce main memory accesses: share data within the warp, combine memory fences & traversals, re-compute data, avoid volatile accesses
- Implement the entire algorithm on and for the GPU: avoid data transfers & data structure inefficiencies, wait-free pre-pass, scan the entire prefix union

Optimization Summary (cont.)

- Exploit hardware features: fast synchronization and thread startup, special instructions, coalesced memory accesses, even lockstep execution
- Use light-weight locking and synchronization: minimize locks, reuse fields, use fence + store operations
- Maximize parallelism: parallelize every step within and across SMs

Conclusions

- Irregularity does not necessarily prevent high performance on GPUs
- The entire Barnes Hut algorithm was implemented on the GPU; it builds and traverses an unbalanced octree
- The GPU is 22.5 times (float) and 6.2 times (double) faster than a high-end hyperthreaded 6-core Xeon
- Code directly for the GPU, do not merely adjust CPU code: it requires different data and code structures and benefits from different algorithmic modifications

Acknowledgments

Collaborators:
- Ricardo Alves (Universidade do Minho, Portugal)
- Molly O'Neil (Texas State University-San Marcos)
- Keshav Pingali (University of Texas at Austin)

Hardware and funding: NVIDIA Corporation