Scalable Multi Agent Simulation on the GPU. Avi Bleiweiss, NVIDIA Corporation, San Jose, 2009

Reasoning
- Explicit: state machine, serial
- Implicit: compute intensive, fits SIMT well; collision avoidance

Motivation
- Games re-designed: casual, bursty
- Scalable, real time, dense environments
- Joint framework for AI and physics simulation
  (stack diagram: Game; AI on CUDA(TM), PhysX(TM); Compute API / Graphics API (DirectX); Driver)

Problem
- Planner: efficient mesh to roadmap conversion; searches a global, optimal path from start to goal; locally avoids collisions with static and dynamic objects
- Simulator: visually compelling motion; economical memory footprint; a subset of compute units; linear scale with # characters

Solution
- Small, quality roadmap
- Heterogeneous agents
- Velocity Obstacles
- GPU specific optimizations: spatial hash, nested parallel

Outline Algorithm Implementation Results Integration Takeaways

Pipeline
- Game level mesh input
- Inline computed roadmap
- Goals, roadmap decoupled
- Discrete time simulation
- Intuitive multi threading
  (pipeline diagram: Level -> Roadmap Transform -> A* Search -> Collision Avoidance, with Goals, Velocity, and State data)

Roadmap Transform
- Reachability roadmap [1]: an existing C-free path is guaranteed to be in the roadmap
- Predictable termination
- Highly parallelizable 3D grid operators
  (pipeline: Level -> Voxel -> Distance -> Medial Axis -> Flood Fill -> Connectivity -> Roadmap)
[1] Geraerts and Overmars 2005

Visibility
- Two sets of edges: visible roadmap node pairs; goals to unblocked nodes
- Static obstacles outline
- A* search: shortest path from goal to any node
  (figure: roadmap nodes, goal, obstacle)

Velocity Obstacles
- Well defined, widely used avoidance velocity set [1]
- Reciprocal Velocity Obstacles [2]: oscillation free motion
- Agents moving in a 2D plane
  (figure: agents A and B with positions pA, pB and velocities vA, vB)

RVO_{A|B}(vB, vA) = { v' | 2 v' - vA ∈ VO_{A|B}(vB) }

[1] Fiorini and Shiller 1998
[2] Van den Berg et al. 2008
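The RVO membership test reduces to 2D ray-versus-disc geometry: a candidate velocity v' is forbidden when 2 v' - vA, taken relative to vB, points into the disc of radius rA + rB centered at pB - pA. A minimal sketch follows; the helper names (inVO, inRVO) and types are illustrative, not from the talk.

__host__ __device__ bool inVO(float2 pRel, float radius, float2 vRel)
{
    // does the forward ray along vRel hit the disc (pRel, radius)?
    float a = vRel.x * vRel.x + vRel.y * vRel.y;      // |vRel|^2
    float b = pRel.x * vRel.x + pRel.y * vRel.y;      // pRel . vRel
    float c = pRel.x * pRel.x + pRel.y * pRel.y - radius * radius;
    if (c <= 0.0f) return true;                       // already overlapping
    if (b <= 0.0f) return false;                      // moving apart
    return b * b >= a * c;                            // closest approach within radius
}

__host__ __device__ bool inRVO(float2 pA, float2 pB, float rA, float rB,
                               float2 vA, float2 vB, float2 vCand)
{
    float2 pRel = make_float2(pB.x - pA.x, pB.y - pA.y);
    float2 vRel = make_float2(2.0f * vCand.x - vA.x - vB.x,   // (2 v' - vA) - vB
                              2.0f * vCand.y - vA.y - vB.y);
    return inVO(pRel, rA + rB, vRel);
}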

Simulation
- Simulator advances until all agents reached goal
- Path realigned towards roadmap node or goal
- Agent and velocity parallel

do
    hash: construct hash table                          (flat)
    simulate:
        compute preferred velocity
        compute proximity scope
        foreach velocity sample do                      (nested)
            foreach neighbor do
                if OBSTACLE then VO
                else if AGENT then RVO
        resolve new velocity
    update: update position, velocity; resolve at-goal  (flat)
while not all-at-goal

Workflow
- Deterministic resources: linear, pitched 3D
- Roadmap static for multiple simulation steps
- A dozen compute kernels
- Split frame, multi GPU
  (frame process diagram: CPU loop control and hash table build; GPU simulator, collision avoidance, physics (graphics))

Challenges
- Hiding memory latency
- Divergent, irregular threads
- Small agent count (< 32)
- Hash construction cost

Distance Transform
- Input: binary grid of n^3 cells
- Squared Euclidean distance
- Serial running time O(n^3)
- Parallel linear time O(n): slice, column, row passes; n^2 GPU threads per pass
  (worked example: 2D binary grid and its squared distances after each pass)

DT_f(p) = min_q ( (p - q)^2 + f(q) )

[Felzenszwalb and Huttenlocher 1996]
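A minimal sketch of one separable pass, assuming a row-major n^3 grid and an illustrative kernel name (dtRowPass is not the talk's kernel). One thread owns one row, matching the n^2 threads per pass above; for brevity it evaluates DT_f(p) = min_q((p - q)^2 + f(q)) by brute force, whereas the referenced lower-envelope algorithm makes each row linear time. The column and slice passes repeat the same 1D transform along the other two axes.

__global__ void dtRowPass(const float* in, float* out, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (row >= n * n) return;                          // n^2 rows per pass

    const float* f = in  + row * n;
    float*       d = out + row * n;
    for (int x = 0; x < n; ++x) {
        float best = f[x];
        for (int q = 0; q < n; ++q) {
            float dx = (float)(x - q);
            float v  = dx * dx + f[q];                 // (p - q)^2 + f(q)
            if (v < best) best = v;
        }
        d[x] = best;
    }
}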

Medial Axis Transform
- Input: binary grid of n^3 cells
- Serial running time O(k n^3)
- n^3 GPU threads per pass: O(k) time for CDT, O(1) for qualify, O(1) for resolve

Chess Distance Transform:  MAT(i, j, k) = min_{(x,y,z) ∈ N} max(|i - x|, |j - y|, |k - z|)
Qualify (over the 3x3x3 neighborhood i-1 <= x <= i+1, j-1 <= y <= j+1, k-1 <= z <= k+1, excluding (x, y, z) = (i, j, k)):
    T[i, j, k] = max { MAT(x, y, z) }
Resolve: keep (i, j, k) on the medial axis when MAT(i, j, k) >= T[i, j, k]

[Lee and Horng 1996]
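The qualify/resolve steps amount to a local-maximum test on the chess distance values. A minimal sketch, assuming the CDT values are already in mat and using an illustrative kernel name (matQualifyResolve is not the talk's kernel):

__global__ void matQualifyResolve(const int* mat, unsigned char* axis, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per voxel
    if (idx >= n * n * n) return;

    int i = idx % n;
    int j = (idx / n) % n;
    int k = idx / (n * n);

    int center = mat[idx];
    int best = 0;
    // Qualify: T[i,j,k] = max of MAT over the 3x3x3 neighborhood, center excluded
    for (int dk = -1; dk <= 1; ++dk)
        for (int dj = -1; dj <= 1; ++dj)
            for (int di = -1; di <= 1; ++di) {
                if (di == 0 && dj == 0 && dk == 0) continue;
                int x = i + di, y = j + dj, z = k + dk;
                if (x < 0 || x >= n || y < 0 || y >= n || z < 0 || z >= n) continue;
                int v = mat[(z * n + y) * n + x];
                if (v > best) best = v;
            }
    // Resolve: keep the voxel when MAT(i,j,k) >= T[i,j,k]
    axis[idx] = (center > 0 && center >= best) ? 1 : 0;
}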

Data Layout
- Persistent resources reside in global memory
- Thread aligned data: better coalescing
- Consistent access pattern improves bandwidth
  (figure, variable length vector access: elements e0 ... e_{n-1} of {float, int}, descriptors v0 ... v_{n-1} of {offset, count}, threads t0 ... t_{n-1})
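A minimal sketch of the {offset, count} layout, with illustrative names (VectorRef, visitEdges): variable-length per-node payloads are packed into one flat array, so consecutive threads read consecutive descriptors and walk their runs with a consistent, coalescing-friendly pattern.

struct VectorRef { int offset; int count; };           // one descriptor per node

__global__ void visitEdges(const VectorRef* refs, const float4* edges,
                           float4* out, int numNodes)
{
    int node = blockIdx.x * blockDim.x + threadIdx.x;
    if (node >= numNodes) return;

    VectorRef r = refs[node];                          // coalesced descriptor read
    float4 acc = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    for (int e = 0; e < r.count; ++e) {                // variable-length payload
        float4 v = edges[r.offset + e];
        acc.x += v.x; acc.y += v.y; acc.z += v.z; acc.w += v.w;
    }
    out[node] = acc;
}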

K-Nearest Neighbor
- Naïve, exhaustive search: O(n^2) system running time
- Spatial hash: 3D point to a 1D index, h(p) = determinant(p, p_ref)
- Per-frame table build from current agent positions
  (figure: agents and hash samples)
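The talk's hash is a determinant against a reference position; the sketch below substitutes a plainer uniform-grid hash only to illustrate the 3D-point-to-1D-index idea (spatialHash and its parameters are assumptions, not the talk's function). Agents bucketed by this index are rebuilt each frame, and the k-nearest query then visits only neighboring cells instead of all n agents.

__device__ int spatialHash(float3 p, float3 worldMin, float cellSize, int gridDim)
{
    // map a 3D position to its cell, then the cell to a 1D table index
    int cx = (int)((p.x - worldMin.x) / cellSize);
    int cy = (int)((p.y - worldMin.y) / cellSize);
    int cz = (int)((p.z - worldMin.z) / cellSize);
    cx = cx < 0 ? 0 : (cx >= gridDim ? gridDim - 1 : cx);   // clamp to the grid
    cy = cy < 0 ? 0 : (cy >= gridDim ? gridDim - 1 : cy);
    cz = cz < 0 ? 0 : (cz >= gridDim ? gridDim - 1 : cz);
    return (cz * gridDim + cy) * gridDim + cx;
}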

Nested Parallel
- Flat parallel is limiting
- n grids, each of v velocity threads
- Independent grids, same kernel per level
- Thread amplification, improved occupancy
  (thread grid DAG figure: simulate -> candidate grids for agent 0 ... agent n-1 -> update, with data dependencies)
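CUDA 2.3 has no device-side kernel launch, so one plausible reading of the grid-per-agent structure is a host loop that issues the same candidate kernel (shown on the next slide) once per agent on a pool of streams. The sketch below is an assumption about the launch shape, not the talk's code; the stream count and block size are illustrative.

void simulateNested(CUAgent* dAgents, CUNeighbor* dNeighbors,
                    int numAgents, int velocitySamples)
{
    const int kStreams = 8;                            // illustrative pool size
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    dim3 block(256);
    dim3 grid((velocitySamples + block.x - 1) / block.x);
    for (int a = 0; a < numAgents; ++a)                // one independent grid per agent
        candidate<<<grid, block, 0, streams[a % kStreams]>>>(dAgents, a, dNeighbors);

    cudaDeviceSynchronize();                           // join all per-agent grids
    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
}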

Velocity Threads
- Hundreds of threads
- Graceful grid sync
- Fine reduce-min into shared memory; global atomic CAS, inter thread block

__global__ void candidate(CUAgent* agents, int index, CUNeighbor* neighbors)
{
    CUAgent a = agents[index];
    float3 v;
    if (!getThreadId())
        v = a.prefVelocity;          // thread 0 tries the preferred velocity
    else
        v = velocitySample(a);       // other threads try sampled velocities
    float t = neighbor(a, agents, neighbors, v);
    float p = penalty(a, v, t);
    reduceMinAtomicCAS(a, p);        // grid-wide reduce-min (sync point)
    if (p == a.minPenalty)
        a.candidate = v;
}
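CUDA has no native atomicMin for float, so the global reduce-min referred to above is commonly folded in with an atomicCAS loop on the value's bit pattern. A minimal sketch with an illustrative name (atomicMinFloat), assuming the per-agent minimum penalty lives in global memory:

__device__ float atomicMinFloat(float* addr, float value)
{
    int* addrAsInt = (int*)addr;
    int  old = *addrAsInt, assumed;
    do {
        assumed = old;
        if (__int_as_float(assumed) <= value) break;   // stored value already smaller
        old = atomicCAS(addrAsInt, assumed, __float_as_int(value));
    } while (assumed != old);
    return __int_as_float(old);
}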

Methodology
- Environment: Windows Vista 32-bit, CUDA 2.3, OpenMP
- Simulation-only timing, flat and nested parallel; copy to/from device included

Property             | NVIDIA GTX280 | Intel i7-940 | Intel X7350
Core Clock (MHz)     | 601           | 2942         | 2930
Memory Clock (MHz)   | 1107          | 3 x 1066     | 1066
Global Memory (MB)   | 1024          | 3072         | 8192
Compute Units        | 30            | 8            | 4
Threads/Compute Unit | 512           | 1            | 4

Experiments I

Game Level Vertices | Game Level Faces | Grid Resolution | GPU Threads (Distance) | GPU Threads (Medial Axis)
82800               | 34750            | 33              | 1089                   | 35937
161463              | 64451            | 40              | 1600                   | 64000
347223              | 170173           | 55              | 3025                   | 166375

Roadmap Transform

Experiments II

Timestep | Proximity Neighbors | Proximity Distance | Velocity Samples | Frames
0.1      | 10                  | 15                 | 250              | 1200

Dataset    | Segments | Roadmap Nodes | Agents      | Compute Units
Evacuation | 211      | 429           | 500 - 20000 | 1 - 40

Spatial Hash

Frame Rate

Experiments III

Dataset | Agents | Velocity Samples | Velocity Threads | Frames
Simple  | 4      | 250              | 1000             | 607
Car     | 13     | 500              | 6500             | 228
Robots  | 24     | 400              | 9600             | 455
Circle  | 32     | 500              | 16000 [1]        | 596

[1] Exceeds max GPU threads: 15360

Nested Parallel

Interaction
- AI, physics interop: compute context pair, one active at a time
- Multi GPU formation: compute, graphics; single interop port
  (simulation framework diagram: AI collision avoidance -> velocity -> physics collision response -> position -> graphics)

Mapping
  (diagram: GAI obstacles (static, moving) and simulated agents mapped onto PhysX static, kinematic, and dynamic actors)

Compute Interop
- Shared actor buffers (shape, velocity) in global memory
- AI controls simulation
- Interleaved frame, implicit synchronization
  (diagram: GAI and PhysX exchange velocity and shape buffers across frames n-1, n)

Limitations
- Small, software thread cache
- Not fully 3D aware
- Hash table build is single threaded
- Thread load imbalance from a mix of at-goal and not-at-goal agents

Performance

Roadmap Transform (NVIDIA GTX280 vs. Intel i7-940, 8 threads)
- Running time: 0.7 sec @ 170K-face level
- Speedup: up to 14.5X

Collision Avoidance (NVIDIA GTX280 vs. Intel X7350, 16 threads)
- Hash vs. naive: up to 4X on the GPU; little to no effect on the CPU [1]
- Nested vs. flat: up to 6.2X on the GPU; not easy to program on the CPU
- Simulation speedup: up to 4.8X
- Simulation FPS: 18 @ 10K agents

[1] Van Den Berg et al., RVO Library 2008

Future Work
- Fermi architecture scale
- Full 3D collision avoidance
- Parallel hash build
- Improve hash sampling quality
- Evaluate UNC's FVO

Property               | GT200 | Fermi
Threads / SM (32 regs) | 512   | 1024
L1 Cache               | None  | 48KB
L2 Cache               | None  | 768KB

Summary
- Dynamic roadmap, goals
- Multi agent solution: compact, scalable
- Fermi optimizations: nested parallel potential
- AI, physics integration

Thank You!

Info
- White Paper: http://tinyurl.com/multiagentgpu-paper-2009
- Video: http://vimeo.com/7047078
- CUDA: http://www.nvidia.com/cuda
- OpenCL: http://www.nvidia.com/object/cuda_opencl.html
- DirectCompute: http://www.nvidia.com/object/directcompute.html
- Nexus: http://developer.nvidia.com/object/nexus.html