Scalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009

1 Scalable Multi Agent Simulation on the GPU Avi Bleiweiss NVIDIA Corporation San Jose, 2009

2 Reasoning. Explicit: state machine, serial. Implicit: compute intensive, fits SIMT well; collision avoidance.

3 Motivation. Games re-designed: casual, bursty; scalable, real time. Dense environments. Joint framework: AI, physics simulation. [Diagram labels: Game, AI, CUDA(TM), PhysX(TM), Compute API, Graphics API, DirectX, Driver.]

4 Problem. Planner: efficient mesh to roadmap conversion; searches a global, optimal path from start to goal; locally, avoids collisions with static and dynamic objects. Simulator: visually compelling motion; economical memory footprint; a subset of compute units; linear scale with the number of characters.

5 Solution. Small, quality roadmap. Heterogeneous agents. Velocity Obstacles. GPU specific optimizations: spatial hash, nested parallel.

6 Outline. Algorithm. Implementation. Results. Integration. Takeaways.

7 Pipeline. Game level mesh input. Inline computed roadmap. Goals, roadmap decoupled. Discrete time simulation. Intuitive multi threading. [Diagram labels: Level, Roadmap Transform, A* Search, Collision Avoidance, Velocity, Goals, State.]

8 Roadmap Transform. Reachability roadmap (1): an existing C-free (collision-free) path is guaranteed to be in the roadmap. Predictable termination. Highly parallelizable: 3D grid operators. [Diagram labels: Level, Voxel, Distance, Medial Axis, Flood Fill, Connectivity, Roadmap.] (1) [Geraerts and Overmars 2005]

9 Visibility. Two sets of edges: visible roadmap node pairs; goals to unblocked nodes. Static obstacles outline. A* search: shortest path from goal to any node. [Diagram labels: node, goal, obstacle.]
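The "shortest path from goal to any node" step can be illustrated with a small sketch. It is an assumption-laden illustration, not the deck's implementation: it swaps in Dijkstra's algorithm (A* with a zero heuristic) because it yields distances to every node in one run, and the names `Edge` and `distancesToGoal` are hypothetical. The roadmap is assumed to be an adjacency list in which each undirected edge is stored in both directions.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Edge { int to; double length; };  // hypothetical roadmap edge record

// Shortest path length from every roadmap node to `goal`.
// Unreachable nodes keep an infinite distance.
std::vector<double> distancesToGoal(const std::vector<std::vector<Edge>>& roadmap, int goal) {
    assert(goal >= 0 && goal < static_cast<int>(roadmap.size()));
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<double> dist(roadmap.size(), inf);
    using Item = std::pair<double, int>;  // (distance so far, node)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> open;
    dist[goal] = 0.0;
    open.push({0.0, goal});
    while (!open.empty()) {
        const Item top = open.top();
        open.pop();
        if (top.first > dist[top.second]) continue;  // stale queue entry
        for (const Edge& e : roadmap[top.second]) {
            const double d = top.first + e.length;
            if (d < dist[e.to]) {
                dist[e.to] = d;
                open.push({d, e.to});
            }
        }
    }
    return dist;
}
```

Searching outward from the goal means a single run serves every agent that shares that goal, whichever roadmap node it currently sits on.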

10 Velocity Obstacles. Well defined, widely used. Avoidance velocity set (1). Reciprocal Velocity Obstacles (2): oscillation free motion. Agents moving in 2D plane. $RVO^A_B(v_B, v_A) = \{ v'_A \mid 2v'_A - v_A \in VO^A_B(v_B) \}$ [Diagram labels: agents A and B, positions $p_A$, $p_B$, velocities $v_A$, $v_B$, cone $VO^A_B(v_B)$.] (1) [Fiorini and Shiller 1998] (2) [Van Den Berg et al. 2008]
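A membership test makes the two sets concrete. The sketch below assumes disc-shaped agents in the plane and uses the published definitions: a velocity lies in the velocity obstacle of B if moving with it (relative to B) eventually touches the disc of radius r_A + r_B centred at p_B - p_A, and a velocity v' lies in the reciprocal velocity obstacle if 2v' - v_A lies in the velocity obstacle. The names `Vec2`, `timeToCollision`, `inVO` and `inRVO` are hypothetical and do not come from the deck.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

struct Vec2 { double x, y; };

static Vec2 sub(Vec2 a, Vec2 b) { return {a.x - b.x, a.y - b.y}; }
static double dot(Vec2 a, Vec2 b) { return a.x * b.x + a.y * b.y; }

// Time at which a point leaving the origin with relative velocity `rel`
// first touches a disc of radius `r` centred at `c`; infinity if it never does.
double timeToCollision(Vec2 rel, Vec2 c, double r) {
    assert(r > 0.0);
    const double inf = std::numeric_limits<double>::infinity();
    const double cc = dot(c, c) - r * r;
    if (cc <= 0.0) return 0.0;  // discs already overlap
    const double a = dot(rel, rel);
    if (a == 0.0) return inf;   // no relative motion
    const double b = dot(rel, c);
    const double disc = b * b - a * cc;
    if (b <= 0.0 || disc < 0.0) return inf;  // moving away, or passing by
    return (b - std::sqrt(disc)) / a;        // smaller root of a*t^2 - 2*b*t + cc = 0
}

// True if candidate velocity vCand of agent A lies in VO^A_B(vB).
bool inVO(Vec2 pA, double rA, Vec2 pB, double rB, Vec2 vB, Vec2 vCand) {
    return std::isfinite(timeToCollision(sub(vCand, vB), sub(pB, pA), rA + rB));
}

// True if vCand lies in RVO^A_B(vB, vA), i.e. 2*vCand - vA lies in VO^A_B(vB).
bool inRVO(Vec2 pA, double rA, Vec2 vA, Vec2 pB, double rB, Vec2 vB, Vec2 vCand) {
    const Vec2 reflected{2.0 * vCand.x - vA.x, 2.0 * vCand.y - vA.y};
    return inVO(pA, rA, pB, rB, vB, reflected);
}
```

A simulator would more likely keep the collision time itself rather than the boolean, so that candidate velocities can be ranked.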

11 Simulation. Simulator advances until all agents reached goal. Path realigned towards roadmap node or goal. Agent, velocity parallel. [Annotations on the pseudocode: nested, flat.]

    do
        hash
            construct hash table
        simulate
            compute preferred velocity
            compute proximity scope
            foreach velocity sample do
                foreach neighbor do
                    if OBSTACLE then VO
                    elseif AGENT then RVO
            resolve new velocity
        update
            update position, velocity
            resolve at-goal
    while not all-at-goal

12 Workflow. Deterministic resources: linear, pitched 3D. Roadmap static for multiple simulation steps. Dozen compute kernels. Split frame, multi GPU. [Diagram labels: CPU, Loop Control, Hash, Build table, Simulator, GPU, Physics (Graphics), Collision Avoidance, Frame Process.]

13 Challenges. Hiding memory latency. Divergent, irregular threads. Small agent count ( 32). Hash construction cost.

14 Distance Transform. Input: binary grid of $n^3$ cells. Squared Euclidean distance. Serial running time $O(n^3)$. Parallel linear time $O(n)$: slice, column, row passes; $n^2$ GPU threads per pass. $DT_f(p) = \min_q \left( (p - q)^2 + f(q) \right)$ [Felzenszwalb and Huttenlocher 1996]
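One pass of the transform works on a single grid line at a time, which is why each slice, column and row pass can hand one line to each thread. The following is a minimal serial sketch of that one-dimensional pass, using the lower-envelope-of-parabolas method from the cited authors. The function name `squaredDistance1D` and the constant `kFar` are assumptions for illustration. Input values must be finite: 0 marks a source cell and `kFar` marks every other cell.

```cpp
#include <cassert>
#include <limits>
#include <vector>

const double kFar = 1e20;  // finite stand-in for "no source in this cell"

// Returns d[q] = min over p of ((q - p)^2 + f[p]) for one grid line.
std::vector<double> squaredDistance1D(const std::vector<double>& f) {
    assert(!f.empty());
    const int n = static_cast<int>(f.size());
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<int> v(n, 0);           // positions of parabolas in the lower envelope
    std::vector<double> z(n + 1, inf);  // boundaries between envelope parabolas
    int k = 0;                          // index of the rightmost envelope parabola
    z[0] = -inf;
    for (int q = 1; q < n; ++q) {
        double s;
        for (;;) {
            const int p = v[k];
            // Intersection of the parabola rooted at q with the one rooted at p.
            s = ((f[q] + double(q) * q) - (f[p] + double(p) * p)) / (2.0 * (q - p));
            if (s <= z[k]) --k; else break;  // parabola p is hidden, drop it
        }
        ++k;
        v[k] = q;
        z[k] = s;
        z[k + 1] = inf;
    }
    std::vector<double> d(n);
    k = 0;
    for (int q = 0; q < n; ++q) {
        while (z[k + 1] < q) ++k;  // advance to the parabola covering q
        d[q] = double(q - v[k]) * (q - v[k]) + f[v[k]];
    }
    return d;
}
```

For a 3D grid the same routine would be applied along each axis in turn, feeding the output of one pass in as `f` for the next.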

15 Medial Axis Transform. Input: binary grid of $n^3$ cells. Serial running time $O(kn^3)$. $n^3$ GPU threads per pass: $O(k)$ time for CDT, $O(1)$ for qualifier, $O(1)$ for resolve. [Lee and Horng 1996] Chess Distance Transform: $MAT(i,j,k) = \min_{(x,y,z) \in N} \max(|i - x|, |j - y|, |k - z|)$. Qualify: $T[i,j,k] = \max \{ MAT(x,y,z) \}$ over the neighbors $i - 1 \le x \le i + 1$, $j - 1 \le y \le j + 1$, $k - 1 \le z \le k + 1$, $\lnot(x = i \land y = j \land z = k)$. Resolve: $T[i,j,k]$ is compared against $MAT(i,j,k)$.

16 Data Layout. Persistent resources: reside in global memory. Thread aligned data: better coalescing. Consistent access pattern: improves bandwidth. [Diagram "Variable Length Vector Access": element array $e_0, e_1, e_2, \dots, e_{n-1}$; vector array $v_0, v_1, v_2, \dots, v_{n-1}$; record types {float, int} and {offset, count}; threads $t_0, t_1, t_2, \dots, t_{n-1}$.]
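The variable length vector access in the diagram can be sketched as one flat element array plus a per-thread {offset, count} record. This is only an illustration of the idea under that reading of the diagram; the names `Element`, `Span`, `PackedVectors`, `pack` and `sumForThread` are hypothetical.

```cpp
#include <cassert>
#include <vector>

struct Element { float value; int id; };  // a {float, int} record
struct Span { int offset; int count; };   // a {offset, count} record

struct PackedVectors {
    std::vector<Element> elements;  // all lists stored back to back
    std::vector<Span> spans;        // spans[t] locates the list of thread t
};

// Flattens per-thread lists of differing length into one contiguous array.
PackedVectors pack(const std::vector<std::vector<Element>>& lists) {
    PackedVectors out;
    for (const std::vector<Element>& list : lists) {
        out.spans.push_back({static_cast<int>(out.elements.size()),
                             static_cast<int>(list.size())});
        out.elements.insert(out.elements.end(), list.begin(), list.end());
    }
    return out;
}

// What thread t would do: read its span, then walk its own slice.
float sumForThread(const PackedVectors& data, int t) {
    assert(t >= 0 && t < static_cast<int>(data.spans.size()));
    const Span s = data.spans[t];
    float total = 0.0f;
    for (int i = 0; i < s.count; ++i) total += data.elements[s.offset + i].value;
    return total;
}
```

Because every list lives in the same allocation, the buffer can persist across frames and only the spans need rewriting when list lengths change.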

17 K-Nearest Neighbor. Naïve, exhaustive search: $O(n^2)$ system running time. Spatial hash: 3D point to a 1D index; per frame table build; current agents position. $h(p) = \mathrm{determinant}(p, p_{ref})$ [Diagram labels: agent, sample.]
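To show how a spatial hash replaces the exhaustive search, here is a minimal sketch that maps a 3D point to a 1D cell index and rebuilds the table from current agent positions. It deliberately swaps in a plain uniform-grid index instead of the determinant-based hash named on the slide, whose exact form is not given here. `Point3`, `HashGrid`, `cellIndex`, `buildGrid` and `neighborsWithin` are hypothetical names, and the query assumes the cell edge is at least as long as the search radius.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point3 { float x, y, z; };

struct HashGrid {
    float cell;                             // cell edge length
    int nx, ny, nz;                         // cells per axis
    std::vector<std::vector<int>> buckets;  // agent indices per cell
};

// Cell coordinate along one axis, clamped to the grid.
static int axisCell(float coord, float cell, int n) {
    const int c = static_cast<int>(std::floor(coord / cell));
    return std::min(std::max(c, 0), n - 1);
}

// 3D point to 1D index.
int cellIndex(const HashGrid& g, const Point3& p) {
    return axisCell(p.x, g.cell, g.nx) +
           g.nx * (axisCell(p.y, g.cell, g.ny) + g.ny * axisCell(p.z, g.cell, g.nz));
}

// Per frame table build from the current agent positions.
HashGrid buildGrid(const std::vector<Point3>& agents, float cell, int nx, int ny, int nz) {
    assert(cell > 0.0f && nx > 0 && ny > 0 && nz > 0);
    HashGrid g{cell, nx, ny, nz,
               std::vector<std::vector<int>>(static_cast<std::size_t>(nx) * ny * nz)};
    for (int i = 0; i < static_cast<int>(agents.size()); ++i)
        g.buckets[cellIndex(g, agents[i])].push_back(i);
    return g;
}

// Agents within `radius` of agent `self`, found by scanning the 27 adjacent cells.
std::vector<int> neighborsWithin(const HashGrid& g, const std::vector<Point3>& agents,
                                 int self, float radius) {
    assert(radius > 0.0f && radius <= g.cell);
    const Point3& p = agents[self];
    const int cx = axisCell(p.x, g.cell, g.nx);
    const int cy = axisCell(p.y, g.cell, g.ny);
    const int cz = axisCell(p.z, g.cell, g.nz);
    std::vector<int> found;
    for (int z = std::max(cz - 1, 0); z <= std::min(cz + 1, g.nz - 1); ++z)
        for (int y = std::max(cy - 1, 0); y <= std::min(cy + 1, g.ny - 1); ++y)
            for (int x = std::max(cx - 1, 0); x <= std::min(cx + 1, g.nx - 1); ++x)
                for (int j : g.buckets[x + g.nx * (y + g.ny * z)]) {
                    const Point3& q = agents[j];
                    const float dx = q.x - p.x, dy = q.y - p.y, dz = q.z - p.z;
                    if (j != self && dx * dx + dy * dy + dz * dz <= radius * radius)
                        found.push_back(j);
                }
    std::sort(found.begin(), found.end());
    return found;
}
```

Each query touches only the agents in nearby cells, so the per-agent cost depends on local density rather than on the total agent count.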

18 Nested Parallel. Flat parallel limiting. Thread grid DAG: n grids, each of v velocity threads; independent grids; same kernel per level. Thread amplification: improved occupancy. [Diagram "Thread Grid DAG": grids for agent 0, agent 1, agent 2, ..., agent n-1; kernel levels simulate, candidate, update; edges marked data dependency.]

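19 Velocity Threads. Hundreds of threads. Graceful grid sync. Fine reduce-min: into shared memory; global atomic CAS, inter thread block.

    __global__ void candidate(CUAgent* agents, int index, CUNeighbor* neighbors)
    {
        float3 v; float t;
        // reference rather than a copy, so the chosen candidate is stored in the agent
        CUAgent& a = agents[index];
        // thread 0 evaluates the preferred velocity, every other thread a sampled one
        if(!getthreadid()) v = a.prefvelocity;
        else v = velocitysample(a);
        t = neighbor(a, agents, neighbors, v);
        float p = penalty(a, v, t);
        reduceminatomiccas(a, p);  // reduce-min of p into a.minpenalty
        // sync: all velocity threads wait here (grid-wide barrier, left as pseudocode on the slide)
        if(p == a.minpenalty) a.candidate = v;
    }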
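The reduce-min by compare-and-swap can be sketched on the host with `std::atomic<float>`. This is an analogy under stated assumptions, not the deck's kernel: the helper names `atomicMin` and `selectCandidate` are hypothetical, and the penalties are reduced in a serial loop where the GPU version would have one thread per velocity sample.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Lowers `target` to `value` if `value` is smaller, retrying the
// compare-and-swap until it succeeds or another writer has stored
// something at least as small.
void atomicMin(std::atomic<float>& target, float value) {
    float seen = target.load();
    while (value < seen && !target.compare_exchange_weak(seen, value)) {
        // A failed exchange reloads `seen`; the loop condition re-tests it.
    }
}

// Two-phase selection mirroring the kernel: reduce all penalties to the
// minimum, then let the sample whose penalty equals that minimum win.
int selectCandidate(const std::vector<float>& penalties) {
    assert(!penalties.empty());
    std::atomic<float> minPenalty(penalties[0]);
    for (float p : penalties) atomicMin(minPenalty, p);  // phase 1: reduce-min
    const float best = minPenalty.load();                // stands in for the sync point
    for (int i = 0; i < static_cast<int>(penalties.size()); ++i)
        if (penalties[i] == best) return i;              // phase 2: first match wins
    return -1;  // not reached: the minimum always matches some entry
}
```

The equality test in the second phase is exact because the stored minimum is a bit-for-bit copy of one of the inputs. When several samples tie, this sketch picks the lowest index, whereas concurrent threads would each write and the last writer would win.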

20 Methodology. Environment: Vista 32, CUDA 2.3, OpenMP. Simulation-only. Flat and nested parallel. Copy to/from device included. [Table, cell values not preserved. Columns: Property, NVIDIA GTX280, Intel i7-940, X7350. Rows: Core Clock (MHz), Memory Clock (MHz) *, Global Memory (MB), Compute Units, Threads/Compute Unit.]

21 Experiments I. [Table, cell values not preserved. Headers: Game Level, Vertices, Faces, Grid Resolution, GPU Threads (Distance, Medial Axis).]

22 Roadmap Transform

23 Experiments II. [Tables, cell values not preserved. Headers: Timestep, Proximity, Velocity Samples, Neighbors, Distance, Frames; Dataset, Segments, Roadmap Nodes, Agents, Compute Units. Dataset: Evacuation.]

24 Spatial Hash

25 Frame Rate

26 Experiments III. [Table, cell values not preserved. Headers: Dataset, Agents, Velocity Samples, Velocity Threads, Frames. Dataset labels: Simple Car Robots Circle.] Exceeds max GPU threads: 15360.

27 Nested Parallel

28 Interaction. AI, physics interop: compute context pair; one active at a time. Multi GPU formation: compute, graphics; single interop port. [Diagram "Simulation Framework", labels: AI (Collision Avoidance), Physics (Collision Response), Graphics; arrows labeled velocity and position.]

29 Mapping. [Diagram: GAI side: obstacles (static, moving), agents (simulate). PhysX side: dynamic actors, kinematic actors, static actors.]

30 Compute Interop. Shared actor buffers: shape, velocity; in global memory. AI controls simulation. Interleaved frame: implicit synchronization. [Diagram labels: frames n-1 and n; GAI, PhysX; Velocity Buffer, Shape Buffer.]

31 Limitations. Small, SW thread cache. Not fully 3D aware. Hash table build: single threaded. Thread load imbalance: mix of at-goal and non at-goal agents.

32 Performance. [Tables, some cell values not preserved. Columns: Parameter, NVIDIA GTX280, Intel CPU.] Roadmap Transform. Speedup: up to 14.5X. Running time (sec): values not preserved (residual labels: face level, threads). Collision Avoidance. Hash vs. naive: up to 4X; little to no effect (1). Nested vs. flat: up to 6.2X; not easy to program. Simulation speedup: up to 4.8X; X7350. Simulation FPS: values not preserved (10K agents; 16 threads) (1). (1) Van Den Berg et al., RVOLib 2008

33 Future Work. Fermi architecture scale. Full 3D collision avoidance. Parallel hash build. Improve hash sampling quality. Evaluate UNC's FVO. [Table. Columns: Property, GT200, Fermi. Threads / SM (32 regs): values not preserved. L1 Cache: None, 48KB. L2 Cache: None, 768KB.]

34 Summary. Dynamic roadmap, goals. Multi agent solution: compact, scalable. Fermi optimizations. Nested parallel potential. AI, physics integration.

35 Thank You!

36 Info White Paper: Video: CUDA: OpenCL: DirectCompute: Nexus:
