GAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES (TAKAHIRO HARADA, AMD)
DESTRUCTION FOR GAMES (ERWIN COUMANS, AMD)



2 GAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES
Jason Yang, Takahiro Harada (AMD)


3 HYBRID CPU-GPU ARCHITECTURE
- The CPU and GPU are good at different kinds of work
  - CPU: serial computation, conditional branches
  - GPU: parallel computation
- The CPU and GPU are:
  - On the same die
  - Much closer
  - Able to share data efficiently
- AMD Fusion: Ontario, Llano
  - Mobile: A8-3530MX, CPU (1.9GHz-2.6GHz), GPU (HD6620G, 5 SIMD engines)
  - Desktop: A8-3870K, CPU (3.1GHz), GPU (HD6550D)
- Able to dispatch work to:
  - CPU: serial work with varying granularity
  - GPU: parallel work with uniform granularity
Physics simulation on fusion architecture June, 2011

4 PHYSICS SIMULATION
- Particles
- Cloth
- Elastic bodies
- Rigid bodies
- Fluids

5 PARTICLE BASED SIMULATION ON THE GPU
- Large number of particles
- Particles of identical size
  - Work granularity is almost the same
  - Good for the wide SIMD architecture
- Harada et al.


6 GPU ARCHITECTURE
- AMD Radeon™ HD 6970: 2.… TFLOPS (single precision), 638GFLOPS (double precision), 176GB/sec
- Many cores: 24 SIMDs x 64-wide SIMD
  - CPU SSE: 4-wide SIMD
- The program of a work item is packed into VLIW instructions, then executed
[Figure: block diagrams of the AMD Radeon™ HD 6970 (SIMD units) and the AMD Phenom™ II X4 (cores)]

7 PARTICLE BASED SIMULATION
- Collision: $f_{\mathrm{collide}} = \sum_j f_{ij}$
- Integration: $\frac{dv}{dt} = \frac{f}{m}$, $\frac{dx}{dt} = v$
  - Discretized: $v \mathrel{+}= \frac{f}{m}\,\Delta t$, $x \mathrel{+}= v\,\Delta t$
- An acceleration structure is used for an efficient collide step
  - Uniform grid
  - Suited for the GPU: less divergence
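The integration step on this slide can be sketched in a few lines of C++. This is a minimal illustration, not the presenters' code: the `Particle` layout and the `integrate` name are assumptions, and the sketch assumes the velocity is updated before the position, following the order in which the slide lists the two updates.

```cpp
#include <cassert>
#include <vector>

// Hypothetical particle layout; the slides only name position, velocity and force.
struct Particle {
    float x[3];  // position
    float v[3];  // velocity
    float f[3];  // accumulated collision force
    float mass;
};

// One time step: v += (f / m) * dt, then x += v * dt with the updated velocity.
void integrate(std::vector<Particle>& particles, float dt) {
    for (Particle& p : particles) {
        assert(p.mass > 0.0f);
        for (int k = 0; k < 3; ++k) {
            p.v[k] += p.f[k] / p.mass * dt;
            p.x[k] += p.v[k] * dt;
            p.f[k] = 0.0f;  // clear the accumulator for the next collide step
        }
    }
}
```

Every particle runs exactly the same arithmetic, which is why this step maps well onto wide SIMD hardware.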

8 DIVERGENCE ON SIMD
    void Kernel()
    {
        // Work items in one SIMD group that take different branches
        // force the group to execute each taken branch in turn.
        if (a) FuncA();
        else if (b) FuncB();
        else FuncC();
    }
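The cost of the branch above can be made concrete with a small CPU-side model. The sketch below is an illustration only (the `Branch` enum and `serializedBranchCount` are hypothetical names): it counts how many branch bodies one SIMD group (a wavefront on AMD GPUs) has to execute, given the branch each lane wants to take.

```cpp
#include <cassert>
#include <vector>

enum class Branch { A = 0, B = 1, C = 2 };

// Number of branch bodies a SIMD group executes one after another.
// Every branch chosen by at least one lane is run for the whole group,
// with the lanes that did not choose it masked off.
int serializedBranchCount(const std::vector<Branch>& lanes) {
    assert(!lanes.empty());
    bool taken[3] = {false, false, false};
    for (Branch b : lanes) taken[static_cast<int>(b)] = true;
    return int(taken[0]) + int(taken[1]) + int(taken[2]);
}
```

A group whose lanes all agree pays for one branch; a group whose lanes split across `FuncA`, `FuncB` and `FuncC` pays for all three.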

9 PARTICLE BASED SIMULATION ON THE GPU
- Particle collision using a uniform grid
    void Kernel()
    {
        prepare();
        collide(cell0);
        collide(cell1);
        collide(cell2);
        collide(cell3);
        collide(cell4);
        collide(cell5);
        collide(cell6);
        collide(cell7);
        collide(cell8);
    }
[Figure: grid cells Cell0 to Cell8]
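A CPU-side sketch of the same idea follows, reduced to 2D so that a particle's own cell plus its neighbours make 9 cells, as on the slide. Everything here is an assumption made for illustration: the `UniformGrid` type, the dense per-cell index lists, and the choice of a cell size equal to the particle diameter. The sketch only counts contacts; a real collide step would accumulate forces instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };

// Hypothetical dense 2D grid. With cellSize equal to the particle diameter,
// all contacts of a particle lie in its own cell or the 8 surrounding cells.
struct UniformGrid {
    int width, height;
    float cellSize;
    std::vector<std::vector<int>> cells;  // particle indices stored per cell

    UniformGrid(int w, int h, float s)
        : width(w), height(h), cellSize(s), cells(static_cast<std::size_t>(w) * h) {}

    int coord(float p) const { return static_cast<int>(std::floor(p / cellSize)); }

    void build(const std::vector<Vec2>& pos) {
        for (auto& c : cells) c.clear();
        for (int i = 0; i < static_cast<int>(pos.size()); ++i) {
            int cx = coord(pos[i].x), cy = coord(pos[i].y);
            assert(cx >= 0 && cx < width && cy >= 0 && cy < height);
            cells[cy * width + cx].push_back(i);
        }
    }
};

// Visits the 9 cells around particle i in a fixed order and counts overlaps.
int countContacts(const UniformGrid& g, const std::vector<Vec2>& pos, int i, float diameter) {
    int cx = g.coord(pos[i].x), cy = g.coord(pos[i].y);
    int contacts = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = cx + dx, ny = cy + dy;
            if (nx < 0 || ny < 0 || nx >= g.width || ny >= g.height) continue;
            for (int j : g.cells[ny * g.width + nx]) {
                if (j == i) continue;
                float ddx = pos[j].x - pos[i].x, ddy = pos[j].y - pos[i].y;
                if (ddx * ddx + ddy * ddy < diameter * diameter) ++contacts;
            }
        }
    }
    return contacts;
}
```

Because every particle walks the same fixed sequence of cells, neighbouring work items follow nearly the same control flow, which is the "less divergence" property the slides attribute to the uniform grid.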

10 MIXED PARTICLE SIMULATION
- Not only small particles
- Difficult for GPUs
  - Large particles interact with small particles
  - Large-large collision

11 CHALLENGE
- Non-uniform work granularity
  - Small-small (SS) collision: GPU
  - Large-small (LS) collision: CPU
  - Large-large (LL) collision: CPU
- With a discrete CPU + GPU, data transfer is needed

12 FUSION ARCHITECTURE
- The CPU and GPU are:
  - On the same die
  - Much closer
  - Able to share data efficiently
- The CPU and GPU are good at different kinds of work
  - CPU: serial computation, conditional branches
  - GPU: parallel computation
- Able to dispatch work to:
  - CPU: serial work with varying granularity
  - GPU: parallel work with uniform granularity

13 MIXED PARTICLE SIMULATION
- Benefits from the Fusion architecture
  - Different kinds of work in one simulation
  - The CPU and GPU work together
  - They share data

14 TWO SIMULATIONS
- Small particles
  - Data: Position, Velocity, Force, Grid
  - Steps: Build Acc. Structure, SS, S Integration
- Large particles
  - Data: Position, Velocity, Force
  - Steps: Build Acc. Structure, LL, L Integration
- LS: collision between the two sets of particles

15 CLASSIFY BY WORK GRANULARITY
- Small particles (Position, Velocity, Force, Grid)
  - Uniform work: Build Acc. Structure, SS, S Integration
- Large particles (Position, Velocity, Force)
  - Non-uniform work: Build Acc. Structure, LL, LS, L Integration

16 CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR
- Small particles (Position, Velocity, Force, Grid)
  - GPU: Build Acc. Structure, SS, S Integration
- Large particles (Position, Velocity, Force)
  - CPU: Build Acc. Structure, LL, LS, L Integration

17 DATA SHARING
- Small particles (Position, Velocity, Force, Grid)
  - GPU: Build Acc. Structure, SS, S Integration
- Large particles (Position, Velocity, Force)
  - CPU: Build Acc. Structure, LL, LS, L Integration
- The grid and the small particle data (Position, Velocity, Force) have to be shared with the CPU for LS collision
  - They are allocated as zero-copy buffers

18 SYNCHRONIZATION
- Small particles (Position, Velocity, Force, Grid)
  - GPU: Build Acc. Structure, SS, S Integration
- Large particles (Position, Velocity, Force)
  - CPU: Build Acc. Structure, LL, LS, L Integration
- The grid and the small particle data have to be shared with the CPU for LS collision
  - They are allocated as zero-copy buffers
[Figure: the pipeline diagram with two synchronization points marked]

19 VISUALIZING WORKLOADS
- GPU, small particles (Position, Velocity, Force, Grid): SS, S Integration, Build Acc. Structure
- CPU, large particles (Position, Velocity, Force): LL, LS, L Integration
- Grid construction can be moved to the end of the pipeline
- The workload is unbalanced
[Figure: GPU and CPU timelines with the synchronization point marked]

20 LOAD BALANCING
- GPU, small particles (Position, Velocity, Force, Grid): SS, S Integration, Build Acc. Structure
- CPU, large particles (Position, Velocity, Force): LS, LL, L Integration
- To get better load balancing:
  - The sync exists to pass the force buffer filled by the CPU to the GPU
  - LL collision does not have to finish before that handoff, so move it after the sync
[Figure: GPU and CPU timelines with the synchronization point marked]

21-22 [Figure legend: CPU Work, GPU Work]

23 UTILIZING 4 CPU CORES

24 FURTHER OPTIMIZATION
1. Not optimized for Llano, which has a 4-core CPU
   - Only 2 CPU cores were used
   - 2 more cores can be used for LS collision
2. LL collision was not optimized
   - The CPU waits while the GPU is constructing the grid
   - Use the CPU to improve SS collision
[Figure: timeline of GPU, CPU0, CPU1 and CPU2 showing SS, LS, LL, Build Acc. Structure, L Integ., S Integ. and the synchronization point]

25 OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
- The work cannot simply be split by large particle index
  - More than 1 large particle can collide with the same small particle
  - Writes would have to lock the memory, which is inefficient
- Prepare a local buffer for each thread
  - Each buffer stores the force on small particles
  - Lock free
  - The local buffers are merged into one
[Figure: large particles L0, L1 and small particles S0, S1; local buffers of Thread0, Thread1, Thread2]
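The per-thread buffer scheme can be sketched with `std::thread`. This is a minimal illustration under stated assumptions, not the presenters' implementation: the `Contact` record and `accumulateForces` are hypothetical, forces are reduced to one scalar per small particle, and the large-small contacts are assumed to be already detected.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical record of one large-small contact: the force that some
// large particle applies to the small particle with index smallIndex.
struct Contact {
    int smallIndex;
    float force;
};

// Accumulates forces on small particles without locks. Each thread writes
// only to its own buffer; the buffers are summed after all threads joined.
std::vector<float> accumulateForces(const std::vector<Contact>& contacts,
                                    std::size_t numSmall, unsigned numThreads) {
    assert(numThreads > 0);
    std::vector<std::vector<float>> local(numThreads, std::vector<float>(numSmall, 0.0f));
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            // Strided split of the contact list. Two threads may handle the
            // same small particle, but they never share a buffer.
            for (std::size_t c = t; c < contacts.size(); c += numThreads)
                local[t][contacts[c].smallIndex] += contacts[c].force;
        });
    }
    for (std::thread& w : workers) w.join();

    // Merge step: reduce the per-thread buffers into a single force buffer.
    std::vector<float> merged(numSmall, 0.0f);
    for (const std::vector<float>& buf : local)
        for (std::size_t i = 0; i < numSmall; ++i) merged[i] += buf[i];
    return merged;
}
```

The trade-off is memory and a merge pass in exchange for contention-free writes: each thread needs a buffer as large as the small-particle force array.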

26 OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
[Figure: timeline. GPU row: SS, Build Acc. Structure. CPU0 row: LS, LL. CPU1 and CPU2 rows: empty. Also marked: Synchronization, L Integ., S Integ.]

27 OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION
[Figure: timeline. GPU row: SS, Build Acc. Structure. CPU0 row: LS, LL. CPU1 row: LS. CPU2 row: LS. Also marked: Merge (x3), Synchronization (x2), L Integ., S Integ.]

28-29 OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
- A spatially coherent memory layout improves cache utilization
- As particles move, spatial locality decreases

30-31 SPATIAL SORT
- Sort particles by spatial location to improve cache utilization
- Z curve
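A common way to obtain a Z-curve ordering is to sort by Morton code, which interleaves the bits of the integer cell coordinates. The sketch below assumes 10 bits per axis and 3D grid coordinates; `GridCoord`, `expandBits`, `mortonCode` and `sortAlongZCurve` are illustrative names, and the slides do not say how the presenters computed their keys.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Spreads the lower 10 bits of v so that two zero bits separate each bit.
std::uint32_t expandBits(std::uint32_t v) {
    v &= 0x000003ffu;
    v = (v | (v << 16)) & 0xff0000ffu;
    v = (v | (v << 8)) & 0x0300f00fu;
    v = (v | (v << 4)) & 0x030c30c3u;
    v = (v | (v << 2)) & 0x09249249u;
    return v;
}

// 30-bit Morton code: bits of x, y and z interleaved (x in the lowest bit).
std::uint32_t mortonCode(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    assert(x < 1024u && y < 1024u && z < 1024u);
    return expandBits(x) | (expandBits(y) << 1) | (expandBits(z) << 2);
}

// Hypothetical integer grid coordinate of a particle.
struct GridCoord {
    std::uint32_t x, y, z;
};

// Orders coordinates along the Z curve so that spatial neighbours tend to
// end up close together in memory.
void sortAlongZCurve(std::vector<GridCoord>& cells) {
    std::sort(cells.begin(), cells.end(), [](const GridCoord& a, const GridCoord& b) {
        return mortonCode(a.x, a.y, a.z) < mortonCode(b.x, b.y, b.z);
    });
}
```

In a simulation the same permutation would be applied to the position, velocity and force arrays, using the cell each particle currently occupies as its key.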

32 CHOOSE SORT
- Requirements
  - A full sort was over the budget
  - A full sort is not a must: sorting is an optional computation for performance improvement
  - Incremental sort
  - Use multiple threads
- Solution
  - Used a generalized odd-even transition sort (the base algorithm is commonly called odd-even transposition sort)

33 BLOCK TRANSITION SORT
- Generalized odd-even transition sort
  - Instead of sorting 2 adjacent elements, sort 2 adjacent blocks
  - Iterate until convergence
- Use one thread to sort 2 adjacent blocks
  - 6 blocks for 3 threads
[Figure labels: Odd-even transition sort, Radix sort, Block transition sort]
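The block variant can be sketched as follows. This is a sequential illustration under assumptions: `sortBlockPairs` and `blockTransitionSort` are hypothetical names, plain integer keys stand in for the particles' spatial keys, and `std::sort` is used inside each pair of blocks where the slide's figure mentions a radix sort. The block pairs within one phase do not overlap, so each pair could be handed to its own thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One phase: sorts every pair of adjacent blocks, starting at block
// firstBlock (0 for the even phase, 1 for the odd phase).
// Returns true if any pair was out of order.
bool sortBlockPairs(std::vector<int>& keys, std::size_t blockSize, std::size_t firstBlock) {
    bool changed = false;
    for (std::size_t begin = firstBlock * blockSize; begin + blockSize < keys.size();
         begin += 2 * blockSize) {
        std::size_t end = std::min(begin + 2 * blockSize, keys.size());
        if (!std::is_sorted(keys.begin() + begin, keys.begin() + end)) {
            std::sort(keys.begin() + begin, keys.begin() + end);
            changed = true;
        }
    }
    return changed;
}

// Runs at most maxPasses passes (one even plus one odd phase each) and stops
// early once a pass changes nothing. Returns the number of passes executed.
std::size_t blockTransitionSort(std::vector<int>& keys, std::size_t blockSize,
                                std::size_t maxPasses) {
    assert(blockSize > 0);
    if (keys.size() <= blockSize) {  // a single block has no neighbour to pair with
        std::sort(keys.begin(), keys.end());
        return 0;
    }
    std::size_t passes = 0;
    while (passes < maxPasses) {
        bool even = sortBlockPairs(keys, blockSize, 0);
        bool odd = sortBlockPairs(keys, blockSize, 1);
        ++passes;
        if (!even && !odd) break;
    }
    return passes;
}
```

Calling it with a small `maxPasses` each frame gives the incremental behaviour the requirements ask for: the array is not fully sorted after one pass, but it moves closer to sorted order at a bounded cost.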

34 OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
[Figure: timeline. GPU row: SS, Build Acc. Structure. CPU0 row: LS, LL. CPU1 row: LS. CPU2 row: LS. Also marked: Merge (x3), Synchronization (x2), L Integ., S Integ.]

35 OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION
[Figure: timeline. GPU row: SS, Build Acc. Structure. CPU0 row: LS, S Sorting. CPU1 row: LS, S Sorting. CPU2 row: LS, S Sorting. Also marked: Merge (x3), Synchronization (x3), LL Coll., L Integ., S Integ.]

36 DATA DEPENDENCIES
[Figure: dependency graph with the nodes L Build Acc., S Build Acc., S Broad phase, S Sort (x3), LS Collide (x3), Merge (x3), LL Collide, SS Collide, LL Integrate, S Integrate, and three nodes labelled M]

37-38 DEMO
[Figure legend: CPU Work, GPU Work]

39 CONCLUSIONS
- Realized a simulation that handles variable-sized particles by leveraging the best features of both the CPU and GPU on AMD's Fusion architecture
  - The CPU is used for work with non-identical compute granularity
  - The GPU is used for highly parallel work
- Memory sharing between the CPU and GPU is the key to efficiency
  - It avoids wasteful memory copies

40 REFERENCE
- Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi, Smoothed Particle Hydrodynamics on GPUs, Proc. of Computer Graphics International, 63-70 (2007)
- Justin Hensley, Takahiro Harada, Chapter X, OpenCL Case Study: Mixed Particle Simulation, Heterogeneous Computing with OpenCL, Morgan Kaufmann (2011)
