Multicore Strategies for Games
Prof. Aaron Lanterman
School of Electrical and Computer Engineering
Georgia Institute of Technology
Bad multithreading
[Diagram: five threads, Thread 1 through Thread 5]
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game
Good multithreading
[Diagram labels: Main Thread, Game Thread, Rendering Thread, Physics, Animation/Skinning, Particle Systems, Networking, File I/O]
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game
Another paradigm: cascades
- Thread 1: Input
- Thread 2: Physics
- Thread 3: AI
- Thread 4: Rendering
- Thread 5: Present
Advantages:
- Synchronization points are few and well-defined
Disadvantages:
- Increases latency (for a constant frame rate)
- Needs simple (one-way) data flow
- For balance, each chunk needs to take a similar amount of time
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game
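A minimal sketch of a three-stage cascade (input, physics, render), assuming C++17 and standard library threads rather than any Xbox 360 or XNA API. Each stage runs on its own thread and hands every frame to the next stage through a one-way queue. The `Channel` class, the `Frame` struct, and the arithmetic standing in for each stage's work are all hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal blocking queue: the one-way hand-off between two cascade stages.
template <typename T>
class Channel {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        ready_.notify_one();
    }
    // Blocks until an item arrives; returns std::nullopt once closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return std::nullopt;
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Hypothetical per-frame data flowing down the cascade.
struct Frame { int number; int state; };

// Runs `frames` frames through input -> physics -> render, one thread per stage.
// Returns the values the final stage "presented", in frame order.
std::vector<int> run_cascade(int frames) {
    Channel<Frame> toPhysics, toRender;
    std::vector<int> presented;   // touched only by the render thread until join

    std::thread input([&] {
        for (int i = 0; i < frames; ++i) toPhysics.push({i, i});  // sample "input"
        toPhysics.close();
    });
    std::thread physics([&] {
        while (auto f = toPhysics.pop()) {
            f->state *= 10;                                        // stand-in physics step
            toRender.push(*f);
        }
        toRender.close();
    });
    std::thread render([&] {
        while (auto f = toRender.pop()) presented.push_back(f->state + 1);
    });

    input.join();
    physics.join();
    render.join();
    return presented;
}
```

While the render stage handles frame N, physics can already be working on frame N+1 and input on frame N+2; that overlap is the source of the extra latency. The pipeline also runs only as fast as its slowest stage, which is why each chunk should take a similar amount of time.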
Typical task: file decompression
- Most common CPU-heavy thread on the Xbox 360
- Easy to multithread
- Allows use of aggressive compression to improve load times
- Don't throw a thread at a problem better solved by offline processing
  - Texture compression, file packing, etc.
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game
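A sketch of moving decompression onto a background thread while the main thread keeps running, assuming C++17. The toy run-length format in `rle_decode` is a hypothetical stand-in for whatever real codec a game would ship; `load_chunks` and its `on_main_thread` callback are likewise illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Toy run-length format standing in for a real codec (assumption):
// pairs of (count, byte), e.g. {3,'a',2,'b'} -> "aaabb".
std::string rle_decode(const std::vector<std::uint8_t>& packed) {
    std::string out;
    for (std::size_t i = 0; i + 1 < packed.size(); i += 2)
        out.append(static_cast<std::size_t>(packed[i]), static_cast<char>(packed[i + 1]));
    return out;
}

// Decompresses every chunk on one background thread while the caller keeps
// working; `on_main_thread` stands in for whatever the game does meanwhile.
template <typename Work>
std::vector<std::string> load_chunks(const std::vector<std::vector<std::uint8_t>>& chunks,
                                     Work on_main_thread) {
    std::vector<std::string> results(chunks.size());
    std::thread decompressor([&] {
        for (std::size_t i = 0; i < chunks.size(); ++i)
            results[i] = rle_decode(chunks[i]);   // each slot written by this thread only
    });
    on_main_thread();        // e.g. keep animating a loading screen
    decompressor.join();     // synchronize once, when the data is actually needed
    return results;
}
```

Each output slot is written by exactly one thread and read only after `join()`, so no lock is needed; the join is the single synchronization point.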
Typical task: rendering
- Separate update and render threads
- Rendering on multiple threads usually works poorly
  - The GPU can have trouble if multiple threads try to talk to it at once
  - (Xbox 360 command buffers are supposed to be OK)
- Special case of the cascades paradigm
  - Pass render state from update to render
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game
Separate rendering thread
[Diagram: Update Thread and Render Thread exchanging data through Buffer 0 and Buffer 1]
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game
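One way to implement the Buffer 0 / Buffer 1 hand-off, sketched in C++17 with a mutex and a condition variable. `RenderState`, `DoubleBuffer`, and `run_frames` are hypothetical names; a real render description would hold draw calls, transforms, and so on. The update thread fills buffer `frame % 2` while the render thread reads the other one, and the update thread blocks only if it gets two frames ahead.

```cpp
#include <array>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical render description: what the render thread needs to draw a frame.
struct RenderState { int frame = -1; float cameraX = 0.0f; };

class DoubleBuffer {
public:
    // Update thread: returns the buffer to fill for `frame`. Buffer frame%2 was
    // last used by frame-2, so wait until the render thread has finished it.
    RenderState& begin_write(int frame) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return rendered_ >= frame - 1; });
        return buffers_[frame % 2];
    }
    void end_write() {
        { std::lock_guard<std::mutex> lock(m_); ++published_; }
        cv_.notify_all();
    }
    // Render thread: waits until `frame` has been published, then reads it.
    const RenderState& begin_read(int frame) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return published_ > frame; });
        return buffers_[frame % 2];
    }
    void end_read() {
        { std::lock_guard<std::mutex> lock(m_); ++rendered_; }
        cv_.notify_all();
    }
private:
    std::array<RenderState, 2> buffers_{};
    std::mutex m_;
    std::condition_variable cv_;
    int published_ = 0;   // frames the update thread has finished writing
    int rendered_ = 0;    // frames the render thread has finished reading
};

// Runs `frames` frames with update and render on separate threads and returns
// the frame numbers the render thread saw, in order.
std::vector<int> run_frames(int frames) {
    DoubleBuffer db;
    std::vector<int> drawn;
    std::thread render([&] {
        for (int f = 0; f < frames; ++f) {
            const RenderState& s = db.begin_read(f);
            drawn.push_back(s.frame);        // "draw" using only the snapshot
            db.end_read();
        }
    });
    for (int f = 0; f < frames; ++f) {       // update runs on the calling thread
        RenderState& s = db.begin_write(f);
        s.frame = f;
        s.cameraX = 0.5f * f;
        db.end_write();
    }
    render.join();
    return drawn;
}
```

The buffers themselves are read and written outside the lock; the counters guarded by the mutex guarantee that the two threads never touch the same buffer at the same time.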
Typical task: graphics fluff
- Extra graphics that don't affect play
  - Procedurally generated animating cloud textures
  - Cloth simulations
  - Procedurally generated vegetation, etc.
  - Extra particles, better particle physics, etc.
- Can run at a lower frame rate
- Easy to synchronize
- One game had one thread manipulating cloth and another thread handling cloth shadows
- On single-core machines, can drop or simplify the fluff without affecting gameplay
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game
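A sketch of a fluff thread running at its own, slower rate, assuming C++17. The game thread never waits for it; it simply uses the most recently published result. `CloudTexture`, `LatestValue`, and `run_cloud_thread` are hypothetical, and filling the texels with a constant stands in for real procedural generation.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical "fluff" result, e.g. a procedurally animated cloud texture.
struct CloudTexture {
    int generation = 0;
    std::vector<float> texels;
};

// Holds the newest finished texture. The game thread never waits for the
// fluff thread; it just takes whatever was published most recently.
class LatestValue {
public:
    void publish(std::shared_ptr<const CloudTexture> t) {
        std::lock_guard<std::mutex> lock(m_);
        latest_ = std::move(t);              // brief lock: a pointer swap only
    }
    std::shared_ptr<const CloudTexture> get() const {
        std::lock_guard<std::mutex> lock(m_);
        return latest_;                      // null before the first publish
    }
private:
    mutable std::mutex m_;
    std::shared_ptr<const CloudTexture> latest_;
};

// Fluff thread body: regenerates the texture at its own, slower pace until
// asked to stop. `period` sets that pace.
void run_cloud_thread(LatestValue& out, std::atomic<bool>& stop,
                      std::chrono::milliseconds period) {
    for (int gen = 0; !stop.load(); ++gen) {
        auto tex = std::make_shared<CloudTexture>();
        tex->generation = gen;
        tex->texels.assign(64, 0.01f * gen);  // stand-in for real procedural work
        out.publish(std::move(tex));
        std::this_thread::sleep_for(period);
    }
}
```

Because the hand-off is a pointer swap under a short lock, synchronization stays simple. On a single-core machine the game could skip starting this thread and draw without the extra detail.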
Typical tasks: physics?
- Could cascade from update to physics to rendering
  - Makes use of three threads
  - May be too much latency
- Could run physics on many threads
  - Uses many threads while doing physics
  - May leave threads mostly idle elsewhere
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game
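A sketch of running one physics step on several threads by giving each thread a contiguous, non-overlapping slice of the bodies, assuming C++17 and a hypothetical 1-D `Body` type. For brevity it creates threads on every call; a real game would create its workers once at startup.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Body { float pos, vel; };   // hypothetical 1-D rigid body

// Semi-implicit Euler step for one slice of bodies. Slices do not overlap,
// so no locking is needed inside the step.
void integrate(std::vector<Body>& bodies, std::size_t begin, std::size_t end,
               float gravity, float dt) {
    for (std::size_t i = begin; i < end; ++i) {
        bodies[i].vel += gravity * dt;
        bodies[i].pos += bodies[i].vel * dt;
    }
}

// Splits one physics step across `threads` threads. The calling thread does
// the last slice itself, then joins the others: one sync point per step.
void physics_step(std::vector<Body>& bodies, unsigned threads, float gravity, float dt) {
    threads = std::max(1u, threads);
    const std::size_t n = bodies.size();
    const std::size_t chunk = (n + threads - 1) / threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t + 1 < threads; ++t) {
        const std::size_t b = std::min(n, t * chunk);
        const std::size_t e = std::min(n, b + chunk);
        workers.emplace_back(integrate, std::ref(bodies), b, e, gravity, dt);
    }
    integrate(bodies, std::min(n, (threads - 1) * chunk), n, gravity, dt);
    for (auto& w : workers) w.join();
}
```

Between physics steps these threads have nothing to do, which is the "mostly idle elsewhere" cost the slide mentions.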
Careful with simultaneous multithreading
- Not the same as doubling the number of cores
- Can give a small performance boost if the first thread is underutilizing execution resources because of dependency stalls
- Can cause a performance drop
  - Two threads may fight over the L1 cache
- Can avoid scheduler latency
  - Problem case: a thread is ready to run, but the OS waits for the current scheduling quantum to expire before running it
  - Hardware threads can wake up faster; this works well for a thread that mostly sleeps but needs to wake quickly on demand
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game
How many threads?
- No more than one CPU-intensive software thread per core
  - 3-6 on Xbox 360
  - 1-? on PC (1-4 for now; need to query)
- Too many busy threads add complexity and lower performance
  - Context switches are not free
- Can have many non-CPU-intensive threads
  - I/O threads that block, or intermittent tasks
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game
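A small helper that applies the one-busy-thread-per-core rule on a PC, where the count has to be queried at run time. It uses `std::thread::hardware_concurrency()`, which reports the number of hardware threads and may return 0 when the value is unknown; on machines with simultaneous multithreading that number counts hardware threads, not cores. Reserving one slot for the busy main game thread is an assumption of this sketch, and `busy_worker_count` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Number of extra CPU-intensive worker threads to start, given how many
// hardware threads the platform reports. One slot is left for the main game
// thread, which is busy too; 0 (unknown) or 1 means "do everything on main".
unsigned busy_worker_count(unsigned reported) {
    return reported > 1 ? reported - 1 : 0;
}

// Convenience overload that queries the machine it runs on.
unsigned busy_worker_count() {
    return busy_worker_count(std::thread::hardware_concurrency());
}
```

Threads that mostly block (file I/O, networking) do not count against this budget.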
Rare's Kameo
[Screenshots from www.rareware.com]
Case study: Kameo (1)
- Started out single-threaded
- Was going to be an original Xbox game, but the developers decided to make it an Xbox 360 launch title
- CPU usage split was 51/49 for update/render, so rendering was put on a separate thread
- Two render-description buffers were created to communicate from update to render
  - Linear read/write access for best cache usage
  - Doesn't copy const data
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game
Case study: Kameo (2)
- Decompression thread
  - Saved space on DVD and improved load times
  - Cost was some spare CPU cycles
- Actually two threads for file I/O
  - One for reading and one for decompressing, because some calls can block for ~0.5 s doing directory lookups
- Multithreading was added about six months before launch, but it worked!
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game
Case study: Kameo (3)
Core (usage)  Thread  Software threads
0 (80-99%)    0       Game update
              1       File I/O
1 (50%)       0       Rendering
              1       (none)
2 (80-99%)    0       XAudio
              1       File decompression
Total usage was ~2.2-2.5 cores
[Screenshot from www.rareware.com]
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game
Bizarre Creations' Project Gotham Racing 3
See http://media.xbox360.gamespy.com/media/741/741362/vids_1.html for movie clips
[Screenshot from projectgothamracing3.com/screenshots]
Case study: Project Gotham Racing 3
Core  Thread  Software threads
0     0       Update, physics, rendering, UI
      1       Audio update, networking
1     0       Crowd update, texture decompression
      1       Texture decompression
2     0       XAudio
      1       (none)
Total usage was ~2.0-3.0 cores
[Screenshot from projectgothamracing3.com/screenshots]
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game
Available synchronization objects
- Critical sections (locks)
- Semaphores (alas, not in XNA)
- Mutexes
Don't suspend threads
- Some games have used this for synchronization
- Can easily lead to deadlocks
- Interacts badly with the Visual Studio debugger
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game
Synchronization tips/costs
- Synchronization is moderately expensive when there is no contention
  - Hundreds to thousands of cycles
- Synchronization can be arbitrarily expensive when there is contention!
- Goals:
  - Synchronize rarely
  - Hold locks briefly
  - Minimize shared data
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game
Avoid effective single-threading
- Requiring exclusive access to a popular resource can turn multithreading into a complicated way of running single-threaded code on multiple threads
- Use synchronization primitives to guarantee that multiple threads won't modify a resource simultaneously, but design the program so that they generally wouldn't try to anyway
Notes from Bruce Dawson & Chuck Walbourn, Microsoft Game
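A sketch of the goals above (synchronize rarely, hold locks briefly, minimize shared data), assuming C++17. Each worker sums its own slice into a local variable and takes the shared lock exactly once to merge; locking inside the inner loop would instead make every thread queue up on one popular resource. `SharedTotal`, `sum_slice`, and `parallel_sum` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Shared result guarded by a lock (hypothetical: e.g. a per-frame total).
struct SharedTotal {
    std::mutex m;
    long long value = 0;
};

// The loop touches no shared data; the lock is taken once, briefly, to merge.
void sum_slice(const std::vector<int>& data, std::size_t begin, std::size_t end,
               SharedTotal& total) {
    long long local = 0;
    for (std::size_t i = begin; i < end; ++i) local += data[i];
    std::lock_guard<std::mutex> lock(total.m);   // one short critical section
    total.value += local;
}

long long parallel_sum(const std::vector<int>& data, unsigned threads) {
    if (threads == 0) threads = 1;
    SharedTotal total;
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + threads - 1) / threads;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t b = std::min(data.size(), t * chunk);
        const std::size_t e = std::min(data.size(), b + chunk);
        workers.emplace_back(sum_slice, std::cref(data), b, e, std::ref(total));
    }
    for (auto& w : workers) w.join();
    return total.value;
}
```

With four threads this performs four lock acquisitions in total, regardless of how many elements are summed.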
Beware hidden synchronization
- Memory allocation (e.g., malloc in C)
  - There are all sorts of ways to alleviate the problem
- File access
- Using D3DCREATE_MULTITHREADED if developing with unmanaged code
- False sharing: an artefact of cache structure
  - A performance issue, not a correctness issue
  - See Bruce Dawson, "Multicore Memory Coherence: The Hidden Perils of Sharing Data," PowerPoint presentation
Information from Bruce Dawson & Chuck Walbourn, Microsoft Game
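A sketch of avoiding false sharing for per-thread counters, assuming C++17. The 128-byte line size is an assumption chosen to be at least as large as the cache lines on the Xbox 360 and on typical x86 PCs (64 bytes); where the standard library provides it, `std::hardware_destructive_interference_size` from `<new>` is an alternative. `PaddedCounter` and `count_in_parallel` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size (see lead-in); on 64-byte-line machines it just
// wastes a little space.
constexpr std::size_t kCacheLine = 128;

// Without alignas, several counters could share one cache line, and each
// thread's writes would keep pulling that line away from the others (false
// sharing). Results would still be correct; only performance suffers.
struct alignas(kCacheLine) PaddedCounter {
    long long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

constexpr int kThreads = 4;

// Each thread increments only its own counter, so no lock is needed; the
// counters are summed after all threads have joined.
long long count_in_parallel(long long perThread) {
    std::array<PaddedCounter, kThreads> counters{};
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
        workers.emplace_back([&counters, t, perThread] {
            for (long long i = 0; i < perThread; ++i) ++counters[t].value;
        });
    for (auto& w : workers) w.join();
    long long total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```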
Things to avoid
- Threads terminating other threads
  - Can't do it on Xbox 360; discouraged on Windows
- Mutexes
  - Aren't as fast as critical section locks
Information from Bruce Dawson & Chuck Walbourn, Microsoft Game
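A sketch of the usual alternative to terminating a thread from outside: ask it to stop through an atomic flag, let it exit at a point where it holds no locks, then `join()` it. Assumes C++17; `Worker` is a hypothetical class and its loop body stands in for real work.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// A thread killed from outside may die holding a lock or halfway through an
// update. Here the worker checks a flag and returns on its own instead.
class Worker {
public:
    Worker() : thread_([this] { run(); }) {}
    ~Worker() { stop(); }
    void stop() {
        stopRequested_.store(true, std::memory_order_relaxed);
        if (thread_.joinable()) thread_.join();   // wait for a clean exit
    }
    long long iterations() const { return iterations_.load(); }
private:
    void run() {
        while (!stopRequested_.load(std::memory_order_relaxed)) {
            iterations_.fetch_add(1);                       // stand-in for real work
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        // Safe point: release resources, flush state, etc., then return.
    }
    std::atomic<bool> stopRequested_{false};
    std::atomic<long long> iterations_{0};
    std::thread thread_;   // declared last so the atomics exist before run() starts
};
```

`join()` both waits for the exit and makes everything the worker did visible to the caller, so the relaxed flag is sufficient here.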
Lockless programming
- Spin locks
- Write-release/read-acquire semantics
- Interlocked instructions
- Difficult to get right:
  - Very hard for native C++ Xbox 360 coding
  - .NET makes some of this easier
- Bruce Dawson, "Lockless Programming Considerations for Xbox 360 and Microsoft Windows," msdn2.microsoft.com/en-us/library/bb310595.aspx
Information from Bruce Dawson & Chuck Walbourn, Microsoft Game
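A sketch of write-release/read-acquire message passing using standard C++11 `std::atomic`, rather than the platform-specific barriers covered in Dawson's article. The producer writes an ordinary payload and then publishes it with a release store; a consumer that observes the flag through an acquire load is guaranteed to see the payload too. `Message`, `produce`, and `consume` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Message {
    int payload[4] = {};               // plain, non-atomic data
    std::atomic<bool> ready{false};    // the only shared synchronization variable
};

void produce(Message& m) {
    for (int i = 0; i < 4; ++i) m.payload[i] = (i + 1) * 100;  // ordinary writes
    m.ready.store(true, std::memory_order_release);             // write-release
}

int consume(const Message& m) {
    while (!m.ready.load(std::memory_order_acquire))            // read-acquire
        std::this_thread::yield();     // spin politely; real code might back off
    int sum = 0;
    for (int v : m.payload) sum += v;  // ordered after the acquire: sees all writes
    return sum;
}
```

With a plain `bool` flag this would be a data race, and on weakly ordered CPUs such as the Xbox 360's PowerPC cores the consumer could see the flag before the payload.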
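What about OpenMP?
```cpp
// Hypothetical wrapper; the float element type is an assumption.
void vec_add(int n, float *x, const float *y) {
    int i;
    #pragma omp parallel default(none) shared(n, x, y) private(i)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            x[i] += y[i];
    }
}
```
- Industry tends to shy away from OpenMP and similar solutions
  - Prefers more direct control
(Example from somewhere on the web; can't remember where)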
XNA-specific notes (1)
- GraphicsDevice is somewhat thread-safe
  - Cannot render from more than one thread at a time
  - Can create resources and call SetData while another thread renders
- ContentManager is not thread-safe
  - OK to have multiple instances, but only one per thread
- Input is not threadable
  - Windows games must read input on the main game thread
- Audio and networking are thread-safe
Slide from Shawn Hargreaves, "Understanding XNA Framework Performance"
XNA-specific notes (2)
- Catalin's suggestion: keep rendering on the main thread (Thread 1 on Xbox 360)
  - The Game class does some behind-the-scenes graphics stuff
- Great article: Catalin Zima, "Multi-threading for your XNA Game," http://www.ziggyware.com/readarticle.php?article_id=221
Common mistake
- Creating a new thread on every iteration of the game loop
- Creating and releasing threads has a lot of overhead
  - Especially if you are running in Visual Studio (i.e., in the debugger)
  - And especially if you are running on the Xbox 360 from Visual Studio
- Better to create the threads you need at the beginning
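A sketch of creating worker threads once and reusing them every frame, assuming C++17. `JobPool` is a hypothetical minimal job system, not an XNA or Xbox 360 API: the constructor starts the threads, `submit` queues work, and `wait_all` is the end-of-frame synchronization point.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class JobPool {
public:
    // Threads are created here, once, not inside the game loop.
    explicit JobPool(unsigned threads) {
        for (unsigned i = 0; i < threads; ++i)
            workers_.emplace_back([this] { work(); });
    }
    ~JobPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            shuttingDown_ = true;
        }
        jobReady_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push(std::move(job));
            ++pending_;
        }
        jobReady_.notify_one();
    }
    // Blocks until every submitted job has finished (end-of-frame sync point).
    void wait_all() {
        std::unique_lock<std::mutex> lock(m_);
        allDone_.wait(lock, [this] { return pending_ == 0; });
    }
private:
    void work() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                jobReady_.wait(lock, [this] { return shuttingDown_ || !jobs_.empty(); });
                if (jobs_.empty()) return;           // shutting down, queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();                                    // run outside the lock
            std::lock_guard<std::mutex> lock(m_);
            if (--pending_ == 0) allDone_.notify_all();
        }
    }
    std::mutex m_;
    std::condition_variable jobReady_, allDone_;
    std::queue<std::function<void()>> jobs_;
    std::size_t pending_ = 0;
    bool shuttingDown_ = false;
    std::vector<std::thread> workers_;   // last, so the members above exist first
};
```

Per frame, the game only pushes jobs and waits; the lock is held just long enough to push or pop a job. The pool must be built with at least one thread, or `wait_all` would never return.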
Take a step back
- Always ask: should I be doing this on the CPU at all?
- The GPU has ridiculous amounts of computing power
- Look for tasks with a high ratio of computation to CPU-GPU communication
- HLSL is HLSL whether you're using managed or unmanaged code on the CPU