Using Tasking to Scale Game Engine Systems

Similar documents
Intel Desktop Board DZ68DB

Intel Desktop Board D945GCLF2

LED Manager for Intel NUC

Intel vpro Technology Virtual Seminar 2010

Intel Cache Acceleration Software for Windows* Workstation

Intel Desktop Board D946GZAB

Intel Desktop Board DG41CN

Intel Desktop Board D975XBX2

Intel Desktop Board DG31PR

Intel Stereo 3D SDK Developer s Guide. Alpha Release

Intel Desktop Board DG41RQ

Intel Desktop Board D945GCCR

Sample for OpenCL* and DirectX* Video Acceleration Surface Sharing

How to Create a.cibd File from Mentor Xpedition for HLDRC

Intel Desktop Board DH61SA

Software Evaluation Guide for ImTOO* YouTube* to ipod* Converter Downloading YouTube videos to your ipod

How to Create a.cibd/.cce File from Mentor Xpedition for HLDRC

Intel Desktop Board DP55SB

Intel Atom Processor E3800 Product Family Development Kit Based on Intel Intelligent System Extended (ISX) Form Factor Reference Design

Intel Desktop Board D945GCLF

Software Occlusion Culling

Intel Desktop Board DH61CR

INTEL PERCEPTUAL COMPUTING SDK. How To Use the Privacy Notification Tool

Intel Setup and Configuration Service. (Lightweight)

Intel Desktop Board DP67DE

Intel Desktop Board DH55TC

Evolving Small Cells. Udayan Mukherjee Senior Principal Engineer and Director (Wireless Infrastructure)

Intel G31/P31 Express Chipset

Migration Guide: Numonyx StrataFlash Embedded Memory (P30) to Numonyx StrataFlash Embedded Memory (P33)

Intel vpro Technology Virtual Seminar 2010

Software Evaluation Guide for WinZip 15.5*

Intel IXP400 Digital Signal Processing (DSP) Software: Priority Setting for 10 ms Real Time Task

Intel 82580EB/82580DB GbE Controller Feature Software Support. LAN Access Division (LAD)

Boot Agent Application Notes for BIOS Engineers

Intel Desktop Board D845PT Specification Update

Intel 945(GM/GME)/915(GM/GME)/ 855(GM/GME)/852(GM/GME) Chipsets VGA Port Always Enabled Hardware Workaround

OpenCL* and Microsoft DirectX* Video Acceleration Surface Sharing

Intel Desktop Board D102GGC2 Specification Update

Installation Guide and Release Notes

Intel Desktop Board DQ57TM

Intel Desktop Board DQ35JO

Intel Desktop Board DP45SG

Drive Recovery Panel

Intel X48 Express Chipset Memory Controller Hub (MCH)

Software Evaluation Guide for Photodex* ProShow Gold* 3.2

Software Evaluation Guide for CyberLink MediaEspresso *

Intel 848P Chipset. Specification Update. Intel 82848P Memory Controller Hub (MCH) August 2003

Bitonic Sorting Intel OpenCL SDK Sample Documentation

Intel Platform Administration Technology Quick Start Guide

2013 Intel Corporation

Software Evaluation Guide for Sony Vegas Pro 8.0b* Blu-ray Disc Image Creation Burning HD video to Blu-ray Disc

Intel RealSense Depth Module D400 Series Software Calibration Tool

Case Study: Optimizing King of Soldier* with Intel Graphics Performance Analyzers on Intel HD Graphics 4000

Bitonic Sorting. Intel SDK for OpenCL* Applications Sample Documentation. Copyright Intel Corporation. All Rights Reserved

Intel Education Theft Deterrent Release Note WW16'14. August 2014

Intel USB 3.0 extensible Host Controller Driver

Introduction. How it works

Software Evaluation Guide Adobe Premiere Pro CS3 SEG

Intel Desktop Board D915GUX Specification Update

Intel Desktop Board D915GEV Specification Update

Intel Desktop Board D945GSEJT

Collecting OpenCL*-related Metrics with Intel Graphics Performance Analyzers

Upgrading Intel Server Board Set SE8500HW4 to Support Intel Xeon Processors 7000 Sequence

Theory and Practice of the Low-Power SATA Spec DevSleep

Intel Platform Controller Hub EG20T

Intel Extreme Memory Profile (Intel XMP) DDR3 Technology

Software Evaluation Guide for WinZip* esources-performance-documents.html

Intel Cache Acceleration Software - Workstation

TLBs, Paging-Structure Caches, and Their Invalidation

Enabling DDR2 16-Bit Mode on Intel IXP43X Product Line of Network Processors

Intel Setup and Configuration Service Lite

Innovating and Integrating for Communications and Storage

Intel Parallel Studio XE 2011 for Windows* Installation Guide and Release Notes

Intel Thread Checker 3.1 for Windows* Release Notes

This user manual describes the VHDL behavioral model for NANDxxxxxBxx flash memory devices. Organization of the VHDL model delivery package

Intel and Badaboom Video File Transcoding

IEEE1588 Frequently Asked Questions (FAQs)

Installation Guide and Release Notes

Intel SDK for OpenCL* - Sample for OpenCL* and Intel Media SDK Interoperability

Open FCoE for ESX*-based Intel Ethernet Server X520 Family Adapters

OpenCL* Device Fission for CPU Performance

Data Center Energy Efficiency Using Intel Intelligent Power Node Manager and Intel Data Center Manager

Intel Open Source HD Graphics, Intel Iris Graphics, and Intel Iris Pro Graphics

Intel RealSense D400 Series Calibration Tools and API Release Notes

Non-Volatile Memory Cache Enhancements: Turbo-Charging Client Platform Performance

Efficient scaling in a Task-Based Game Engine

Intel IXDPG465 Reference Platform Bootloader LSP Release Notes Version 1.0 May 2006

Forging a Future in Memory: New Technologies, New Markets, New Applications. Ed Doller Chief Technology Officer

Intel 852GME/852PM Chipset Graphics and Memory Controller Hub (GMCH)

Intel & Lustre: LUG Micah Bhakti

Optimizing the operations with sparse matrices on Intel architecture

SELINUX SUPPORT IN HFI1 AND PSM2

Intel Core 4 DX11 Extensions Getting Kick Ass Visual Quality out of the Latest Intel GPUs

Intel Atom Processor D2000 Series and N2000 Series Embedded Application Power Guideline Addendum January 2012

Intel Core TM Processor i C Embedded Application Power Guideline Addendum

MICHAL MROZEK ZBIGNIEW ZDANOWICZ

Intel Atom Processor E6xx Series Embedded Application Power Guideline Addendum January 2012

Intel Platform Controller Hub EG20T

Customizing an Android* OS with Intel Build Tool Suite for Android* v1.1 Process Guide

How to Configure Intel X520 Ethernet Server Adapter Based Virtual Functions on SuSE*Enterprise Linux Server* using Xen*

Transcription:

Using Tasking to Scale Game Engine Systems Yannis Minadakis March 2011 Intel Corporation

2

Introduction Desktop gaming systems with 6 cores and 12 hardware threads have been on the market for some time now, and 4-core CPUs are becoming increasingly common even on laptops. To give your customers the best experience on their particular platform, we want to write game engine code that is core count independent. Tasking allows allows programs to benefit from scaling as core counts are increased, giving our users a great game experience consistent with their choice of hardware. In this sample, we will convert a single-threaded animation system into a tasking animation system. Tasking Terms Task: A task is an independent unit of work in a system. A task is implemented as a callback function. In this example we are tasking the animation system. Each task animates a model in the scene. Taskset: The taskset is the scheduling primitive for the application. To schedule work the application creates a taskset, specifying the task function and any dependency information. Dependency graph: The tasks inside a taskset execute asynchronously. Tasksets execute only when their dependencies have been satisfied. Scheduler: The scheduler is internal to the Tasking API. The scheduler is responsible for creating and managing all the worker threads and executing the tasks. The sample uses Intel Threading Building Blocks as its scheduler. Figure 1 illustrates a dependency graph of tasksets. Below the dependency graph is a representation of a 4-core system. The scheduler executes each task in the set as soon as all the dependencies have been satisfied. 3

Figure 1 Visualization of a dependency graph and its execution by the scheduler. Making a Good Task The task function takes four parameters (see Listing 1): pvinfo: A pointer to the data that is global to the taskset. This data could be thought of as a constant buffer. The actual data is not owned by the Tasking API, so the application is responsible for making sure that the data is valid while the taskset is executing. icontext: An index that ranges from zero to the number of threads created by the scheduler. This value allows lock-free access to data that is not thread-safe. The Tasking API guarantees only one executing task can have a particular context ID. The canonical usage for the context ID is the D3D11DeviceContext. The application can setup an array of D3D11DeviceContext objects. A task that is rendering into a command list can use the context ID to select a D3D11DeviceContext object from the array without taking any locks. utaskid: The task ID indicates which task is executing in the task function. This index is used to select the work of the task from a list. In the sample, it is used to select the model to animate. utaskcount: The task count is the total number of tasks that are scheduled to execute in the taskset. This value can be used to map the task ID to a set of objects to process. Void AnimateModel( VOID* pvinfo, INT icontext, UINT utaskid, UINT utaskcount ); Listing 1. Task function signature To make a good task, we want to leverage the architecture of the CPU and allow the scheduler to spread the work to the cores as efficiently as possible. Follow the set of heuristics below for optimal task execution. Task running time should be a fraction of the total taskset time. This allows the scheduler to optimally partition work across the CPU cores. Scheduling a task takes constant time, so it is important to balance the number of tasks to maximize efficient scaling while minimizing scheduling time. For simplicity, the sample task animates a model, so this heuristic is only achieved when there are 20 or more models in the scene. Using the task ID and task count to vary the amount of data processed per task is the best way to find the right balance. Target the working set of the task to 1/4 of the L2 cache size. Many CPUs feature Intel Hyper-Threading Technology, so there will be two worker threads per physical core. Targeting 1/4 of the cache per task should allow for optimal cache use. Using the task ID and task count to vary the amount of data processed per task is the best way to find the right balance. Use the context ID instead of locking and interlocked instructions. Locking inside a task will block that worker thread severely reducing task execution performance. This includes allocating memory which generally requires locking. Even interlocked operations can cause thrashing in the cache. If the taskset needs locking, consider creating two tasksets (see below) 4

Avoid drain-out time by overlapping frames. Drain-out occurs when a taskset is about to complete and there are no tasksets in the scheduler that are ready to run. (see below) Using Dependencies to Avoid Interlocked Operations Let us consider the example of computing the average luminance of an image using tasking. The first approach would have each task process a set of scan lines and use interlocked exchange add to track the sum. The task will likely be bound on the time it takes to lock the cache line which contains the sum to increment. A better approach would be to create an array of sums. Each task would use the context ID to add the current sum without an interlocked operation. A second taskset containing one task could then be used to sum the array and compute the final average. The second taskset would depend on the first. Avoiding Drain-out Figure 2 shows drain-out time (circled in red) which results from executing the dependency graph. This time cannot be utilized since there are no tasksets in the scheduler that are ready to run. Assuming the dependency graph represents the work to be done in frame, we can execute two frames on the CPU at once. The drain-out time from the dependency graph of frame n will be filled in the work from frame n -1 (see Figure 3). Figure 2 Dependency graph executing with drain-out highlighted 5

Figure 2 Overlapped dependency graph execution Using Tasking in the Sample The animation sample defines its task to animate one model (see listing 2). To animate the scene we create a taskset where the number of tasks created are equal to the number of models to animate. The task uses the task ID (umodel) to identify which model from the list to animate in this task. The global data pointer (pvinfo) contains the animation time for this frame. void AnimateModel( VOID* pvinfo, INT icontext, UINT umodel, UINT utaskcount ) { D3DXMATRIXA16 midentity; PerFrameAnimationInfo* pinfo = (PerFrameAnimationInfo*)pvInfo; D3DXMatrixIdentity( &midentity ); gmodels[ umodel ].Mesh.TransformMesh( &midentity, pinfo->dtime + gmodels[ umodel ].dtimeoffset ); for( UINT umesh = 0; umesh < gmodels[ umodel ].Mesh.GetNumMeshes(); ++umesh ) { for( UINT umat = 0; umat < gmodels[ umodel ].Mesh.GetNumInfluences( umesh ); ++umat ) { const D3DXMATRIX *pmat; pmat = gmodels[ umodel ].Mesh.GetMeshInfluenceMatrix( umesh, umat ); D3DXMatrixTranspose( 6

} } } &gmodels[ umodel ].AnimatedBones[ umesh ][ umat ], pmat ); Listing 2. The AnimateModel task To execute the animation taskset we create the taskset in the OnFrameMove function with the global data and the number of models currently displayed. In Listing 3, the taskset handle (ghanimateset) is later used in OnD3D11FrameRender() in WaitForSet to yield execution time from the main thread to process any remaining animation tasks. When WaitForSet returns, the animation taskset has completed and its data is available to the main thread to submit to D3D. OnFrameMove() {... gtaskmgr.createtaskset( AnimateModel, &ganimationinfo, gumodels, NULL, 0, "Animate Models", &ghanimateset );... } OnD3D11FrameRender() {... gtaskmgr.waitforset( ghanimateset );... } Listing 3. Creating the animation taskset Figure 4 illustrates the sample running. Use the model count slider to change the model count. Toggle the "Enable Tasking" button to move from single-threaded to multi-threaded. Note that even when animating one model tasking is faster than executing the animation on the main thread. Toggle 7

"Force CPU Bound" to draw only the first triangle in the model to see the effects of tasking on a heavily CPU bound scenario. Figure 4 Animation sample executing on ATI Radeon* HD 5870 graphics Figure 5 shows the performance cost of increasing the number of animating models in the sample. The data was gathered on an Intel Core i7 processor-based system with ATI Radeon* HD 5870 graphics. Note that as we increase the number of animating objects we scale efficiently without specializing the code to the core count. 8

Figure 5 Performance data from animation sample. Time is in milliseconds per frame. Lower values mean higher performance. References [1] GDC 2011 presentation on tasking. [2] Intel Threading Building Blocks http://threadingbuildingblocks.org/ 9

10

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. This white paper, as well as the software described in it, is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this document is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or any software that may be provided in association with this document. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details. The Intel processor/chipset families may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Copies of documents, which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site. Intel and the Intel Logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright 2011, Intel Corporation. All rights reserved. 11