Designing a Modern GPU Interface. Brooke Hodgman (@BrookeHodgman) http://tiny.cc/gpuinterface
How to make a wrapper for D3D9/11/12, GL2/3/4, GL ES2/3, Metal, Mantle, Vulkan, GNM & GCM without going (completely) insane
Agenda. GPU Interface: wrapper around the native GPU APIs for every platform. Pipeline State management. Resource management. Shader program management (e.g. Microsoft .fx or Nvidia .cgfx). Q&A.
Where does this fit in? Shading Pipeline: deferred/forward shading, post-processing, order of passes, high level techniques. Scene Manager: spatial partitioning, camera management, object culling. Game Engine Specific Drawable Types: generic models, particle systems, animated meshes, instanced meshes. Today! GPU Interface: lowest level portable rendering API. Just a GPU abstraction.
Goals. Flexibility: can do anything that the native APIs let us do. No cutting out features. Productivity: much simpler to use than the native APIs. Less code, and less mental tax. Performance: similar CPU frame-time to hand-written native code. Simplicity: keep the interface as small as possible.
Dog food test. 22 (PC, PS4, Xbox One). Don Bradman Cricket (port) (PS4, Xbox One). Rugby League Live 3 (Steam, PS4, PS3, Xbox One, Xbox 360).
The GPU pipeline 2005-2015 SM2 Draw: Input Assembler Vertex Shader Rasterizer Pixel Shader Output Merger Resources / Memory (Textures, Buffers, Samplers)
The GPU pipeline 2005-2015 + Vertex texture fetch SM3 Draw: Input Assembler Vertex Shader Rasterizer Pixel Shader Output Merger Resources / Memory (Textures, Buffers, Samplers)
The GPU pipeline 2005-2015 + Geometry Shader & Stream Out stage + Compute shaders SM4 Draw: Input Assembler Vertex Shader Geometry Shader Rasterizer Stream Out Pixel Shader Output Merger Resources / Memory (Textures, Buffers, Samplers) Dispatch: Compute Shader
The GPU pipeline 2005-2015 + Read-Write resources (UAVs) at pixel shader + Tessellation stages SM5 Draw: Input Assembler Vertex Shader Hull Shader Tessellator Domain Shader Geometry Shader Rasterizer Stream Out Pixel Shader Output Merger Resources / Memory (Textures, Buffers, Samplers) Dispatch: Compute Shader
The GPU pipeline 2005-2015 + Read-Write resources at every stage SM5+ Draw: Input Assembler Vertex Shader Hull Shader Tessellator Domain Shader Geometry Shader Rasterizer Stream Out Pixel Shader Output Merger Resources / Memory (Textures, Buffers, Samplers) Dispatch: Compute Shader
The GPU pipeline 2005-2015 Most common features Draw: Input Assembler Vertex Shader Rasterizer Stream Out Pixel Shader Output Merger Resources / Memory (Textures, Buffers, Samplers) Dispatch: Compute Shader
The GPU pipeline 2005-2015 Most common features, API view API states: Input Layout Programs Raster Depth / Stencil Blend Draw Command Input Assembler Vertex Shader Rasterizer Stream Out Pixel Shader Output Merger Resource bindings: Buffer Buffer Buffer / Texture / Sampler Depth Texture Colour Texture
The GPU pipeline 2005-2015 Most common features, API view Programs Dispatch Command Compute Shader Buffer / Texture
Stateless Rendering
Native APIs are state machines. Draw(3, TRIANGLES): behaviour depends on the current state. [Diagram: Input Assembler, Vertex Shader, Rasterizer, Stream Out, Pixel Shader, Output Merger; every API state and every resource binding is still unknown (?).]
Native APIs are state machines. Plug some resources in: BindTexture( t ), BindVertexBuffer( v ), BindRenderTarget( r ). [Diagram: same pipeline; the resource bindings v, t and r are now known, all API states are still unknown (?).]
Native APIs are state machines. Configure some fixed-function bits: SetBlend( OPAQUE ). Plug some procedures in: SetShaderProgram( s ). [Diagram: same pipeline; programs = s and blend = OPAQUE are now known alongside v, t and r; the input layout, raster and depth/stencil states are still unknown (?).]
Native APIs are state machines. SetInputLayout( l ), SetRaster( SOLID ), SetDepthTest( DISABLED ), then Draw(3, TRIANGLES). [Diagram: same pipeline; every state is now known: input layout = l, programs = s, raster = SOLID, depth test = DISABLED, blend = OPAQUE, with resources v, t and r bound.]
State machine issues (and features). Objects can specify that they don't care about a state (by not setting it). "Don't care" states can be inherited from the calling logic.
    SetBlend( Translucent )
    House.Draw()
    Tree.Draw()
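State machine issues and features. But this system of inherited state can be very fragile to code modifications. If House::Draw starts setting its own blend state, everything drawn after it silently inherits that state too:
    void House::Draw(){ SetBlend( Opaque ) Draw( 3, TRIANGLES ) }
    SetBlend( Translucent )
    House.Draw()
    Tree.Draw()   // uh oh! now drawn with Opaque blending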
State machine issues and features. It can also lead to inefficiencies as your graphics programmers become pessimistic.
    void Tree::Draw(){ SetBlend( Opaque )... }
    void House::Draw(){ SetBlend( Opaque )... }
Stateless Alternative. Simplify the API: remove the entire state machine concept! Less mental tax: no worrying about leaky states. Retain the flexibility of "don't care" states, but remove the fragility that it has in state-machine APIs.
Draw Items. Bundle all native API state and all resource bindings together into a Draw Item. Missing / "don't care" states are always filled in by some form of default value. [Diagram: a Draw Item holds Pipeline state (Input Layout, Raster, Depth / Stencil, Blend, Programs), Resources (Buffer, Texture, Sampler) and Primitives (Draw Command).]
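Draw Items. Behaviour depends only on the contents of the draw item.
    draw_item = CreateDrawItem(... )
    Submit( draw_item )
[Diagram: the draw item holds every stage's state: IA layout, VS code, solid raster, PS code, "less" depth test and opaque blend for the OM, the VB, Tex, Depth and Colour resources, and a 3-triangle draw command.]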
Leaky states. Now impossible to get state leakage. Every draw is completely independent and immune to code modifications in other drawing systems.
    Submit( House.GetDrawItem() )
    Submit( Tree.GetDrawItem() )
State Groups. Container for Pipeline States and Resource Bindings. Plain-old-data, generated by a writer object.
    StateGroupWriter sgw
    sgw.begin()
    sgw.bindtexture( t )
    sgw.bindvertexbuffer( v )
    sgw.setblend( Opaque )
    sgw.setshaderprogram( s )
    StateGroup* sg = sgw.end()
[Diagram: the resulting state group contains Blend, Buffer, Programs and Texture.]
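A minimal sketch of how such a writer could produce a plain-old-data state group. The `setMask` field, the enum names, the method capitalisation and the 8/16-bit ID widths are all assumptions made for illustration; the talk does not show the real layout.

```cpp
#include <cstdint>

// Hypothetical state kinds; each gets one bit in StateGroup::setMask.
enum StateBit : uint32_t {
    kBlend        = 1u << 0,
    kProgram      = 1u << 1,
    kTexture      = 1u << 2,
    kVertexBuffer = 1u << 3,
};

// Plain-old-data: small integer IDs plus a mask of which members were written.
struct StateGroup {
    uint32_t setMask;         // a clear bit means "don't care"
    uint16_t programId;
    uint16_t textureId;
    uint16_t vertexBufferId;
    uint8_t  blendId;
};

class StateGroupWriter {
public:
    void Begin() { group_ = StateGroup{}; }  // zero everything: nothing is set yet
    void SetBlend(uint8_t id)          { group_.blendId = id;        group_.setMask |= kBlend; }
    void SetShaderProgram(uint16_t id) { group_.programId = id;      group_.setMask |= kProgram; }
    void BindTexture(uint16_t id)      { group_.textureId = id;      group_.setMask |= kTexture; }
    void BindVertexBuffer(uint16_t id) { group_.vertexBufferId = id; group_.setMask |= kVertexBuffer; }
    StateGroup End() const { return group_; }  // returned by value in this sketch
private:
    StateGroup group_{};
};
```

Recording which members were written is what lets a later merge step tell "set to 0" apart from "don't care".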
State Group Stacks. Allow different systems to contribute pipeline-states and resource bindings.
    StateGroup* mesh =...       (Input Layout, Buffer)
    StateGroup* material =...   (Blend, Programs, Raster, Texture)
    StateGroup* stack[] = {material, mesh}
State Overrides. Stack ordering dictates priority for overrides. Placing a state-group at the front of the array causes its values to be chosen in any state conflicts.
    StateGroup* mesh =...       (Input Layout, Buffer)
    StateGroup* material =...   (Blend, Programs, Raster, Texture)
    StateGroup* override =...   (Blend)
    StateGroup* stack[] = {override, material, mesh}
State Overrides. Stack ordering dictates priority for overrides. Placing a state-group at the front of the array causes its values to be chosen in any state conflicts.
    StateGroup* stack[] = {override, material, mesh}
[Diagram: override (Blend) sits in front of material (Blend, Programs, Raster, Texture) and mesh (Input Layout, Buffer), so the Blend value is taken from override.]
State Defaults. Stack ordering dictates priority for overrides. Placing a state-group at the back of the array causes its values to only be chosen as a fall-back.
    StateGroup* mesh =...
    StateGroup* material =...
    StateGroup* defaults =...
    StateGroup* stack[] = {material, mesh, defaults}
State Defaults. Stack ordering dictates priority for overrides. Placing a state-group at the back of the array causes its values to only be chosen as a fall-back.
    StateGroup* stack[] = {material, mesh, defaults}
[Diagram: material (Blend, Programs, Raster, Texture) and mesh (Input Layout, Buffer) sit in front of defaults (Input Layout, Blend, Raster, Depth / Stencil, Programs), so only Depth / Stencil is taken from defaults.]
Compiling a Draw Item. Given a stack and a draw command, they can be pre-compiled into a draw item.
    StateGroup* stack[] = {override, material, mesh, defaults}
    DrawCommand command = { 3, TRIANGLES }
    DrawItem* draw = Compile( stack, command )
[Diagram: override (Blend); material (Blend, Programs, Raster, Texture); mesh (Input Layout, Buffer); defaults (Input Layout, Blend, Raster, Depth / Stencil, Programs); plus the Draw Command.]
Compiling a Draw Item. Given a stack and a draw command, they can be pre-compiled into a draw item.
    StateGroup* stack[] = {override, material, mesh, defaults}
    DrawCommand command = { 3, TRIANGLES }
    DrawItem* draw = Compile( stack, command )
[Diagram: the compiled draw item feeds the whole pipeline. Blend comes from override; Programs, Raster and Texture from material; Input Layout and Buffer from mesh; Depth / Stencil from defaults; plus the Draw Command. The depth and colour target bindings are still unknown (?).]
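The front-of-stack-wins merge can be sketched as a single pass over the stack. This is only an illustration under assumed types: `StateGroup`, `DrawItem`, the `setMask` field and this `Compile` signature are hypothetical stand-ins, and only four states are modelled.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical state kinds, one bit each.
enum StateBit : uint32_t {
    kBlend        = 1u << 0,
    kProgram      = 1u << 1,
    kRaster       = 1u << 2,
    kDepthStencil = 1u << 3,
};

struct StateGroup {
    uint32_t setMask;  // which of the members below this group actually sets
    uint16_t programId;
    uint8_t  blendId;
    uint8_t  rasterId;
    uint8_t  depthStencilId;
};

struct DrawCommand { uint32_t primitiveCount; uint32_t primitiveType; };

struct DrawItem {
    uint16_t programId;
    uint8_t  blendId;
    uint8_t  rasterId;
    uint8_t  depthStencilId;
    DrawCommand command;
};

// stack[0] has the highest priority: the first group that sets a state wins.
inline DrawItem Compile(const StateGroup* const* stack, std::size_t count,
                        DrawCommand command) {
    DrawItem item{};        // zero stands in for a built-in default value
    item.command = command;
    uint32_t resolved = 0;  // states already taken from a higher-priority group
    for (std::size_t i = 0; i < count; ++i) {
        const StateGroup& g = *stack[i];
        const uint32_t take = g.setMask & ~resolved;
        if (take & kBlend)        item.blendId        = g.blendId;
        if (take & kProgram)      item.programId      = g.programId;
        if (take & kRaster)       item.rasterId       = g.rasterId;
        if (take & kDepthStencil) item.depthStencilId = g.depthStencilId;
        resolved |= take;
    }
    return item;
}
```

Because all conflict resolution happens here, the cost is paid once when the item is compiled rather than on every submit.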
Render Passes. Draw Items define all of the pipeline state except for the Depth/Stencil Target and Render Targets. Render Passes define these destination resources, plus the default and override state groups.
    RenderPass* pass = CreatePass( depth, color, defaults, override )
    StateGroup* stack[] = { material, mesh }
    DrawCommand command = { 3, TRIANGLES }
    DrawItem* draw = Compile( stack, command, pass )
    DrawItem* draws[] = { draw }
    Submit( pass, draws )
Resource Bindings
Resource IDs (and state IDs). Similar to GL, we use small integer types to refer to resource allocations & views. No reference counting: a higher level of the engine can wrap reference counting around this simple integer handle scheme if necessary (a la std::shared_ptr). Helps decouple platform-specific types from the client code. This can be a significant memory saving per compiled Draw Item: pointers are 64 bits! Most resource IDs should fit in <16 bits. Some kinds of state IDs might fit in <8 bits! (How many blend modes do you really use?)
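To make the memory argument concrete, here is a small sketch comparing pointer-sized bindings with integer handles. The handle type names and the three-binding structs are hypothetical; the 16-bit and 8-bit widths follow the figures on the slide.

```cpp
#include <cstdint>

// Hypothetical handle types: one distinct struct per resource kind keeps IDs
// type-safe while staying exactly as wide as the integer inside.
struct TextureId { uint16_t value; };
struct BufferId  { uint16_t value; };
struct BlendId   { uint8_t  value; };

// The same three bindings stored as native object pointers...
struct BindingsAsPointers { void* texture; void* buffer; void* blend; };
// ...and as small integer IDs.
struct BindingsAsIds { TextureId texture; BufferId buffer; BlendId blend; };

static_assert(sizeof(TextureId) == 2, "a texture handle is just a 16-bit integer");
static_assert(sizeof(BlendId) == 1, "a blend-state handle is just an 8-bit integer");
static_assert(sizeof(BindingsAsIds) <= 8, "three ID bindings fit in one pointer-sized word");
```

On a typical 64-bit target the pointer version takes 24 bytes, while the ID version fits in 6 to 8 bytes depending on padding.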
Resource slots. Most resource binding points are arrays. Conflicts are resolved per individual array element.
    StateGroup* stack[] = {override, material}
[Diagram: override (Sampler 1, Sampler 2) in front of material (Blend, Programs, Sampler 0, Sampler 1); only the Sampler 1 slot conflicts.]
Resource slots. Resource slots aren't named, only numbered? Sampler 0, Sampler 1, Sampler 2. Constant Buffer 0, Constant Buffer 1, Constant Buffer 2. Using this assumption at this level of the engine greatly simplifies development. Our shader programs struct can use a sampler bitmask of 0x05 to indicate that it uses sampler slot #0 and slot #2 (i.e. ((1<<0) | (1<<2)) == 0x5). The State Group conflict / merging system is built on super simple integer comparisons.
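The slot bitmask arithmetic above, written out as a small sketch. `ShaderProgramInfo` and the helper names are hypothetical; only the 0x05 example comes from the slide.

```cpp
#include <cstdint>

// One bit per numbered slot.
constexpr uint32_t SlotBit(unsigned slot) { return 1u << slot; }

// Hypothetical per-program reflection record.
struct ShaderProgramInfo {
    uint32_t samplerMask;  // bit N set means the program reads sampler slot N
};

constexpr bool UsesSampler(const ShaderProgramInfo& program, unsigned slot) {
    return (program.samplerMask & SlotBit(slot)) != 0;
}

// Slots the program needs that the caller has not bound.
constexpr uint32_t MissingSamplers(const ShaderProgramInfo& program, uint32_t boundMask) {
    return program.samplerMask & ~boundMask;
}

static_assert((SlotBit(0) | SlotBit(2)) == 0x05, "slots #0 and #2");
```

With numbered slots, questions such as "which required bindings are missing?" reduce to one AND and one NOT.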
Resource slots. Using numbered slots requires defining a convention. Constant Buffer 0 is always used for the per-camera matrix data. Constant Buffer 1 is always used for lighting data, etc. This is actually quite useful for magic engine-generated data which always conforms to a known (hard-coded) structure, such as camera matrices, which you want to automatically plug into every object. These are also a good use for the defaults/overrides state groups!
Resource slots. Using named resources requires reflection. To bind data to a named slot, simply use the shader reflection system. Check with the object's shader to discover the number that's associated with that name. This is useful for less rigidly defined structures, such as materials, which may change often during development and vary from object to object.
Input Assembler (D3D11). Binding slots: Input Layout (formats, strides for each element); Index buffer (Buffer + offset); Vertex buffer(s) (Buffer + offset). [Inset: the API-view pipeline diagram from the earlier slide.]
Input Layouts and Vertex Shaders. Input layouts tell the VS where to find the vertex attributes. [Diagram: Stream #0 data: Position 1, Position 2, Position 3. Stream #1 data: TexCoord 1, Normal 1, TexCoord 2, Normal 2, TexCoord 3, Normal 3. Each element is annotated with its offset and stride.]
    struct VS_Input_Full { float3 p : Position; float2 t : TexCoord; float3 n : Normal; };
    struct VS_Input_Thin { float3 p : Position; };
Input Layouts and Vertex Shaders. Lua config files define stream formats (memory layouts for buffers) and vertex formats (VS input structures).
    StreamFormat("example_stream",
    {
        [VertexStream(0)] =
        {
            { Float32, 3, Position },
        },
        [VertexStream(1)] =
        {
            { Float32, 3, Normal },
            { Float32, 2, TexCoord, 0 },
        },
    })
    VertexFormat("VS_Input_Full",
    {
        { "p", float3, Position },
        { "t", float2, TexCoord, 0 },
        { "n", float3, Normal },
    })
    InputLayout( "example_stream", "VS_Input_Full" )
    InputLayout( "example_stream", "VS_Input_Thin" )
    InputLayout( "simple_stream", "VS_Input_Thin" )
Input Assembler (Simplified) Binding slots: Vertex Data Index buffer (Buffer ID + offset) Vertex buffer(s) (Buffer ID + offset) Input Assembler Stream Format Input Layout (now hidden from the user) Instance Data Vertex buffer(s) (Buffer ID + offset)
Shader Resources (D3D11). Binding slots for the Pixel Shader (repeat for other shader stages): Constant Buffer View(s) (Buffer); Shader Resource View(s) (Buffer, Texture); Sampler(s); Unordered Access view(s) (Buffer, Texture). [Inset: the API-view pipeline diagram from the earlier slide.]
Resource Lists. D3D11 allows for 128 texture slots per shader stage. Can we still allow the user to access a hundred textures without the overhead of managing a hundred binding points? How did APIs already solve this for constants / uniforms? Resource lists are constant buffers (UBOs) for texture bindings. Similar to bindless resources. Ports well to Mantle/Vulkan/D3D12 descriptor lists! Only a small number of resource list binding points required. [Example Resource List: Diffuse Map ID, Normal Map ID, Specular Map ID.]
Shader Resources (simplified). Binding slots for all shader stages: Constant Buffer ID(s); Resource List ID(s) (Buffer ID / Texture ID); Sampler ID(s); Unordered Access view(s) (Buffer ID / Texture ID).
Draw Item Resources. Final size of each draw item is usually <1 cache line. Resource List (Buffer ID / Texture ID): 2 to 256 bytes. Draw Item (Constant Buffer ID(s), Resource List ID(s), Sampler ID(s), Unordered Access view(s) (Buffer ID / Texture ID), Raster ID, Depth / Stencil, Blend ID, Program ID, Input Assembler Config ID, Draw Command): 32 to 80 bytes. Input Assembler Config (Vertex Data, Instance Data): 20 to 128 bytes.
State Group Resources. Final look at actual State Group members (all optional). State Group: Constant Buffer ID(s), Resource List ID(s), Sampler ID(s), Unordered Access view(s) (Buffer ID / Texture ID), Raster ID, Depth / Stencil, Blend ID, Vertex Data, Instance Data, Technique ID, Shader Options. [Diagram: the Technique ID and Shader Options lead to the Draw Item's Program ID.]
Shaders
Program management. Out of the box, shaders are hard to manage. One program = Pixel Shader + Vertex Shader (+ Geometry + Tessellation). Most objects/materials require more than one program. Deferred rendering: write GBuffer attributes. Forward rendering: compute all shading and lighting. Shadow mapping: write depth only. Material LOD: enable/disable features (e.g. normal mapping at a distance). Loop unrolling: compile the shader once for each value of N. All of these programs grouped together form a single Technique.
Techniques, Passes, Options, Permutations. A technique is a single shader file (an Effect in MS lingo). Each technique contains several passes: GBuffer, Forward, Depth-Only, etc. Each pass can contain several options: Normal Mapping (y/n), Number of lights [0..8), etc. For each technique, for each pass, for each permutation of options, precompile the shader source file into a program. Careful: each 1-bit option doubles the number of programs!
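The growth in program count is just the product of each option's number of values, which a few lines make explicit. `OptionRange` and `CountPermutations` are hypothetical names used only for this sketch.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of one option: how many distinct values it can take.
struct OptionRange {
    const char* name;
    uint32_t    valueCount;  // 2 for a y/n option, 8 for a light count in [0..8)
};

// Number of programs to precompile for one pass.
inline uint64_t CountPermutations(const std::vector<OptionRange>& options) {
    uint64_t count = 1;
    for (const OptionRange& option : options) {
        count *= option.valueCount;
    }
    return count;
}
```

With the slide's two example options, one pass already needs 2 x 8 = 16 programs, and adding a single extra y/n option takes that to 32.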
[FX] syntax. All the APIs we use (except mobile/Mac/Linux) use a shader language that is close enough to HLSL that we can just write all our shader code in HLSL! A header file full of #defines is enough to smooth over the small differences in syntax. However, resource declaration syntax varies widely. Not all platforms support constant buffers (we support prev-gen / D3D9 / GL2 era). Not all platforms support Resource Lists. Not all platforms support separate Textures and Samplers.
[FX] syntax. A small amount of code generation is used to smooth over these issues. We search for comment blocks of the pattern /*[FX] */ and execute their contents as Lua code. The Lua VM has been pre-registered with functions such as those below, to create a domain-specific language for declaring shader resources and techniques/passes/options:
    CBuffer( slot, stages, name, values )
    TextureList( slot, stages, name, values )
    Option( name, range )
    Pass( slot, name, parameters )
[FX] Examples
    CBuffer( 0, Pixel, 'Material', {
        { g_emissive = float },
    })
    TextureList( 0, Pixel, 'Material', {
        { Tex2D, 's_diffuse', 'Linear' },
    })
    Sampler(0, {Pixel,Vertex}, 'Linear', {
        MinFilter = Linear,
        MagFilter = Linear,
        MipFilter = Linear,
        AddressU = Wrap,
        AddressV = Wrap,
        AddressW = Wrap,
    })
[FX] Examples
    Pass( 0, 'Opaque', {
        vertexshader = 'vs_main';
        pixelshader = 'ps_main';
        vertexlayout = { 'VS_Input_Full' };
        pixeloptions = 'LightCount';
    })
Shader Options. Shader options are all packed together into a bitmask.
    Option( 'NormalMapped' )                       -- pick a bit for me (use reflection!)
    Option( 'NormalMapped', {id=3} )               -- mask == 0x8 (i.e. 1<<3)
    Option( 'LightCount', {id=4, min=1, max=4} )
    bits:  7654 3210
    0x00 / 0000 0000 == LightCount: 1
    0x10 / 0001 0000 == LightCount: 2
    0x20 / 0010 0000 == LightCount: 3
    0x30 / 0011 0000 == LightCount: 4
Shader Options. Given a pass with:
    Option( 'NormalMapped', {id=0} )
    Option( 'LightCount', {id=4, min=1, max=4} )
The permutations would be:
    bits:  7654 3210
    0x00 / 0000 0000 == NormalMapped: 0, LightCount: 1
    0x01 / 0000 0001 == NormalMapped: 1, LightCount: 1
    0x10 / 0001 0000 == NormalMapped: 0, LightCount: 2
    0x11 / 0001 0001 == NormalMapped: 1, LightCount: 2
    0x20 / 0010 0000 == NormalMapped: 0, LightCount: 3
    0x21 / 0010 0001 == NormalMapped: 1, LightCount: 3
    0x30 / 0011 0000 == NormalMapped: 0, LightCount: 4
    0x31 / 0011 0001 == NormalMapped: 1, LightCount: 4
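The table above is consistent with storing each option as (value - min) shifted left by its id. The following sketch encodes and decodes options that way; `OptionDesc` and the helper functions are hypothetical, and treating `id` as the lowest bit of the field is an assumption inferred from the table.

```cpp
#include <cstdint>

// Hypothetical option descriptor: id is the lowest bit the option occupies.
struct OptionDesc {
    uint32_t id;
    uint32_t min;
    uint32_t max;
};

// Bits needed to store the offsets 0 .. (max - min).
inline uint32_t OptionBits(const OptionDesc& o) {
    uint32_t range = o.max - o.min;
    uint32_t bits = 0;
    while (range != 0) { ++bits; range >>= 1; }
    return bits;
}

// Mask covering the option's bits inside the packed bitmask.
inline uint32_t OptionMask(const OptionDesc& o) {
    return static_cast<uint32_t>(((uint64_t(1) << OptionBits(o)) - 1) << o.id);
}

// Store value as an offset from min, shifted into place.
inline uint32_t EncodeOption(const OptionDesc& o, uint32_t value) {
    return ((value - o.min) << o.id) & OptionMask(o);
}

// Recover the option's value from a packed bitmask.
inline uint32_t DecodeOption(const OptionDesc& o, uint32_t bitmask) {
    return ((bitmask & OptionMask(o)) >> o.id) + o.min;
}
```

For example, with NormalMapped at id 0 and LightCount at id 4 (min 1, max 4), encoding NormalMapped = 1 and LightCount = 3 gives 0x01 | 0x20 = 0x21, matching the table.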
Program selection. I lied earlier: I said that a Render Pass has just a depth-texture, render-target(s), a defaults state group and an overrides state group. A Render Pass also specifies a shader pass integer. Look up the technique, then look up the right pass within the technique, and then you've got a potentially long list of permutations. [Diagram: State Group (Technique ID, Shader Options) plus Render Pass (Pass ID) give the Draw Item its Program ID. Step 1, Step 2, Profit!]
Shader Options - runtime. Conflict/merging of shader options state is implemented a little differently. Shader Options are stored as a U32 value plus a U32 mask. [Diagram: a State Group (Technique ID; Shader Options value = 0x04, mask = 0x0F) and a Render Pass (Pass ID; value = 0x80, mask = 0xF0) combine into Merged Options = 0x84.]
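The slide does not spell out the merge rule, so the following is only one plausible reading: each contributor supplies a value and a mask of the bits it cares about, and a higher-priority contributor claims its masked bits first. `ShaderOptions` and `MergeOptions` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical value/mask pair: mask marks the option bits this source sets.
struct ShaderOptions {
    uint32_t value;
    uint32_t mask;
};

// sources[0] has the highest priority; each bit is taken from the first
// source whose mask covers it. Bits nobody claims stay 0.
inline uint32_t MergeOptions(const ShaderOptions* sources, std::size_t count) {
    uint32_t merged = 0;
    uint32_t claimed = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const uint32_t take = sources[i].mask & ~claimed;
        merged |= sources[i].value & take;
        claimed |= take;
    }
    return merged;
}
```

With the slide's numbers, the state group (0x04 under mask 0x0F) and the render pass (0x80 under mask 0xF0) touch disjoint bits, so the result is 0x84 whichever of the two has priority.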
Permutation selection. When compiling your permutations, sort them by CountBitsSet(options_bitmask) such that permutations with more option bits set appear earlier in the array. At runtime, the user creates their own bitmask of requested features. Linearly search through the permutations list, stop when: (requested_options & permutation_options) == permutation_options, i.e. stop as soon as you're not delivering options that weren't asked for. You won't necessarily be able to satisfy the user's request exactly, but this algorithm will give them the program that enables as many of their requests as possible.
Permutation Selection (code)
    int SelectProgramsIndex( u32 techniqueid, u32 passid, u32 featuresrequested )
    {
        Technique& technique = techniques[techniqueid];
        List<Pass>& passes = technique.passes;
        Pass& pass = passes[passid];
        List<Permutation>& permutations = pass.permutations;
        // Permutations are pre-sorted so that those with more option bits come first.
        for( int i = 0, end = permutations.count; i != end; ++i )
        {
            Permutation& permutation = permutations[i];
            // Accept the first permutation that enables nothing beyond what was requested.
            if( (featuresrequested & permutation.features) == permutation.features )
                return permutation.bindingidx;
        }
        return -1; // no suitable permutation
    }
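A self-contained sketch of the same idea, including the build-time sort that the search relies on. The `Permutation` struct, `SortPermutations` and `SelectProgram` are simplified stand-ins using `std::vector`, not the engine's own types.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Permutation {
    uint32_t features;      // option bits this precompiled program enables
    int      programIndex;  // hypothetical index into a program table
};

// Population count via Kernighan's trick.
inline int CountBitsSet(uint32_t v) {
    int n = 0;
    while (v != 0) { v &= v - 1; ++n; }
    return n;
}

// Build step: permutations with more option bits set come first.
inline void SortPermutations(std::vector<Permutation>& permutations) {
    std::stable_sort(permutations.begin(), permutations.end(),
        [](const Permutation& a, const Permutation& b) {
            return CountBitsSet(a.features) > CountBitsSet(b.features);
        });
}

// Runtime: first permutation that enables nothing beyond the request.
inline int SelectProgram(const std::vector<Permutation>& permutations,
                         uint32_t requested) {
    for (const Permutation& p : permutations) {
        if ((requested & p.features) == p.features) {
            return p.programIndex;
        }
    }
    return -1;  // nothing suitable, e.g. an empty list
}
```

If a request contains a bit that no permutation offers, the search falls through to a permutation that ignores that bit, which is the "as many of their requests as possible" behaviour described on the previous slide.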
Q&A?
Thanks!
Bonus slides. That I was going to write but then I didn't.
GLSL notes. GL + GLSL are just specifications: vendors create implementations (which are all broken). Validate your shaders using the Khronos reference compiler*. Don't ship your source files. Implement a pre-processor for #include, etc. Obfuscate your shipping code if you feel the need. No guarantees that every vendor will optimize (or compile) your code properly! Implement a GLSL->AST->GLSL optimizing compiler. Or better: a HLSL->AST->GLSL optimizing compiler! Automate this! *http://tiny.cc/khronos
Draw sorting. Write a function that hashes a compiled Draw Item. More expensive state changes should be associated with more significant bits in the output. [Diagram: the Draw Item's fields (shader & pipeline state: Program ID, Raster ID, Blend ID, Depth / Stencil; textures: Resource List ID(s), Sampler ID(s), Unordered Access view(s); Constant Buffer ID(s); Input Assembler Config ID; Draw Command) go through Hash to produce the sorting key, e.g. 0x12345678.]
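A sketch of such a key, assuming (purely for illustration) that a program change is the most expensive, followed by resource lists, then blend and raster state. The struct, the bit widths and that cost ordering are hypothetical choices, not figures from the talk.

```cpp
#include <cstdint>

// Hypothetical subset of a compiled draw item's state IDs.
struct DrawItemIds {
    uint16_t programId;
    uint16_t resourceListId;
    uint8_t  blendId;
    uint8_t  rasterId;
};

// Layout: [31..20] program | [19..8] resource list | [7..4] blend | [3..0] raster.
// More expensive changes occupy more significant bits, so sorting by key
// groups draws that share the costliest state.
inline uint32_t MakeSortKey(const DrawItemIds& d) {
    return (static_cast<uint32_t>(d.programId      & 0xFFFu) << 20)
         | (static_cast<uint32_t>(d.resourceListId & 0xFFFu) << 8)
         | (static_cast<uint32_t>(d.blendId        & 0xFu)   << 4)
         |  static_cast<uint32_t>(d.rasterId       & 0xFu);
}
```

Truncating IDs to fit the key only affects sort quality, not correctness: two draws with colliding keys are still submitted with their own full state.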
Transparent Draw sorting. Alpha-blended geometry must be rendered from back to front. Don't use the draw item's hash, use its distance from the camera. [Diagram: distance / depth is converted with ~*(u32*)&distance into the sorting key, e.g. 0xABCDEF12.]
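The bit trick on the slide works because non-negative IEEE-754 floats keep their ordering when their bits are read as an unsigned integer; inverting the bits then reverses that ordering. A sketch using `std::memcpy` instead of a pointer cast to stay within the aliasing rules (the function name is hypothetical, and the distance is assumed to be non-negative):

```cpp
#include <cstdint>
#include <cstring>

// Ascending sort on this key yields back-to-front order.
// Assumes distance >= 0 and a 32-bit IEEE-754 float.
inline uint32_t BackToFrontKey(float distance) {
    static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");
    uint32_t bits;
    std::memcpy(&bits, &distance, sizeof bits);  // reinterpret the float's bits
    return ~bits;                                // farther means smaller key
}
```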
Hybrid Draw sorting. For opaque geometry to make use of Hi-Z, you want to render front-to-back. However, you also want to sort by state to reduce CPU costs. Compromise by using a hybrid. [Diagram: Distance is reduced to a Coarse Depth, which is merged with the Original Hash; the original sorting key 0x12345678 becomes the new sorting key 0xABCD1357.]
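One way to build such a hybrid key is to quantise the distance into a few coarse buckets for the top bits and keep the most significant part of the state hash below them. The 8-bit bucket count, the `maxDistance` parameter and the function name are assumptions for this sketch; the slide does not give the actual split.

```cpp
#include <algorithm>
#include <cstdint>

// Top 8 bits: coarse depth bucket (0 = nearest). Low 24 bits: top of the
// state hash. Ascending sort gives front-to-back buckets, state-sorted within
// each bucket. Assumes maxDistance > 0.
inline uint32_t HybridSortKey(float distance, float maxDistance, uint32_t stateHash) {
    const float t = std::min(std::max(distance / maxDistance, 0.0f), 1.0f);
    const uint32_t coarseDepth = static_cast<uint32_t>(t * 255.0f);
    return (coarseDepth << 24) | (stateHash >> 8);
}
```

Fewer depth bits favour state sorting (lower CPU cost); more depth bits favour strict front-to-back order (better Hi-Z rejection).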
Redundant state filtering. Each draw item is a very compact structure, containing state IDs. XORing two draw items creates a bitmask that highlights any changes. Masking out sections of that bitmask and comparing them to zero lets you quickly check if a state has changed since the previous draw item.
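A sketch of that check on a single 64-bit word of packed state IDs. The field layout, mask names and `PackStates` helper are hypothetical; a real draw item would span several words, each handled the same way.

```cpp
#include <cstdint>

// Hypothetical packing: [63..48] program | [47..40] blend | [39..32] raster |
// [31..0] resource IDs.
constexpr uint64_t kProgramMask  = 0xFFFFull << 48;
constexpr uint64_t kBlendMask    = 0xFFull << 40;
constexpr uint64_t kRasterMask   = 0xFFull << 32;
constexpr uint64_t kResourceMask = 0xFFFFFFFFull;

inline uint64_t PackStates(uint16_t program, uint8_t blend, uint8_t raster,
                           uint32_t resources) {
    return (static_cast<uint64_t>(program) << 48)
         | (static_cast<uint64_t>(blend)   << 40)
         | (static_cast<uint64_t>(raster)  << 32)
         |  static_cast<uint64_t>(resources);
}

// XOR leaves a 1 wherever the two items differ; the mask picks the field.
inline bool StateChanged(uint64_t previous, uint64_t current, uint64_t mask) {
    return ((previous ^ current) & mask) != 0;
}
```

The submit loop would then only issue the native "set blend" call when `StateChanged(prev, cur, kBlendMask)` is true, and likewise for each other field.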
Resource Management
Data conditioning / compilation
Shader compilation fun
Devices, contexts & command lists
Devices
Contexts
Multithreading on old APIs
Higher level layer examples
Scene manager
Materials
Lighting