Optimizing Graphics Drivers with Streaming SIMD Extensions. Copyright 1999, Intel Corporation. All rights reserved.

1 Optimizing Graphics Drivers with Streaming SIMD Extensions

2 Agenda
- Features of Pentium III processor
- Graphics Stack Overview
- Driver Architectures
- Streaming SIMD Impact on D3D
- Streaming SIMD Impact on OpenGL*
- Summary
- Other Tips
- Resources
*Third-party brands and names are the property of their respective owners.

3 Features of Streaming SIMD Extensions
- SIMD FP instructions
  - SIMD FP for basic math and square root
  - Fast approximations for reciprocal and reciprocal square root
- Cacheability instructions/features
  - Prefetching
  - Non-temporal storage cache
  - Streaming stores
- New integer SIMD instructions

4 Graphics Stack
[Diagram: two parallel software stacks, top to bottom]
- DirectX*: D3D Graphics Application, D3D "Pipeline" (Retained Mode API, Immediate Mode API, Transformation and Lighting), HAL/HEL, Hardware
- OpenGL*: OpenGL Graphics Application, OpenGL ICD, Hardware

5 Online vs Offline Drivers
- Traditional "Offline" Driver has at least two passes
  - D3D Pipeline -> vertices in D3D "TL Format" buffers -> HAL/HEL -> vertices in proprietary-format buffers in AGP
- "Online" Driver uses a single pass for each "batch"
  - D3D Pipeline and HAL/HEL -> vertices in D3D "TL Format" buffers in AGP
- This example uses DirectX*, but the Online Driver is currently more applicable to OpenGL* and other pipelines

6 Online Driver Advantages
- Better bus utilization
  - Interleaved read/process/store
- More CPU/graphics concurrency
  - Higher TPS (triangles per second), unless "setup bound"
- Greater efficiency
  - Eliminates a copy operation
- Less code to maintain
  - HAL becomes smaller and simpler

7 Online Driver Caveats
- 3D chip must support the vertex format
  - DX: TL_Vertex, flexible vertex formats
- DX 6.1: clipping is done after the pipeline, and reads vertices back from memory
  - AGP memory is fast to write to, but slow to read from (not cached)
- OpenGL*: clip without an AGP read
  - re-transform vertices of clipped triangles
  - save vertices in AGP + cached temp buf
- Don't use the Online Driver for DX 6.1

8 Streaming SIMD Impact on D3D
- Optimizations already in DX 6.1
  - Geometry and lighting pipeline
- IHVs can only optimize in HAL/HEL
  - 3D data setup (e.g., color conversions)
  - Moving data
  - Texture wrap
  - HEL for missing HW feature emulation
- A thick HEL can greatly benefit

9 Streaming SIMD Impact on OpenGL*
- Optimize for CPU and graphics chip
- Use a single-pass, small-batch pipeline
  - Run a batch of vertices through all steps
  - Better SIMD, cache usage, less overhead
- Intel GE OpenGL add-on for Pentium II and III, Microsoft* or SGI
- You can optimize the ICD for CPU and GC (graphics chip)

10 ICD Streaming SIMD Opportunities
- Transformation and lighting
- Triangle data setup
- Back-face culling
- Clipcode calculations
- Clipping
- Color conversions
- Moving data

11 Triangle Setup
- Division via RCPPS and/or Newton-Raphson
  - Reciprocal area
  - Reciprocal Z
- Float to fixed-point conversion
  - e.g. XYZ or texture coordinates
- Other XY modifications
  - e.g. add 0.5 to all X and Y coordinates
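The "RCPPS and/or Newton-Raphson" bullet can be sketched with intrinsics as follows. This is a minimal illustration, not the presenters' code: the function names `recip_nr` and `recip4` are made up for the example, and a single refinement step x1 = x0 * (2 - d * x0) is assumed to be enough for setup-quality precision.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rcp_ps, _mm_mul_ps, ... */

/* Approximate 1/d for four floats at once.
 * RCPPS alone gives roughly 12 bits of precision; one Newton-Raphson
 * step, x1 = x0 * (2 - d * x0), roughly doubles that. */
static __m128 recip_nr(__m128 d)
{
    __m128 x0  = _mm_rcp_ps(d);                 /* RCPPS estimate */
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(x0, _mm_sub_ps(two, _mm_mul_ps(d, x0)));
}

/* Hypothetical helper: reciprocals of four values, e.g. four triangle
 * areas or four Z values. Inputs are assumed to be non-zero. */
static void recip4(const float in[4], float out[4])
{
    assert(in && out);
    _mm_storeu_ps(out, recip_nr(_mm_loadu_ps(in)));
}
```

Skipping the refinement step trades accuracy for speed; whether the raw estimate is acceptable depends on how the result is used.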

12 Back-face Culling
- Some 3D hardware can cull, but the HAL usually has time to do it
- Cross product of two triangle edges
  - Sign indicates front/back facing
  - Result is a measure of area
- Often can discard zero-area and tiny tris
  - Some apps often generate zero-area tris!
  - E.g. discard if area is under 1/16th pixel area
  - In theory could leave pinholes; rarely seen
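A scalar sketch of the test described on this slide: the 2D cross product of two screen-space edges gives twice the signed area, so one value serves both the facing test and the tiny-triangle test. The `Vec2` type, the function names, and the choice of counter-clockwise as front-facing are assumptions made for the example.

```c
#include <assert.h>

typedef struct { float x, y; } Vec2;  /* screen-space position (illustrative type) */

/* Twice the signed area of triangle abc: cross product of edges (b-a) and (c-a).
 * Positive means counter-clockwise when y points up. */
static float signed_area2(Vec2 a, Vec2 b, Vec2 c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

/* Returns 1 if the triangle should be dropped: it is back-facing
 * (negative area, assuming CCW is front-facing) or its area is below
 * min_area, e.g. 1/16 of a pixel. */
static int cull_triangle(Vec2 a, Vec2 b, Vec2 c, float min_area)
{
    assert(min_area >= 0.0f);
    float area = 0.5f * signed_area2(a, b, c);
    return area < min_area;
}
```

With `min_area` set to `1.0f / 16.0f`, zero-area triangles are rejected by the same comparison that rejects back faces.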

13 Copying on Pentium III Processor
- 450 MHz system, BX chipset
- Copy TL_VERTEX: 32 bytes
  - Memory to memory with MOVNTPS: 50 CPU clocks
  - L1 cache to memory using MOVNTPS: [...] CPU clocks
- Numbers should approximate writing to USWC
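A copy loop of the kind measured here might look like the sketch below, which moves 32-byte vertices with two MOVNTPS stores each. The function name and the requirement that both buffers be 16-byte aligned are assumptions of the example, not details from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_load_ps, _mm_stream_ps, _mm_sfence */

/* Copy `count` 32-byte vertices (8 floats each) using non-temporal stores.
 * Both pointers are assumed to be 16-byte aligned. */
static void copy_vertices_nt(float *dst, const float *src, size_t count)
{
    assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);
    for (size_t i = 0; i < count; ++i) {
        __m128 lo = _mm_load_ps(src + 8 * i);       /* first 16 bytes  */
        __m128 hi = _mm_load_ps(src + 8 * i + 4);   /* second 16 bytes */
        _mm_stream_ps(dst + 8 * i,     lo);         /* MOVNTPS: bypasses the cache */
        _mm_stream_ps(dst + 8 * i + 4, hi);
    }
    _mm_sfence();  /* make the streamed data visible before the buffer is handed off */
}
```

Non-temporal stores are aimed at data the CPU will not read again soon, which matches a vertex buffer destined for the graphics chip.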

14 Impact of Prefetch
- Useful on large (many line) loops
  - compute bound or poor CPU/bus overlap
- Can prefetch input AND output buffers
- Usually won't benefit D3D HALs
  - data often passed to HAL in cache
  - might prefetch L2 to L1 for large buffers
- Can benefit OpenGL* pipeline
  - prefetch vertices and normals
  - prefetch large temp buffer

15 Branches Bad, SIMD Good
- Branches cause stalling and are not SIMD
- Some branches can be avoided
  - Conditional move: CMOV, MASKMOVQ
  - Average: PAVGx
  - Sum of absolute differences: PSADBW
  - Clamp/saturate: MINPS/MAXPS (FP), PMAXxx/PMINxx (INT)
  - Select values: CMPPS, MOVMSKPS, ANDNPS, ORPS (FP); PCMPxxx, MOVQ, PANDN, POR (INT)
- Reduce branching
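Two of the branch-free patterns listed above, clamp and select, can be sketched with intrinsics as follows. The function names are illustrative; the select uses ANDPS in addition to the CMPPS, ANDNPS and ORPS instructions named on the slide.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Clamp four floats to [lo, hi] with MAXPS then MINPS; no branches. */
static __m128 clamp_ps(__m128 v, __m128 lo, __m128 hi)
{
    return _mm_min_ps(_mm_max_ps(v, lo), hi);
}

/* Per-lane select: result = (a < b) ? x : y.
 * CMPPS builds an all-ones/all-zeros mask per lane; AND/ANDN/OR merge the inputs. */
static __m128 select_lt_ps(__m128 a, __m128 b, __m128 x, __m128 y)
{
    __m128 mask = _mm_cmplt_ps(a, b);
    return _mm_or_ps(_mm_and_ps(mask, x), _mm_andnot_ps(mask, y));
}

/* Hypothetical array wrapper: clamp four values in place. */
static void clamp4(float v[4], float lo, float hi)
{
    assert(lo <= hi);
    _mm_storeu_ps(v, clamp_ps(_mm_loadu_ps(v), _mm_set1_ps(lo), _mm_set1_ps(hi)));
}
```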

16 Minimizing the Negative Effect of Branches
- Try to move branches outside loops
- Some branches are OK
  - Well-predicted branches
  - Know the static branch prediction rules
    - Forward conditional branches are predicted as not taken
    - Backward conditional branches are predicted as taken
  - Branch to avoid a large block of code
- SIMD condition check: MOVMSKPS and PMOVMSKB
- Make necessary branches cheaper

17 Branch Example: Culling
- Non-SIMD:
  - Culling mode (CW, CCW, None) branch is well predicted (no need to avoid)
  - Triangle facing is not so well predicted
- Test triangle facing, then:
  - CMOV pointer to cached memory to replace pointer to AGP memory
  - CMOV zero to count of bytes written to AGP
  - Write using pointer, add count to total
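The "always write, conditionally advance" idea can be sketched in C as below. The names, the 96-byte triangle size (three 32-byte TL vertices) and the scratch buffer are assumptions for the example. C cannot request CMOV directly; the conditional expressions are written so that a compiler is free to emit conditional moves, but that is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TRI_BYTES = 96 };  /* three 32-byte TL vertices (assumed layout) */

/* Write one triangle without branching on its facing.
 * Culled triangles are written to a cached scratch buffer and add zero
 * to the running byte count; visible ones go to the output (AGP) buffer.
 * Returns the new total of bytes written to `out`. */
static size_t emit_triangle(unsigned char *out, size_t total,
                            unsigned char *scratch,
                            const unsigned char *tri, int front_facing)
{
    assert(out && scratch && tri);
    unsigned char *dst = front_facing ? out + total : scratch; /* candidate for CMOV */
    size_t count       = front_facing ? (size_t)TRI_BYTES : 0; /* candidate for CMOV */
    memcpy(dst, tri, TRI_BYTES);  /* unconditional write */
    return total + count;
}
```

The wasted write for a culled triangle lands in cached memory, so it costs little compared with a mispredicted branch.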

18 Better Culling in OpenGL*
- Ordered primitives
  - Test in model space (before transform)
  - reduces transform/lighting work
- Indexed primitives
  - Test after transform in view or screen space
  - must transform all vertices, but not as many as with ordered primitives

19 Better Clipping in OpenGL*
- Clip code generation is often used
- gl_ext_clip_volume: useful optimization
  - works especially well for indexed primitives
  - only generate code once per vertex
- Implement clip-hint extension
  - app hints when objects are fully in view

20 Clip-code Generation for OpenGL*
- Test XY in screen space
  - must transform all vertices, but not as many as with ordered primitives
  - works well for indexed primitives
- Test XYZ in model space vs. view frustum
  - drop out-of-view triangles before Xform
  - less transform/lighting to do
  - best for ordered triangles
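As an illustration of SIMD clip-code generation, the sketch below computes 6-bit outcodes for four vertices at once using CMPPS and MOVMSKPS. It assumes the vertices are already in clip space in SoA layout and uses the OpenGL convention -w <= x, y, z <= w. The bit assignment and function name are made up for the example, and the deck's own "SIMD Integer" variant may work differently.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Outcode bits (illustrative assignment):
 * bit 0: x < -w   bit 1: x > w
 * bit 2: y < -w   bit 3: y > w
 * bit 4: z < -w   bit 5: z > w */
static void clip_codes4(const float x[4], const float y[4],
                        const float z[4], const float w[4],
                        unsigned char codes[4])
{
    assert(x && y && z && w && codes);
    __m128 vx = _mm_loadu_ps(x), vy = _mm_loadu_ps(y);
    __m128 vz = _mm_loadu_ps(z), vw = _mm_loadu_ps(w);
    __m128 nw = _mm_sub_ps(_mm_setzero_ps(), vw);   /* -w */

    int m[6];  /* one 4-bit mask per plane, one bit per vertex */
    m[0] = _mm_movemask_ps(_mm_cmplt_ps(vx, nw));
    m[1] = _mm_movemask_ps(_mm_cmpgt_ps(vx, vw));
    m[2] = _mm_movemask_ps(_mm_cmplt_ps(vy, nw));
    m[3] = _mm_movemask_ps(_mm_cmpgt_ps(vy, vw));
    m[4] = _mm_movemask_ps(_mm_cmplt_ps(vz, nw));
    m[5] = _mm_movemask_ps(_mm_cmpgt_ps(vz, vw));

    /* Regroup the per-plane masks into one outcode per vertex. */
    for (int v = 0; v < 4; ++v) {
        unsigned code = 0;
        for (int p = 0; p < 6; ++p)
            code |= (unsigned)((m[p] >> v) & 1) << p;
        codes[v] = (unsigned char)code;
    }
}
```

If all six masks are zero, the whole batch of four is inside the frustum and the regrouping loop could be skipped.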

21 ClipCode Generation
- Scalar integer, no prefetch: 60 clocks/vert
- Scalar integer, prefetch: 30 clocks/vert
- SIMD integer, prefetch: 20 clocks/vert
- All timings from a transform/clip func
  - measure full time per vertex
  - disable clip codes to get Xform time
  - clip code time = total - Xform

22 OpenGL* Geometry
- Vertex transform: to swizzle, or not?
- For large vertex sets, transpose to SoA
  - get four vertices into xxxx, yyyy, zzzz
  - use MOVLPS/MOVHPS for faster transpose
- For smaller sets, use SIMD but AoS
  - load xyz, shuffle to xxxx, yyyy, zzzz, MULPS by matrix rows, ADDPS to get XYZW result
  - somewhat slower per vertex than SoA
- x87 and SIMD FP have different precision
- Use scalar and packed (SS/PS)
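The AoS path (broadcast each coordinate, multiply by matrix rows, add) can be sketched as follows for a single vertex. The matrix layout is an assumption: `m[0..3]` is the group of four floats that multiplies x, `m[4..7]` multiplies y, `m[8..11]` multiplies z, and `m[12..15]` is added as the translation term. The input is assumed to be padded to four floats, and its w is treated as 1.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Transform one point: out = x*r0 + y*r1 + z*r2 + r3,
 * where r0..r3 are the four 4-float groups of m (assumed layout, see lead-in). */
static void transform_point(const float m[16], const float in[4], float out[4])
{
    assert(m && in && out);
    __m128 v = _mm_loadu_ps(in);                               /* x y z (w ignored) */
    __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));  /* xxxx */
    __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));  /* yyyy */
    __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));  /* zzzz */

    __m128 r = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(x, _mm_loadu_ps(m + 0)),
                   _mm_mul_ps(y, _mm_loadu_ps(m + 4))),
        _mm_add_ps(_mm_mul_ps(z, _mm_loadu_ps(m + 8)),
                   _mm_loadu_ps(m + 12)));
    _mm_storeu_ps(out, r);                                     /* X Y Z W */
}
```

The SoA path does the same arithmetic but processes four vertices per instruction, which is why the slide calls the AoS version somewhat slower per vertex.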

23 Fast Transposed-Load Macro
/* Loads x,y,z,w of four vertices (AoS, `stride` bytes apart) into SoA registers. */
#define AosLoad( in, stride, x, y, z, w ) \
{ __m128 tmp ; \
  /* x,y pairs of vertices 0..3, two floats per 64-bit load */ \
  x = _mm_loadl_pi( x, ( __m64 *)(in) ); \
  x = _mm_loadh_pi( x, ( __m64 *)((stride) + (char *)(in) ) ); \
  y = _mm_loadl_pi( y, ( __m64 *)(2*(stride) + (char *)(in) ) ); \
  y = _mm_loadh_pi( y, ( __m64 *)(3*(stride) + (char *)(in) ) ); \
  tmp = _mm_shuffle_ps( x, y, _MM_SHUFFLE( 2, 0, 2, 0 ) ); /* xxxx */ \
  y = _mm_shuffle_ps( x, y, _MM_SHUFFLE( 3, 1, 3, 1 ) );   /* yyyy */ \
  x = tmp ; \
  \
  /* z,w pairs start 8 bytes into each vertex */ \
  z = _mm_loadl_pi( z, ( __m64 *)(8 + (char *)(in) ) ); \
  z = _mm_loadh_pi( z, ( __m64 *)((stride) + 8 + (char *)(in) ) ); \
  w = _mm_loadl_pi( w, ( __m64 *)(2*(stride) + 8 + (char *)(in) ) ); \
  w = _mm_loadh_pi( w, ( __m64 *)(3*(stride) + 8 + (char *)(in) ) ); \
  tmp = _mm_shuffle_ps( z, w, _MM_SHUFFLE( 2, 0, 2, 0 ) ); /* zzzz */ \
  w = _mm_shuffle_ps( z, w, _MM_SHUFFLE( 3, 1, 3, 1 ) );   /* wwww */ \
  z = tmp ; }
- Note that this example uses intrinsics rather than assembler

24 Geometry on Pentium III Processor
[Graph: clocks per vertex for the transformation step]
- x87 to/from L2, copy to mem: 155 clks/vert
- x87 to/from L2: 52 clks/vert
- SIMD FP to/from L2: 45 clks/vert
- SIMD FP from NTPS to AGP: 40 clks/vert
- Graph shows only the transformation step
- When lighting and projection are considered, the improvement is even bigger
- Accelerates T & L, L, improves cache usage

25 OpenGL* RGBA Color Packing
- Given single-precision R, G, B and A, we want to convert to packed 8-bit ints
  - MULPS: multiply all elements by [...]
  - MINPS/MAXPS: clamp/saturate
  - CVTTPS2PI, MOVHLPS, CVTTPS2PI
  - MMX pack to bytes, store packed RGBA
- EMMS when done with all vertices
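The multiply, clamp, convert and pack sequence can be sketched as below. Two things differ from the slide and are assumptions of this example: the scale factor is taken to be 255.0 (the slide's value is missing from the transcript), and the conversion uses SSE2 integer instructions (CVTTPS2DQ, PACKSSDW, PACKUSWB) in place of the CVTTPS2PI/MMX path, so no EMMS is needed. The function name `pack_rgba8` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2; swapped in for the slide's SSE + MMX sequence */

/* Convert float R,G,B,A (nominally 0..1) to packed 8-bit RGBA,
 * with R in the lowest byte of the result. */
static uint32_t pack_rgba8(const float rgba[4])
{
    assert(rgba);
    __m128 scale = _mm_set1_ps(255.0f);               /* assumed scale factor */
    __m128 v = _mm_mul_ps(_mm_loadu_ps(rgba), scale); /* MULPS */
    v = _mm_min_ps(_mm_max_ps(v, _mm_setzero_ps()), scale); /* MAXPS/MINPS clamp */
    __m128i i32 = _mm_cvttps_epi32(v);                /* truncate to 32-bit ints */
    __m128i i16 = _mm_packs_epi32(i32, i32);          /* 32 -> 16 bit */
    __m128i i8  = _mm_packus_epi16(i16, i16);         /* 16 -> 8 bit, unsigned saturate */
    return (uint32_t)_mm_cvtsi128_si32(i8);
}
```

Truncation rounds toward zero; adding 0.5 before the conversion would give round-to-nearest if that is preferred.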

26 Summary
- You can use the Pentium III processor's Streaming SIMD Extensions and AGP to deliver more triangles per second to the graphics subsystem
- Wherever there is...
  - vectorizable math or logic
  - or large amounts of data
- there are big opportunities to run faster with the Pentium III processor

27 More Tips
- Unaligned: use MOVHPS and MOVLPS
  - Cache line split less often than MOVUPS
  - May save values for you (colors)
- Measure improvement with:
  - Time stamps, frame counters
  - Intel tools: VTune performance tool and IPEAK
  - Geometry subtests of major benchmarks:
    - ZD's WinBench 3D geometry subtest (D3D)
    - Sunset's Indy 3D (OpenGL*)

28 Resources
- Pentium III processor information on the web:
- Numega's* Softice* debugger
  - New version supports Pentium III processor

29 More Information

30 Intrinsics
- Pros
  - Does all the register management for you
  - Compiler interprets them, continues to fully optimize
  - Easier to read for some people
  - Some replace several steps
- Cons
  - About 6% less efficient than the best asm language
  - Must use Intel compiler for now
    - takes a little longer to compile
    - better at some optimizations, worse at others
    - critical code coach (pro?)

31 Intel IPEAK Graphics Performance Toolkit
- Can analyze DirectX* 6.1 apps for:
  - frames/sec
  - triangle usage (#, pixels, # by pixels)
  - texture utilization
- Helps detect performance limiting factors
- OpenGL* and D3D supported on Windows* 9x and Windows NT* 4.0 (5.0 soon)
