Application Note. November 28, Software Solutions Group

Similar documents
Optimizing Graphics Drivers with Streaming SIMD Extensions. Copyright 1999, Intel Corporation. All rights reserved.

SWAR: MMX, SSE, SSE 2 Multiplatform Programming

IA-32 Intel Architecture Software Developer s Manual

Instruction Set Progression. from MMX Technology through Streaming SIMD Extensions 2

Intel SIMD architecture. Computer Organization and Assembly Languages Yung-Yu Chuang 2006/12/25

Intel SIMD architecture. Computer Organization and Assembly Languages Yung-Yu Chuang

Using MMX Instructions to Perform 16-Bit x 31-Bit Multiplication

IA-32 Intel Architecture Software Developer s Manual

SSE2 Optimization OpenGL Data Stream Case Study. Arun Kumar Intel Corporation

Datapoint 2200 IA-32. main memory. components. implemented by Intel in the Nicholas FitzRoy-Dale

CS802 Parallel Processing Class Notes

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using MMX Instructions to implement 2X 8-bit Image Scaling

Image Processing Acceleration Techniques using Intel Streaming SIMD Extensions and Intel Advanced Vector Extensions

Stack -- Memory which holds register contents. Will keep the EIP of the next address after the call

Registers. Ray Seyfarth. September 8, Bit Intel Assembly Language c 2011 Ray Seyfarth

Intel 64 and IA-32 Architectures Software Developer s Manual

Deriving Triangle Plane Equations

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 2: Low Level. Welcome!

Using MMX Instructions to Implement a 1/3T Equalizer

The Skeleton Assembly Line

IA-32 Intel Architecture Software Developer s Manual

An Efficient Vector/Matrix Multiply Routine using MMX Technology

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 3: Low Level. Welcome!

CS220. April 25, 2007

P.G.TRB - COMPUTER SCIENCE. c) data processing language d) none of the above

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Optimizing Memory Bandwidth

EECE.3170: Microprocessor Systems Design I Summer 2017 Homework 4 Solution

Accelerating 3D Geometry Transformation with Intel MMX TM Technology

High Performance Computing. Classes of computing SISD. Computation Consists of :

The Instruction Set. Chapter 5

srl - shift right logical - 0 enters from left, bit drops off right end note: little-endian bit notation msb lsb "b" for bit

Assembly level Programming. 198:211 Computer Architecture. (recall) Von Neumann Architecture. Simplified hardware view. Lecture 10 Fall 2012

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2018 Lecture 4

SSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals

CS Bootcamp x86-64 Autumn 2015

Systems I. Machine-Level Programming I: Introduction

Computer Labs: Profiling and Optimization

Instruction Set extensions to X86. Floating Point SIMD instructions

Assembly Programmer s View Lecture 4A Machine-Level Programming I: Introduction

Computer Architecture and Assembly Language. Practical Session 5

IN5050: Programming heterogeneous multi-core processors SIMD (and SIMT)

Using MMX Instructions to Perform Simple Vector Operations

12.1 Chapter Overview Tables Function Computation via Table Look-up. Calculation Via Table Lookups

Assembly Language Each statement in an assembly language program consists of four parts or fields.

Binary Logic (review)

Lecture (05) x86 programming 4

Intel 64 and IA-32 Architectures Software Developer s Manual

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

CSC201, SECTION 002, Fall 2000: Homework Assignment #3

ADDRESSING MODES. Operands specify the data to be used by an instruction

Intel Architecture Optimization

Machine and Assembly Language Principles

Intel Architecture MMX Technology

x86 assembly CS449 Fall 2017

Interfacing Compiler and Hardware. Computer Systems Architecture. Processor Types And Instruction Sets. What Instructions Should A Processor Offer?

CSE P 501 Compilers. x86 Lite for Compiler Writers Hal Perkins Autumn /25/ Hal Perkins & UW CSE J-1

SSE/SSE2 Toolbox Solutions for Real-Life SIMD Problems

Advanced Procedural Texturing Using MMX Technology

Superscalar Processors

Topics Power tends to corrupt; absolute power corrupts absolutely. Computer Organization CS Data Representation

Assembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture Included elements of the IA-64 bit

x86 Code Optimization Guide

x86 Programming I CSE 351 Winter

CS412/CS413. Introduction to Compilers Tim Teitelbaum. Lecture 21: Generating Pentium Code 10 March 08

CMSC Lecture 03. UMBC, CMSC313, Richard Chang

Floating Point Instructions

Machine Code and Assemblers November 6

Instruction Set Architectures

Exploiting Stack Buffer Overflows Learning how blackhats smash the stack for fun and profit so we can prevent it

Instruction Set Architectures

Today: Machine Programming I: Basics

1 Overview of the AMD64 Architecture

Program Exploitation Intro

MACHINE-LEVEL PROGRAMMING IV: Computer Organization and Architecture

X86 Addressing Modes Chapter 3" Review: Instructions to Recognize"

History of the Intel 80x86

Credits and Disclaimers

Overview of Compiler. A. Introduction

Improving the compute performance of video processing software using AVX (Advanced Vector Extensions) instructions

Machine-level Representation of Programs. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

CS 261 Fall Mike Lam, Professor. x86-64 Data Structures and Misc. Topics

mith College Computer Science Fixed-Point & Floating-Point Number Formats CSC231 Dominique Thiébaut

Understand the factors involved in instruction set

Chapter 3: Addressing Modes

Machine-Level Programming I: Introduction Jan. 30, 2001

UNIT 2 PROCESSORS ORGANIZATION CONT.

System Software Assignment 1 Runtime Support for Procedures

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 2: Low Level. Welcome!

Power Estimation of UVA CS754 CMP Architecture

Machine-Level Programming Introduction

Computer Systems Architecture I. CSE 560M Lecture 3 Prof. Patrick Crowley

AMD Opteron TM & PGI: Enabling the Worlds Fastest LS-DYNA Performance

CSCI 402: Computer Architectures

Computer Systems Lecture 9

Computer Labs: Profiling and Optimization

Assembly Language for Intel-Based Computers, 4 th Edition. Lecture 25: Interface With High-Level Language

System calls and assembler

Winter Compiler Construction T11 Activation records + Introduction to x86 assembly. Today. Tips for PA4. Today:

Transcription:

)DVW&RORU&RQYHUVLRQ8VLQJ 6WUHDPLQJ6,0'([WHQVLRQVDQG 00;Œ7HFKQRORJ\ Application Note November 28, 2001 Software Solutions Group

7DEOHRI&RQWHQWV 1.0 INTRODUCTION... 3 2.0 BACKGROUND & BASICS... 3 2.1 A COMMON EXAMPLE... 3 2.2 A CLOSER LOOK AT THE INSTRUCTION STREAM... 4 3.0 THE NEW AND IMPROVED ALGORITHM... 5 3.1 DIGGING A LITTLE DEEPER... 6 4.0 WRAPPING IT UP... 8 APPENDIX A: EXAMPLE SOURCE... 9 TABLE OF FIGURES FIGURE 1: RGB-SOA DATA... 5 FIGURE 2: RGB-AOS DATA... 6 FIGURE 3: INITIAL DATA CONFIGURATION... 6 FIGURE 4: CVTPS2PI INSTRUCTION... 7 FIGURE 5: SHUFFLE EXAMPLE... 7 FIGURE 6: SHIFTING THE DATA... 7 FIGURE 7: FINAL RESULTS... 8 2 Revision 0.5

1.0 Introduction A very basic understanding of the Pentium 4 architecture and instructions is expected for understanding this technique. Information on the Pentium 4 processor including documentation, optimization guide, and reference manuals can be found at: http:developer.intel.com/ids. Datatype color conversions are a common requirement in 3-D application pipelines. In a simple lighting scheme, these conversions happen at least once per color channel, red, green, blue (R, G, B) per vertex, and in a more realistic lighting scheme, such values are often calculated multiple times for different light components. Color conversion from single-precision floating point values to integer values is typically done by using simple casts, for example intvalr = (int)fpvalr; where intval is an integer and fpval a singleprecision floating point number. 2.0 Background & Basics This approach can be less than optimal on an out-of-order execution CPU architecture with a large execution pipeline such as the Pentium 4 processor. The technique described in this paper makes possible an alternative, more architectural friendly approach to using simple high-level-language casts to solve this problem on data sets of specific configurations. The result is a faster method for doing such conversions using the Streaming SIMD Extension (SSE) instructions and their associated registers, in conjunction with the MMX technology registers and instructions. 2.1 A Common Example The following pseudocode applies diffuse and specular lighting to a vertice: 1 D3D MACRO 2 D3DRGB(r, g, b) \ 3 (0xff000000L ( ((long)((r) * 255)) << 16) \ 4 (((long)((g) * 255)) << 8) (long)((b) * 255)) 5 void CLightTransform::Light_x87c() 6 { 7 Set initial ambient RGB 8 Calculate the local diffuse RBB color 9 for( idx=0; idx<=numvertices; idx++ ) 10 { 11 Calculate Diffuse RGB contribution 12 Note this is done per vertex! 13 DiffuseColor.r = 14 DiffuseColor.g = 15 DiffuseColor.b = 16 Assign color to screen vertices 17 ScreenVertices[idx].color = D3DRGB(DiffuseColor.r, DiffuseColor.g, DiffuseColor.b ); 18 Calculate specular RGB contribution 19 Note this is done per vertex! 3

20 SpecularColor.r = 21 SpecularColor.g = 22 SpecularColor.b = 23 Assign color to screen vertices 24 ScreenVertices[idx].specular = D3DRGB(SpecularColor.r, SpecularColor.g, SpecularColor.b ); 25 } 26 } Light_x87c Listing 1 On lines 17 and 24 in Listing 1, we call the D3DRGB macro (as defined on lines 2-4 and by the Microsoft DirectX API interface) to perform our casts and color conversions for both specular and diffuse lighting. Using Microsoft* Visual Studio 6.0 with the Microsoft C++ compiler, these casts each break down to the following simple example: 1 float fpnum; 2 intnum = (long)dy_sum ; 3 fld dword ptr [fpnum] 4 call ftol 5 mov dword ptr [intnum], eax Listing 2 Whereas each call to ftol on line 4 of Listing 2 equates to: 1 ftol: 2 push ebp 3 mov ebp,esp 4 esp,fffffff4 5 wait 6 fnstcw [ebp-02] 7 wait 8 mov ax,word ptr [ebp-02] 9 or ah,0c 10 mov word ptr [ebp-04],ax ; Partial register stall 11 fldcw [ebp-04] ; Instruction stream serialization 12 fistp qword ptr [ebp-0c] 13 fldcw [ebp-02] ; Instruction stream serialization 14 mov eax,dword ptr [ebp-0c] ; memory stall 15 mov edx,dword ptr [ebp-08] 16 leave 17 ret Listing 3 2.2 A Closer Look at the Instruction Stream This instruction stream contains a few issues that constrain the processor from optimally processing the code. These issues are highlighted in the comments contained in Listing 3 above. To summarize: The partial stall occurs on line 10 when a 16-bit register (for example, AX) is read immediately after an 8-bit register (for example, AL, AH) is written, the read is stalled until the write retires. The instruction stream becomes serialized on line 11 as the floating-point control word is loaded with the fldcw instruction. Serializing instructions constrain 4 Revision 0.5

speculative execution by defeating the dynamic execution feature of the processor. Examples of offending instructions include fldcw and cpuid. The memory stall occurs on line 14 because the 32-bit load to eax was preceded by a 16-bit store from ax. See the Pentium 4 processor documentation, optimization guide, and reference manuals at: http:developer.intel.com/ids for full details on specific architectural features and coding pitfalls. The casts performed on each R, G, B channel for diffuse and specular lighting in Listing 1 can be improved. The method described below improves this method in two ways: 1) It avoids the less-optimal code paths and the computational cost associated with such methods. 2) It improves the throughput of the data by using a SIMD method. For example, in the case of the Pentium 4 processor, throughput increases from processing 1 data element at a time to processing 4 data elements at a time. 3.0 The New and Improved Algorithm We ll limit our further discussion to converting the diffuse light components, although the same technique would apply to the specular components. This technique will make two assumptions: 1) Data for SIMD instructions is efficiently organized in a structure of arrays format (SOA), where the vertex data is in a XXXX YYYY ZZZZ WWWW format in the registers or data types, rather than a XYZW XYZW XYZW XYZW, array of structures, (AOS) format. 2) The rendering engine used requires data to be submitted as XYZW, this is a common practice for both hardware accelerators and software rasterizers, and required by most APIs. It is possible to use variations of this technique with different data structures, however that discussion is outside the scope of this document. In its simplest form, the algorithm takes SIMD data (RGB-SOA) in the form: R3 FP R2 FP R1 FP R0 FP G3 FP G2 FP G1 FP G0 FP B3 FP B2 FP B1 FP B0 FP Figure 1: RGB-SOA Data 5

And converts it to integers in the following (RGB-AOS) SIMD format: -- R3 INT G3 INT B3 INT -- R2 INT G2 INT B2 INT -- R1 INT G1 INT B1 INT Using the following steps: -- R0 INT G0 INT B0 INT Figure 2: RGB-AOS Data 1) Convert SIMD-Floating Point data into SIMD-Integer data 2) Shift data left to prepare for packing 3) Combine the integer data in a packed format using logical operations 3.1 Digging a Little Deeper 1) Convert SIMD-FP data into SIMD-INT data. Assume the following components are arranged in the registers as described in figure 5. 127-96 95-64 63-32 31-0 63..32 31..0 xmm2 R3 FP R2 FP R1 FP R0 FP to 000000R1 INT 000000R0 INT mm6 000000R3 INT 000000R2 INT mm7 xmm5 G3 FP G2 FP G1 FP G0 FP to 000000G1 INT 000000G0 INT mm1 000000G3 INT 000000G2 INT mm4 xmm1 B3 FP B2 FP B1 FP B0 FP to 000000B1 INT 000000B0 INT mm2 000000B3 INT 000000B2 INT mm5 Using the following code, Figure 3: initial data configuration cvtps2pi mm6, xmm2 mm6 = (int)r1,(int)r0 shufps xmm2, xmm2, 0xEE r3,r2,r3,r2 cvtps2pi mm7, xmm2 mm7 = (int)r3,(int)r2 cvtps2pi mm1, xmm5 mm1 = (int)g1,(int)g0 shufps xmm5, xmm5, 0xEE g3,g2,g3,g2 cvtps2pi mm4, xmm5 mm4 = (int)g3,(int)g2 cvtps2pi mm2, xmm1 mm2 = (int)b1,(int)b0 shufps xmm1, xmm1, 0xEE b3,b2,b3,b2 cvtps2pi mm5, xmm1 mm5 = (int)b3,(int)b2 The cvtps2pi will convert packed single precision floating-point data to a packed integer representation, the truncate / chop-rounding mode is implicitly encoded in the instruction, thereby taking precedence over the rounding mode specified in the MXCSR register. This can eliminate the need to change the rounding mode from round-nearest, to truncate / chop, and then back to round-nearest to resume computation. 6 Revision 0.5

Figure 4: cvtps2pi instruction We then shuffle the data using the shufps instruction. The shuffle instructions use a one byte immediate, where the high and low order nibbles correspond to the two low and two high elements in the destination register, each two bits are then used to identify which element in the source register to move to that location in the destination register. The best way to understand the functionality is through an example. If we were to take the instruction shufps xmm0, xmm1, 0x2D, we would achieve the result in figure 5. Figure 5: Shuffle Example 2) Shift data left to prepare for packing using the pslld instruction 16, 8, or 0 bits to prepare for packing while shifting in 0 s. We won t be shifting the blue components, as they will simply be OR ed together to achieve the final result. Using the following code: 63..32 31..0 00R1 INT 0000 00R0 INT 0000 mm6 00R3 INT 0000 00R2 INT 0000 mm7 0000G1 INT 00 0000G0 INT 00 mm1 0000G3 INT 00 0000G2 INT 00 mm4 000000B1 INT 000000B0 INT mm2 000000B3 INT 000000B2 INT mm5 Figure 6: Shifting the Data pslld mm6, 0x10 mm6 = r1<<16,r0<<16 pslld mm7, 0x10 mm7 = r3<<16,r2<<16 pslld mm1, 0x08 mm1 = g1<<8,g0<<8 pslld mm4, 0x08 mm4 = g3<<8,g2<<8 7

3) Combine the integer data in a packed format using the logical OR operation por to achieve the final result. Using the following code: 63..32 31..0 00R1 INT G1 INT B1 INT 00R0 INT G0 INT B0 INT mm6 00R3 INT G3 INT B3 INT 00R2 INT G2 INT B2 INT mm7 Figure 7: Final Results por mm6, mm1 bitwise OR red(0,1) and green(0,1) por mm6, mm2 bitwise OR with blue(0,1) result 0,1 por mm7, mm4 bitwise OR red(2,3) and green(2,3) por mm7, mm5 bitwise OR with blue(2,3) result 2,3 4.0 Wrapping It Up This description provides the basic understanding of a technique of combining alternative instruction streams to improve algorithm performance when using specific data arrangement. Numerous assumptions were made, and variations and improvements are possible that can be adapted to best fit your data structures and performance requirements. A complete and commented function source listing is available in Appendix A. 8 Revision 0.5

Appendix A: Example Source ----------------------------------------------------------------------------- SOAtoAOS - SOA to AOS with SSE and MMX Technology ASM. This routine gets its input as SOA. In each iteration it transposes four vertices into AOS format (xyz) and puts the info back in m_pscreenvertices to be drawn with D3D, we're using MMX code to do the color conversion rather than just calling D3DMAKE. The code below should replace the following C code. for (i=0, j=0; i<numvecs; i++, j+=vectorsize) { psx = (float*)&(qscreenvertices.sx[i]); psy = (float*)&(qscreenvertices.sy[i]); psz = (float*)&(qscreenvertices.sz[i]); prhw = (float*)&(qscreenvertices.rhw[i]); pdiffuser = (float*)&(qscreenvertices.diffuser[i]); pdiffuseg = (float*)&(qscreenvertices.diffuseg[i]); pdiffuseb = (float*)&(qscreenvertices.diffuseb[i]); pspecularr = (float*)&(qscreenvertices.specularr[i]); pspecularg = (float*)&(qscreenvertices.specularg[i]); pspecularb = (float*)&(qscreenvertices.specularb[i]); m_pscreenvertices[j].sx = psx[0]; m_pscreenvertices[j+1].sx = psx[1]; m_pscreenvertices[j+2].sx = psx[2]; m_pscreenvertices[j+3].sx = psx[3]; m_pscreenvertices[j].sy = psy[0]; m_pscreenvertices[j+1].sy = psy[1]; m_pscreenvertices[j+2].sy = psy[2]; m_pscreenvertices[j+3].sy = psy[3]; m_pscreenvertices[j].sz = psz[0]; m_pscreenvertices[j+1].sz = psz[1]; m_pscreenvertices[j+2].sz = psz[2]; m_pscreenvertices[j+3].sz = psz[3]; m_pscreenvertices[j].rhw = prhw[0]; m_pscreenvertices[j+1].rhw = prhw[1]; m_pscreenvertices[j+2].rhw = prhw[2]; m_pscreenvertices[j+3].rhw = prhw[3]; m_pscreenvertices[j].color = D3DRGB(pdiffuseR[0], pdiffuseg[0], diffuseb[0]); m_pscreenvertices[j].specular = D3DRGB(pspecularR[0], pspecularg[0], pspecularb[0]); m_pscreenvertices[j+1].color = D3DRGB(pdiffuseR[1], pdiffuseg[1], pdiffuseb[1]); m_pscreenvertices[j+1].specular = D3DRGB(pspecularR[1], pspecularg[1], pspecularb[1]); m_pscreenvertices[j+2].color = D3DRGB(pdiffuseR[2], pdiffuseg[2], pdiffuseb[2]); m_pscreenvertices[j+2].specular = D3DRGB(pspecularR[2], pspecularg[2], pspecularb[2]); m_pscreenvertices[j+3].color = D3DRGB(pdiffuseR[3], pdiffuseg[3], pdiffuseb[3]); m_pscreenvertices[j+3].specular = D3DRGB(pspecularR[3], pspecularg[3], pspecularb[3]); } "unpckhps xmm1, xmm2" works like this (and low is just the opposite): ------------- ------------- XMM1 x4 x3 x2 x1 XMM2 y4 y3 y2 y1 ------------- ------------- ------------- XMM1 y4 x4 y3 x3 ------------- "shufps xmm1, xmm2, 0x44" which selects bits 0-31 & 32-63 9

------------- ------------- XMM1 x4 x3 x2 x1 XMM2 y4 y3 y2 y1 ------------- ------------- ------------- XMM1 x2 x1 y2 y1 ------------- Inputs: pvertex - pointer to the vertices numvecs - number of vectors Outputs: m_pscreenvertices - the transformed, lit vertices Return Value: none ----------------------------------------------------------------------------- inline void SOAtoAOS( D3DVERTEX *pvertex, int numvecs ) { int i, j; m128 c = _mm_set_ps1(255.0f); float *pin, *pout; int idx = m_size*4; for (i=0, j=0; i<numvecs; i++, j+=vectorsize) { pin = ((float*)&(qscreenvertices.sx[i])); pout = ((float*)&(m_pscreenvertices[j].sx)); int offsetfrombasesoa = j*4; Used as an offset to arrive at the correct memory _asm { mov eax,idx number of vectors per point (e.g. x) into eax mov edx, pin ress start of our SOA data xmm7,[edx]; x4,x3,x2,x1 xmm4,xmm7 copy x4,x3,x2,x1 unpcklps xmm7,[edx] y2,x2,y1,x1 unpckhps xmm4,[edx] y4,x4,y3,x3 Data of ress [edx] already in cac xmm6,[edx] z4,z3,z2,z1 xmm2, xmm6 copy z4,z3,z2,z1 unpcklps xmm6,[edx] rhw2,z2,rhw1,z1 unpckhps xmm2,[edx] rhw4,z4,rhw3,z3 xmm3, xmm7 copy y2,x2,y1,x1 shufps xmm7,xmm6,0x44 rhw1,z1,y1,x1 Got our 1st result! shufps xmm3,xmm6,0xee rhw2,z2,y2,x2 Got our 2nd result! xmm6,xmm4 copy y4,x4,y3,x3 shufps xmm4,xmm2,0x44 rhw3,z3,y3,x3 Got our 3rd result! shufps xmm6,xmm2,0xee rhw4,z4,y4,x4 Got our 4th result! pointer to diffuser xmm1, [c] xmm2, [edx] mulps xmm2, xmm1 diffuser r4,r3,r2,r1 * 255.0f xmm5, [edx] mulps xmm5, xmm1 diffuseg g4,g3,g2,g1 * 255.0f mulps xmm1, [edx] diffuseb b4,b3,b2,b1 * 255.0f cvtps2pi mm0, xmm2 mm0 = (int)r2,(int)r1 shufps xmm2, xmm2, 0xEE r4,r3,r4,r3 cvtps2pi mm3, xmm2 mm3 = (int)r4,(int)r3 cvtps2pi mm1, xmm5 mm1 = (int)g2,(int)g1 shufps xmm5, xmm5, 0xEE g4,g3,g4,g3 cvtps2pi mm4, xmm5 mm4 = (int)g4,(int)g3 cvtps2pi mm2, xmm1 mm2 = (int)b2,(int)b1 shufps xmm1, xmm1, 0xEE b4,b3,b4,b3 cvtps2pi mm5, xmm1 mm5 = (int)b4,(int)b3 10 Revision 0.5

pslld mm0, 0x10 mm0 = r2<<16,r1<<16 pslld mm3, 0x10 mm3 = r4<<16,r3<<16 pslld mm1, 0x08 mm1 = g2<<8,g1<<8 pslld mm4, 0x08 mm4 = g4<<8,g3<<8 por mm0, mm1 bitwise OR red(1,2) and green(1,2) por mm0, mm2 bitwise OR with blue(1,2) result 1,2 por mm3, mm4 bitwise OR red(3,4) and green(3,4) por mm3, mm5 bitwise OR with blue(3,4) result 3,4 pointer to specularr xmm1, [c] xmm2, [edx] mulps xmm2, xmm1 specularr r4,r3,r2,r1 * 255.0f xmm5, [edx] mulps xmm5, xmm1 specularg g4,g3,g2,g1 * 255.0f mulps xmm1, [edx] specularb b4,b3,b2,b1 * 255.0f cvtps2pi mm6, xmm2 mm6 = (int)r2,(int)r1 shufps xmm2, xmm2, 0xEE r4,r3,r4,r3 cvtps2pi mm7, xmm2 mm7 = (int)r4,(int)r3 cvtps2pi mm1, xmm5 mm1 = (int)g2,(int)g1 shufps xmm5, xmm5, 0xEE g4,g3,g4,g3 cvtps2pi mm4, xmm5 mm4 = (int)g4,(int)g3 cvtps2pi mm2, xmm1 mm2 = (int)b2,(int)b1 shufps xmm1, xmm1, 0xEE b4,b3,b4,b3 cvtps2pi mm5, xmm1 mm5 = (int)b4,(int)b3 pslld mm6, 0x10 mm6 = r2<<16,r1<<16 pslld mm7, 0x10 mm7 = r4<<16,r3<<16 pslld mm1, 0x08 mm1 = g2<<8,g1<<8 pslld mm4, 0x08 mm4 = g4<<8,g3<<8 por mm6, mm1 bitwise OR red(1,2) and green(1,2) por mm6, mm2 bitwise OR with blue(1,2) result 1,2 por mm7, mm4 bitwise OR red(3,4) and green(3,4) por mm7, mm5 bitwise OR with blue(3,4) result 3,4 movq mm1, mm0 copy diffuse1, diffuse2 punpckldq mm0, mm6 interleave mm0 = specular1, diffuse1 punpckhdq mm1, mm6 interleave mm1 = specular2, diffuse2 movq mm2, mm3 copy diffuse4, diffuse3 punpckldq mm3, mm7 interleave mm3 = specular3, diffuse3 punpckhdq mm2, mm7 interleave mm1 = specular4, diffuse4 mov edx, pout Move the base ress of the memory where SOA vertices will be stored into edx } EMMS } SOAtoAOS #endif } Write final values to new SOA memory segment movups [edx], xmm7 Write out vertex movq [edx+0x10], mm0 Write out diffuse and specular color 1 movups [edx+0x20], xmm3 movq [edx+0x30], mm1 Write out diffuse and specular color 2 movups [edx+0x40], xmm4 movq [edx+0x50], mm3 Write out diffuse and specular color 3 movups [edx+0x60], xmm6 movq [edx+0x70], mm2 Write out diffuse and specular color 4 emms instruction macro 11