CS802 Parallel Processing
Class Notes: MMX Technology
Instructor: Dr. Chang N. Zhang
Winter Semester, 2006

Intel MMX(TM) Technology

Chapter 1: Introduction to MMX Technology

1.1 Features of the MMX Technology

- MMX technology accelerates multimedia and communications applications by adding new instructions and defining new 64-bit data types.
- MMX technology introduces new general-purpose instructions. These instructions operate in parallel on multiple data elements packed into 64-bit quantities. They accelerate applications with compute-intensive algorithms that perform localized, recurring operations on small native data. Such applications include motion video, combined graphics with video, image processing, audio synthesis, speech synthesis and compression, telephony, video conferencing, 2D graphics, and 3D graphics.
- Single Instruction, Multiple Data (SIMD) technique. MMX technology uses the SIMD technique to speed up software by processing multiple data elements in parallel with a single instruction. It supports parallel operations on byte, word, and doubleword data elements, and on the new quadword (64-bit) integer data type.
- 57 new instructions
- Eight 64-bit wide MMX registers (MM0 to MM7)
- Four new data types

1.2 Advantages of the MMX Technology

- SIMD provides parallelism (up to eight byte operations per instruction), which can greatly increase performance on the PC platform.
- MMX technology is integrated into Intel Architecture (IA) processors and is fully compatible with existing operating systems.
- Existing IA software runs on MMX technology-enabled systems.
- MMX technology can be used in applications, algorithms, and drivers.

Chapter 2: MMX New Data Types & MMX Registers

2.1 MMX New Data Types

The principal data type of the IA MMX technology is the packed fixed-point integer. The decimal point of the fixed-point values is implicit and is left for the user to control for maximum flexibility.

The IA MMX technology defines the following four new 64-bit quantities:
(1) Packed byte: eight bytes packed into one 64-bit quantity
(2) Packed word: four words packed into one 64-bit quantity
(3) Packed doubleword: two doublewords packed into one 64-bit quantity
(4) Quadword: one 64-bit quantity

2.2 MMX Registers

The IA MMX technology provides eight 64-bit, general-purpose registers. The registers are aliased on the floating-point registers. Because of this aliasing, the operating system saves and restores MMX state exactly as it handles floating-point state.

The MMX registers can hold packed 64-bit data types. The MMX instructions access the MMX registers directly using the register names MM0 to MM7.

The MMX registers can be used to perform calculations on data. They cannot be used to address memory; addressing is accomplished by using the integer registers and standard IA addressing modes.
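
The four data types of section 2.1 can be pictured as four views of the same 64 bits. The following is a minimal sketch in portable C, not Intel's definition: the names mmx_value and pack_words are illustrative assumptions, and the union layout matches MMX lane order only on a little-endian machine such as IA.

```c
#include <stdint.h>

/* Hypothetical illustration type: four views of one 64-bit quantity. */
typedef union {
    uint8_t  b[8];  /* packed byte:       eight 8-bit elements */
    uint16_t w[4];  /* packed word:       four 16-bit elements */
    uint32_t d[2];  /* packed doubleword: two 32-bit elements  */
    uint64_t q;     /* quadword:          one 64-bit element   */
} mmx_value;

/* Build a packed-word quantity; w0 occupies bits 15..0, w3 bits 63..48. */
uint64_t pack_words(uint16_t w0, uint16_t w1, uint16_t w2, uint16_t w3)
{
    return (uint64_t)w0
         | ((uint64_t)w1 << 16)
         | ((uint64_t)w2 << 32)
         | ((uint64_t)w3 << 48);
}
```

For example, pack_words(1, 2, 3, 4) yields the quadword 0004000300020001h, with word 1 in the least significant position.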

Chapter 3: MMX Instructions (Total 57) Overview

3.1 Types of Instructions

- arithmetic: add, subtract, multiply, arithmetic shift, and multiply-add
- comparison: compare for equality, compare for greater than
- logic: AND, AND NOT, OR, and XOR
- shift: shift left logical, shift right logical, shift right arithmetic
- conversion: pack, unpack
- data transfer: MOVD, MOVQ
- EMMS: empty MMX state

3.2 MMX Instructions: Syntax

A typical MMX instruction name is built from:
-- Prefix: P for Packed
-- Instruction operation: for example, ADD, CMP, XOR
-- Suffix:
   US for unsigned saturation
   S for signed saturation
   B, W, D, Q for the data type (byte, word, doubleword, quadword)

Example: PADDUSW is Packed Add Unsigned with Saturation for Word.

3.3 MMX Instructions: Format

For data transfer instructions:
-- destination and source operands can reside in memory, integer registers, or MMX registers

For all other IA MMX instructions:
-- destination operand: MMX register
-- source operand: MMX register, memory, or immediate operand

3.4 MMX Instructions: Conventions

- source operand: written on the right
- destination operand: written on the left
  e.g. PSLLW mm, mm/m64
- memory address: refers to the least significant byte of the data

3.5 MMX Instructions: Conventions

Wrap around: on overflow or underflow the result is truncated, and only the lower (least significant) bits are returned. The carry is ignored.

Saturation: on overflow or underflow the result is clipped (saturated) to a data-range limit for the data type.

                   lower limit    upper limit
    signed byte    80H            7FH
    signed word    8000H          7FFFH
    unsigned byte  00H            FFH
    unsigned word  0000H          FFFFH

e.g. for unsigned bytes, E5H + 62H = 147H, which does not fit in 8 bits:
    E5H + 62H = FFH (saturation)
    E5H + 62H = 47H (wrap around)
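
The two overflow conventions can be sketched for a single element in portable C. This is only an illustration of the arithmetic, not MMX code; the function names add_wrap_u8, add_sat_u8, and add_sat_s8 are assumptions made for this example.

```c
#include <stdint.h>

/* Wrap around: keep only the low 8 bits of the sum, ignoring the carry. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Unsigned saturation: clip the sum to the upper limit FFh. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 0xFFu ? 0xFFu : sum);
}

/* Signed saturation: clip the sum to the range 80h..7Fh (-128..127). */
int8_t add_sat_s8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) sum = INT8_MAX;
    if (sum < INT8_MIN) sum = INT8_MIN;
    return (int8_t)sum;
}
```

With the values from the example above, add_sat_u8(0xE5, 0x62) gives FFh and add_wrap_u8(0xE5, 0x62) gives 47h.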

Chapter 4: MMX Instructions

4.1 Arithmetic (PADD, Wrap Around)

PADDB mm, mm/m64
Operation:
    mm(7..0)   <- mm(7..0)   + mm/m64(7..0)
    mm(15..8)  <- mm(15..8)  + mm/m64(15..8)
    ...
    mm(63..56) <- mm(63..56) + mm/m64(63..56)

PADDW mm, mm/m64
Operation:
    mm(15..0)  <- mm(15..0)  + mm/m64(15..0)
    mm(31..16) <- mm(31..16) + mm/m64(31..16)
    ...
    mm(63..48) <- mm(63..48) + mm/m64(63..48)

PADDD mm, mm/m64
Operation:
    mm(31..0)  <- mm(31..0)  + mm/m64(31..0)
    mm(63..32) <- mm(63..32) + mm/m64(63..32)

4.2 Arithmetic (PADD, Saturation)

PADDSB mm, mm/m64
Operation:
    mm(7..0)   <- SaturateToSignedByte(mm(7..0)   + mm/m64(7..0))
    mm(15..8)  <- SaturateToSignedByte(mm(15..8)  + mm/m64(15..8))
    ...
    mm(63..56) <- SaturateToSignedByte(mm(63..56) + mm/m64(63..56))

PADDSW mm, mm/m64
Operation:
    mm(15..0)  <- SaturateToSignedWord(mm(15..0)  + mm/m64(15..0))
    mm(31..16) <- SaturateToSignedWord(mm(31..16) + mm/m64(31..16))
    ...
    mm(63..48) <- SaturateToSignedWord(mm(63..48) + mm/m64(63..48))
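
The lane-by-lane behaviour of the packed adds can be emulated in portable C on a 64-bit integer. This is a sketch for study purposes under the assumption of a two's complement machine; the functions paddb and paddsw are hypothetical emulations named after the instructions, not calls into any Intel library.

```c
#include <stdint.h>

/* Emulates PADDB: eight independent byte adds with wrap around.
   No carry propagates from one byte lane into the next. */
uint64_t paddb(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        int shift = lane * 8;
        uint8_t a = (uint8_t)(dst >> shift);
        uint8_t b = (uint8_t)(src >> shift);
        uint8_t sum = (uint8_t)(a + b);
        result |= (uint64_t)sum << shift;
    }
    return result;
}

/* Emulates PADDSW: four independent word adds with signed saturation. */
uint64_t paddsw(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = lane * 16;
        int32_t a = (int16_t)(uint16_t)(dst >> shift);
        int32_t b = (int16_t)(uint16_t)(src >> shift);
        int32_t sum = a + b;
        if (sum > INT16_MAX) sum = INT16_MAX;   /* clip to 7FFFh */
        if (sum < INT16_MIN) sum = INT16_MIN;   /* clip to 8000h */
        result |= (uint64_t)(uint16_t)sum << shift;
    }
    return result;
}
```

The loop makes explicit what the hardware does in one instruction: each lane is added in isolation, so an overflow in one element never disturbs its neighbour.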

4.3 Arithmetic

Packed Add Unsigned with Saturation
--- PADDUSB mm, mm/m64
--- PADDUSW mm, mm/m64

Subtraction:
--- PSUB[B,W,D] mm, mm/m64 (wrap around)
--- PSUBS[B,W] mm, mm/m64 (signed saturation)
--- PSUBUS[B,W] mm, mm/m64 (unsigned saturation)

4.4 Arithmetic

Packed Multiply and Add
--- PMADDWD mm, mm/m64
Multiply the packed words in the MMX register by the packed words in the MMX register/memory. Add the 32-bit results pairwise and store them in the MMX register as doublewords.

Packed Multiply High
--- PMULHW mm, mm/m64
Multiply the signed packed words in the MMX register by the signed packed words in the MMX register/memory, then store the high-order 16 bits of each result in the MMX register.
    mm(15..0)  <- (mm(15..0)  * mm/m64(15..0))(31..16);
    mm(31..16) <- (mm(31..16) * mm/m64(31..16))(31..16);
    mm(47..32) <- (mm(47..32) * mm/m64(47..32))(31..16);
    mm(63..48) <- (mm(63..48) * mm/m64(63..48))(31..16);

Packed Multiply Low
--- PMULLW mm, mm/m64
Multiply the signed packed words in the MMX register by the signed packed words in the MMX register/memory, then store the low-order 16 bits of each result in the MMX register.
    mm(15..0)  <- (mm(15..0)  * mm/m64(15..0))(15..0);
    mm(31..16) <- (mm(31..16) * mm/m64(31..16))(15..0);
    mm(47..32) <- (mm(47..32) * mm/m64(47..32))(15..0);
    mm(63..48) <- (mm(63..48) * mm/m64(63..48))(15..0);

4.5 Comparison

Packed Compare for Equality [byte, word, doubleword]
--- PCMPEQB mm, mm/m64   each byte of the result is 0xff or 0
--- PCMPEQW mm, mm/m64   each word of the result is 0xffff or 0
--- PCMPEQD mm, mm/m64   each doubleword of the result is 0xffffffff or 0

Packed Compare for Greater Than (signed)
--- PCMPGT[B,W,D] mm, mm/m64

4.6 Logic

Bit-wise Logical Exclusive OR
--- PXOR mm, mm/m64     mm <- mm XOR mm/m64
Bit-wise Logical AND
--- PAND mm, mm/m64     mm <- mm AND mm/m64
Bit-wise Logical AND NOT
--- PANDN mm, mm/m64    mm <- (NOT mm) AND mm/m64
Bit-wise Logical OR
--- POR mm, mm/m64      mm <- mm OR mm/m64

4.7 Shift

Packed shift left logical (shifting in zeros)
--- PSLL[W,D,Q] mm, mm/m64/imm8
Packed shift right logical (shifting in zeros)
--- PSRL[W,D,Q] mm, mm/m64/imm8
Packed shift right arithmetic (shifting in sign bits)
--- PSRA[W,D] mm, mm/m64/imm8
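
The compare instructions produce all-ones or all-zeros masks, which combine with the logic instructions of section 4.6 to select data without branches. The sketch below emulates this in portable C to compute a per-lane signed maximum. The helper names pcmpgtw and max_words are assumptions for illustration, and the code assumes a two's complement machine.

```c
#include <stdint.h>

/* Emulates PCMPGTW: each word lane becomes FFFFh if dst > src
   (signed comparison), otherwise 0000h. */
uint64_t pcmpgtw(uint64_t dst, uint64_t src)
{
    uint64_t mask = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = lane * 16;
        int16_t a = (int16_t)(uint16_t)(dst >> shift);
        int16_t b = (int16_t)(uint16_t)(src >> shift);
        if (a > b)
            mask |= (uint64_t)0xFFFF << shift;
    }
    return mask;
}

/* Branch-free per-lane maximum of four signed words. */
uint64_t max_words(uint64_t a, uint64_t b)
{
    uint64_t mask   = pcmpgtw(a, b);  /* lanes where a > b          */
    uint64_t keep_a = a & mask;       /* PAND:  a where mask is set */
    uint64_t keep_b = ~mask & b;      /* PANDN: b where mask clear  */
    return keep_a | keep_b;           /* POR:   merge both halves   */
}
```

The same mask-and-merge pattern covers many conditional operations on packed data, because the instruction set has no per-lane branch.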

4.8 Conversion

Pack with unsigned saturation
--- PACKUSWB mm, mm/m64
Pack and saturate signed words from the MMX register and the MMX register/memory into unsigned bytes in the MMX register.
    mm(7..0)   <- SaturateSignedWordToUnsignedByte mm(15..0);
    mm(15..8)  <- SaturateSignedWordToUnsignedByte mm(31..16);
    mm(23..16) <- SaturateSignedWordToUnsignedByte mm(47..32);
    mm(31..24) <- SaturateSignedWordToUnsignedByte mm(63..48);
    mm(39..32) <- SaturateSignedWordToUnsignedByte mm/m64(15..0);
    mm(47..40) <- SaturateSignedWordToUnsignedByte mm/m64(31..16);
    mm(55..48) <- SaturateSignedWordToUnsignedByte mm/m64(47..32);
    mm(63..56) <- SaturateSignedWordToUnsignedByte mm/m64(63..48);

Pack with signed saturation
--- PACKSSWB mm, mm/m64
Pack and saturate signed words from the MMX register and the MMX register/memory into signed bytes in the MMX register.
    mm(7..0)   <- SaturateSignedWordToSignedByte mm(15..0);
    mm(15..8)  <- SaturateSignedWordToSignedByte mm(31..16);
    mm(23..16) <- SaturateSignedWordToSignedByte mm(47..32);
    mm(31..24) <- SaturateSignedWordToSignedByte mm(63..48);
    mm(39..32) <- SaturateSignedWordToSignedByte mm/m64(15..0);
    mm(47..40) <- SaturateSignedWordToSignedByte mm/m64(31..16);
    mm(55..48) <- SaturateSignedWordToSignedByte mm/m64(47..32);
    mm(63..56) <- SaturateSignedWordToSignedByte mm/m64(63..48);

Pack with signed saturation
--- PACKSSDW mm, mm/m64
Pack and saturate signed doublewords from the MMX register and the MMX register/memory into signed words in the MMX register.
    mm(15..0)  <- SaturateSignedDwordToSignedWord mm(31..0);
    mm(31..16) <- SaturateSignedDwordToSignedWord mm(63..32);
    mm(47..32) <- SaturateSignedDwordToSignedWord mm/m64(31..0);
    mm(63..48) <- SaturateSignedDwordToSignedWord mm/m64(63..32);

Unpack High Packed Data
--- PUNPCKH[BW, WD, DQ] mm, mm/m64
Unpack and interleave the high-order data elements of the destination and source operands into the destination operand. The low-order elements are ignored.
E.g. PUNPCKHWD
    mm(63..48) <- mm/m64(63..48);
    mm(47..32) <- mm(63..48);
    mm(31..16) <- mm/m64(47..32);
    mm(15..0)  <- mm(47..32);

Unpack Low Packed Data
--- PUNPCKL[BW, WD, DQ] mm, mm/m64
Unpack and interleave the low-order data elements of the destination and source operands into the destination operand. The high-order elements are ignored.
E.g. PUNPCKLWD
    mm(63..48) <- mm/m64(31..16);
    mm(47..32) <- mm(31..16);
    mm(31..16) <- mm/m64(15..0);
    mm(15..0)  <- mm(15..0);
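
A common use of the conversion instructions is to widen bytes to words before arithmetic (unpack against zero) and narrow the results back afterwards (pack with saturation). The portable C sketch below emulates PUNPCKLBW and PACKUSWB for that round trip. The function names are hypothetical emulations, and sat_word_to_u8 is a helper introduced only for this example.

```c
#include <stdint.h>

/* Clip a signed word to the unsigned byte range 00h..FFh. */
uint8_t sat_word_to_u8(int16_t value)
{
    if (value < 0)   return 0;
    if (value > 255) return 255;
    return (uint8_t)value;
}

/* Emulates PUNPCKLBW: interleave the low four bytes of dst and src.
   dst bytes land in the even byte positions, src bytes in the odd ones. */
uint64_t punpcklbw(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t d = (dst >> (8 * i)) & 0xFF;
        uint64_t s = (src >> (8 * i)) & 0xFF;
        result |= d << (16 * i);
        result |= s << (16 * i + 8);
    }
    return result;
}

/* Emulates PACKUSWB: four words of dst fill the low four bytes,
   four words of src fill the high four bytes, each saturated. */
uint64_t packuswb(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        int16_t d = (int16_t)(uint16_t)(dst >> (16 * i));
        int16_t s = (int16_t)(uint16_t)(src >> (16 * i));
        result |= (uint64_t)sat_word_to_u8(d) << (8 * i);
        result |= (uint64_t)sat_word_to_u8(s) << (8 * i + 32);
    }
    return result;
}
```

Unpacking against a zero source zero-extends each byte to a word, which leaves headroom for intermediate sums; packing then clips any result outside 0..255.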

4.9 Data Transfer

Move 32 bits
--- MOVD mm, r/m32
Move 32 bits from an integer register/memory to an MMX register.
    mm(63..0) <- ZeroExtend(r/m32);

Move 32 bits
--- MOVD r/m32, mm
Move 32 bits from an MMX register to an integer register/memory.
    r/m32 <- mm(31..0);

Move 64 bits
--- MOVQ mm, mm/m64
Move 64 bits from an MMX register/memory to an MMX register.
    mm <- mm/m64;
--- MOVQ mm/m64, mm
Move 64 bits from an MMX register to an MMX register/memory.
    mm/m64 <- mm;

4.10 Instruction Samples

e.g.
    MOVD  MM0, EAX      ; MM0 = EAX, zero-extended to 64 bits
    PSLLQ MM0, 32       ; move EAX into the upper doubleword
    MOVD  MM1, EBX      ; MM1 = EBX, zero-extended to 64 bits
    POR   MM0, MM1      ; MM0 = EAX:EBX

    MOVQ  MM2, MM3      ; MM2 = copy of MM3
    PSLLQ MM3, 1        ; MM3 = MM3 shifted left by one bit
    PXOR  MM3, MM2      ; MM3 = (MM3 << 1) XOR original MM3
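
The two instruction samples in section 4.10 can be traced step by step in portable C, with one 64-bit variable standing in for each MMX register. This is a sketch for checking the expected values by hand; the function names combine_halves and xor_with_shifted are assumptions.

```c
#include <stdint.h>

/* MOVD MM0, EAX; PSLLQ MM0, 32; MOVD MM1, EBX; POR MM0, MM1 */
uint64_t combine_halves(uint32_t eax, uint32_t ebx)
{
    uint64_t mm0 = eax;   /* MOVD zero-extends EAX to 64 bits   */
    mm0 <<= 32;           /* PSLLQ MM0, 32                      */
    uint64_t mm1 = ebx;   /* MOVD zero-extends EBX to 64 bits   */
    return mm0 | mm1;     /* POR: EAX in the high half, EBX low */
}

/* MOVQ MM2, MM3; PSLLQ MM3, 1; PXOR MM3, MM2 */
uint64_t xor_with_shifted(uint64_t mm3)
{
    uint64_t mm2 = mm3;   /* MOVQ keeps a copy of the input     */
    mm3 <<= 1;            /* PSLLQ MM3, 1 (top bit is lost)     */
    return mm3 ^ mm2;     /* PXOR combines shifted and original */
}
```

For instance, combine_halves(0x12345678, 0x9ABCDEF0) gives 123456789ABCDEF0h.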

Chapter 5: MMX Code Optimization

5.1 Code Optimization Guidelines

- Use a current-generation compiler.
- Do not intermix MMX instructions and FP instructions.
- Use the opcode reg, mem instruction format whenever possible.
- Put an EMMS instruction at the end of all MMX code sections that will transition to FP code.
- Optimize data cache bandwidth to the MMX registers.

5.2 Accessing Memory

On Pentium II and Pentium III processors:
-- opcode reg, mem (2 micro-ops)
-- opcode reg, reg (1 micro-op)

Recommended: merge loads whenever the same address is used more than once (memory-bound code).
Recommended: merge loads whenever the same address is used more than twice (code that is not memory-bound).

Where possible, change
    MOVQ reg, reg and opcode reg, mem
to
    MOVQ reg, mem and opcode reg, reg
to save one micro-op.

Chapter 6: Programming Tools and Examples

6.1 Programming Tools

- MASM 6.11 or above, with the 6.14 patch (install ML614.exe).
- VC++ 6.0 can compile MMX instructions. Key functions are written in assembly language, including MMX instructions.
- Some C/C++ compilers also include the MASM tool.

6.2 Programming Examples

    // Name: cpu_test.c
    // Purpose: to test some MMX instructions
    // Note: uses MSVC-style inline assembly (__asm) for a 32-bit x86 target

    #include <stdio.h>

    // Test whether the CPU is MMX compatible.
    int cpu_test(void);

    // Shift x left by 16 bits and append the low 8 bits of y; return the result.
    unsigned int MMX_test(unsigned int x, unsigned int y);

    // Main function for the program
    int main(void)
    {
        unsigned int x = 0x12345678;
        unsigned int y = 0x99999999;
        int found_mmx = cpu_test();

        if (found_mmx == 1)
            printf("this CPU supports MMX technology\n");
        else {
            printf("this CPU does NOT support MMX technology\n");
            return 1;   // MMX instructions would fault on this CPU
        }

        // test the MMX instructions
        printf("the original value of x is 0x%x\n", x);
        printf("the original value of y is 0x%x\n", y);
        x = MMX_test(x, y);
        printf("after left shifting x by 16 bits and appending the\n");
        printf("low 8 bits of y, value of x is 0x%x\n", x);
        return 0;
    }

    // Function Name: cpu_test
    // Return: 1 if the CPU supports MMX, otherwise 2
    int cpu_test(void)
    {
        int result;
        __asm {
            mov eax, 1              ; CPUID function 1: feature flags
            cpuid
            test edx, 00800000h     ; EDX bit 23 is the MMX flag
            jnz found
            mov eax, 2              ; MMX not supported
            jmp done
        found:
            mov eax, 1              ; MMX supported
        done:
            mov result, eax         ; no EMMS here: no MMX register was used
        }
        return result;
    }

    // Function Name: MMX_test
    // Parameters: two unsigned integers x and y
    // Purpose: to test some MMX instructions
    // Return: x shifted left by 16 bits, ORed with the low 8 bits of y
    unsigned int MMX_test(unsigned int x, unsigned int y)
    {
        unsigned int result;
        __asm {
            mov eax, x
            mov ebx, y
            movd mm0, eax           ; mm0 = x, zero-extended
            mov eax, 0xff
            movd mm2, eax           ; mm2 = mask for the low 8 bits
            psllq mm0, 16           ; mm0 = x << 16
            movd mm1, ebx           ; mm1 = y
            pand mm1, mm2           ; mm1 = y AND 0xff
            por mm0, mm1            ; mm0 = (x << 16) OR (y AND 0xff)
            movd eax, mm0           ; take the low 32 bits
            emms                    ; leave the MMX state before returning
            mov result, eax
        }
        return result;
    }

Results:

    this CPU supports MMX technology
    the original value of x is 0x12345678
    the original value of y is 0x99999999
    after left shifting x by 16 bits and appending the
    low 8 bits of y, value of x is 0x56780099

Reference

1. MMX Technology Programmer's Reference Manual
2. MMX Technology Technical Overview
3. Intel Architecture Optimization Reference Manual
4. Intel Architecture Software Developer's Manual, Volume 1
5. Intel Architecture Software Developer's Manual, Volume 2
6. Intel Architecture Software Developer's Manual, Volume 3
7. http://www.intel.com/

Appendix: MMX Instructions Sheet