Intel's MMX
Dr. Richard Enbody, CSE 820

Why MMX?
- Make the common case fast.
- Multimedia and communication applications consume significant computing resources.
- Providing specific hardware support makes sense.
Goals
- Accelerate multimedia and communications applications.
- Maintain full compatibility with existing operating systems and applications.
- Exploit the inherent parallelism in multimedia and communication algorithms.
- Include new instructions and data types to improve performance.

First Step: Examine Code
- Examined a wide range of applications: graphics, MPEG video, music synthesis, speech compression, speech recognition, image processing, games, and video conferencing.
- Identified and analyzed the most compute-intensive routines.
Common Characteristics
- Small integer data types, e.g. 8-bit pixels and 16-bit audio samples
- Small, highly repetitive loops
- Frequent multiply-and-accumulate
- Compute-intensive algorithms
- Highly parallel operations

MMX Technology
A set of basic, general-purpose integer instructions:
- Single Instruction, Multiple Data (SIMD)
- 57 new instructions
- Eight 64-bit wide MMX registers
- Four new data types
Data Types
[Figures: data type diagrams, shown on two slides]
Example
- Pixels are generally 8-bit integers.
- Pack eight pixels into a 64-bit MMX register.
- An MMX instruction takes all eight pixels at once from the MMX register, performs the arithmetic or logical operation on all eight elements in parallel, and writes the result into an MMX register.

Compatibility
- No new exceptions or states are added.
- The MMX registers are aliased to the existing floating-point registers. When an MMX instruction writes a register, the exponent field of the corresponding floating-point register (bits 64-78) and the sign bit (bit 79) are set to ones, making the value in the register a NaN (Not a Number) or infinity when viewed as a floating-point value.
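The pixel-packing example above can be sketched in portable C. This is only an illustration of the idea, not MMX code: a `uint64_t` stands in for an MMX register, and the function names (`pack8_pixels`, `unpack8_pixels`, `invert8_pixels`) are hypothetical. Placing pixel 0 in the low byte is also an assumption made for the sketch.

```c
#include <stdint.h>

/* Pack eight 8-bit pixels into one 64-bit value (pixel 0 in the low byte). */
uint64_t pack8_pixels(const uint8_t px[8]) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        r |= (uint64_t)px[i] << (8 * i);
    return r;
}

/* Split a 64-bit value back into its eight 8-bit pixels. */
void unpack8_pixels(uint64_t v, uint8_t px[8]) {
    for (int i = 0; i < 8; i++)
        px[i] = (uint8_t)(v >> (8 * i));
}

/* One 64-bit logical operation inverts all eight pixels at once. */
uint64_t invert8_pixels(uint64_t v) {
    return ~v;
}
```

A logical operation is used here because it never carries between bytes, so a single 64-bit operation really does process all eight pixels in parallel.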
57 Instructions
- Basic arithmetic: add, subtract, multiply, arithmetic shift, and multiply-add
- Comparison
- Conversion: pack and unpack
- Logical
- Shift
- Move: register-to-register
- Load/Store: 64-bit and 32-bit
Packed Add Word with Wraparound
- Each addition is independent.
- The rightmost addition overflows and wraps around.

Saturation
- If an addition results in overflow or underflow, the result is clamped to the largest or smallest representable value.
- This is important for pixel calculations, where it prevents a wraparound add from causing a black pixel to suddenly turn white.
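As a rough illustration of the wraparound packed add described above, the following C sketch emulates the per-lane behavior with a scalar loop. The name `padd_word_wrap` is hypothetical, and the loop only models the arithmetic; the real instruction performs all four additions at once.

```c
#include <stdint.h>

/* Emulate a wraparound packed add on four independent 16-bit lanes. */
void padd_word_wrap(const uint16_t a[4], const uint16_t b[4], uint16_t out[4]) {
    for (int i = 0; i < 4; i++) {
        /* The carry out of bit 15 is discarded, so 0xFFFF + 1 wraps to 0. */
        out[i] = (uint16_t)(a[i] + b[i]);
    }
}
```

With inputs `{1, 2, 3, 0xFFFF}` and `{1, 1, 1, 1}`, the first three lanes add normally and the last lane wraps to zero without disturbing its neighbors.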
No Mode
- There is no "saturation mode" bit: a new mode bit would require a change to the operating system.
- Separate instructions are used to generate wraparound and saturating results.

Packed Add Word with Unsigned Saturation
- Each addition is independent.
- The rightmost addition saturates.
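The unsigned saturating add above can be modeled the same way. This is a sketch under the assumption that each lane is an unsigned 16-bit word clamped at 0xFFFF; the name `padd_word_usat` is hypothetical.

```c
#include <stdint.h>

/* Emulate an unsigned saturating packed add on four 16-bit lanes. */
void padd_word_usat(const uint16_t a[4], const uint16_t b[4], uint16_t out[4]) {
    for (int i = 0; i < 4; i++) {
        uint32_t sum = (uint32_t)a[i] + (uint32_t)b[i];
        /* Clamp to the largest representable unsigned 16-bit value. */
        out[i] = (sum > 0xFFFFu) ? (uint16_t)0xFFFFu : (uint16_t)sum;
    }
}
```

Because saturation is selected by choosing a different function (a different instruction on real hardware) and not by a mode flag, wraparound and saturating adds can be mixed freely in the same loop.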
Multiply-Accumulate
Multiply-accumulate operations are fundamental to many signal-processing algorithms, such as vector dot products, matrix multiplies, FIR and IIR filters, FFTs, and DCTs.

Packed Multiply-Add
- Multiply four pairs of 16-bit words, generating four 32-bit products.
- Add the two products on the left for one result and the two products on the right for the other result.
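The packed multiply-add can be sketched in scalar C as follows. The name `pmadd_words` is hypothetical, the inputs are assumed to be signed 16-bit words, and which pair counts as "left" or "right" is just a labeling choice in this sketch.

```c
#include <stdint.h>

/* Emulate packed multiply-add: four 16x16 multiplies, then two pairwise adds. */
void pmadd_words(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    /* Sums are formed in 64 bits so the C arithmetic itself cannot overflow.
       Every result fits in 32 bits except the single corner case where both
       products in a pair are (-32768) * (-32768). */
    int64_t pair01 = (int64_t)a[0] * b[0] + (int64_t)a[1] * b[1];
    int64_t pair23 = (int64_t)a[2] * b[2] + (int64_t)a[3] * b[3];
    out[0] = (int32_t)pair01;
    out[1] = (int32_t)pair23;
}
```

For `a = {1, 2, 3, 4}` and `b = {5, 6, 7, 8}`, the two results are `1*5 + 2*6 = 17` and `3*7 + 4*8 = 53`.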
Packed Parallel Compare
- No new condition code flags are added.
- No existing IA condition code flags are affected by this instruction.
- The result can be used as a mask to select elements from different inputs using a logical operation, eliminating branches.
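A minimal sketch of the packed compare, assuming a signed greater-than comparison on four 16-bit lanes. The name `pcmpgt_words` is hypothetical. The scalar loop only models the result; the point is that the outcome lands in a data register as a mask, not in condition flags.

```c
#include <stdint.h>

/* Emulate a packed signed greater-than compare on four 16-bit lanes.
   Each output lane is all ones where a > b and all zeros otherwise. */
void pcmpgt_words(const int16_t a[4], const int16_t b[4], uint16_t mask[4]) {
    for (int i = 0; i < 4; i++)
        mask[i] = (a[i] > b[i]) ? (uint16_t)0xFFFFu : (uint16_t)0x0000u;
}
```

Comparing `{51, 3, 5, 23}` against `{73, 2, 5, 6}` yields the mask `{0x0000, 0xFFFF, 0x0000, 0xFFFF}`, which can then be combined with the inputs using AND and OR to pick elements without branching.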
Pack/Unpack
- Important when an algorithm needs higher precision in its intermediate calculations, as in image filtering.
- For example, image filtering involves a set of intermediate multiply operations between filter coefficients and a set of adjacent image pixels, with all the products accumulated together.

Pack
[Figure: pack operation]
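To illustrate why pack and unpack matter for intermediate precision, here is a sketch that widens 8-bit pixels to 16-bit lanes and later narrows them back with unsigned saturation. Both function names are hypothetical, and the choice of zero-extension on unpack and clamping to 0..255 on pack is an assumption suited to unsigned pixel data.

```c
#include <stdint.h>

/* Widen four 8-bit pixels to 16-bit lanes so intermediate math has headroom. */
void unpack_bytes_to_words(const uint8_t in[4], int16_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (int16_t)in[i];          /* zero-extension: 0..255 stays 0..255 */
}

/* Narrow four signed 16-bit lanes back to 8-bit pixels, clamping to 0..255. */
void pack_words_to_bytes_usat(const int16_t in[4], uint8_t out[4]) {
    for (int i = 0; i < 4; i++) {
        int16_t v = in[i];
        if (v < 0)
            out[i] = 0;
        else if (v > 255)
            out[i] = 255;
        else
            out[i] = (uint8_t)v;
    }
}
```

A filter would unpack the pixels, multiply and accumulate in the wider lanes, and pack only the final results, so overshoot such as 300 or -5 is clamped instead of wrapping.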
Conditional Select
- The chroma keying example demonstrates how conditional selection using the MMX instruction set removes branch mispredictions, in addition to performing multiple selection operations in parallel.
- Text overlay on a picture or video background and sprite overlays in games are other operations that benefit from this technique.

Chroma Keying
Chroma Keying (cont'd)
- Take pixels from the picture of the woman on a green background.
- A compare instruction builds a mask for that data. The mask is a sequence of bytes that are all ones or all zeros.
- The mask now identifies which pixels are unwanted background and which we want to keep.

Create Mask
- Assume pixels alternate green / not green.
Combine: AND NOT, AND, OR

Branch Removal
- Without MMX technology, each pixel is processed separately and requires a conditional branch.
- Using MMX instructions, eight 8-bit pixels can be processed in parallel, and no conditional branches are involved.
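The mask-and-combine steps of chroma keying can be sketched as follows. This is an emulation in scalar C, with the hypothetical name `chroma_key8` and a single 8-bit `key` value standing in for "green"; a real MMX version would apply each step to all eight bytes with one instruction.

```c
#include <stdint.h>

/* Composite eight pixels: where the foreground equals the key color,
   take the background pixel; otherwise keep the foreground pixel. */
void chroma_key8(const uint8_t fg[8], const uint8_t bg[8],
                 uint8_t key, uint8_t out[8]) {
    for (int i = 0; i < 8; i++) {
        /* Compare step: all ones where the pixel is the key color. */
        uint8_t mask = (fg[i] == key) ? (uint8_t)0xFF : (uint8_t)0x00;
        uint8_t keep_fg = (uint8_t)(~mask & fg[i]);  /* AND NOT: drop keyed pixels   */
        uint8_t take_bg = (uint8_t)(mask & bg[i]);   /* AND: background where keyed  */
        out[i] = (uint8_t)(keep_fg | take_bg);       /* OR: merge the two halves     */
    }
}
```

The selection itself uses only logical operations on the mask, which is what lets the eight-pixel version run without a per-pixel conditional branch.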
Vector Dot Product
- The vector dot product is one of the most basic algorithms used in signal processing of natural data such as images, audio, and video.
- PMADD does four multiplies and two adds at a time.
- Coupled with PADD, eight multiply-accumulate operations can be performed with two PMADDs and two PADDs.
Vector Dot Product (cont'd)
Assuming the precision is sufficient, a dot product on an 8-element vector can be completed using eight MMX instructions: two PMADDs, two PADDs, two shifts (if needed to fix the precision after the multiply), and two loads for one of the vectors. The other vector is loaded by the PMADD instruction, which can take one of its operands from memory.
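The structure of the 8-element dot product can be sketched in C by emulating the two multiply-add steps and the two adds. The helper `pmadd4` and the function `dot8` are hypothetical names, the precision-fixing shifts are omitted, and the sketch assumes the inputs are small enough that no 32-bit sum overflows.

```c
#include <stdint.h>

/* Emulated multiply-add: four 16-bit multiplies, two pairwise 32-bit sums. */
static void pmadd4(const int16_t *a, const int16_t *b, int32_t out[2]) {
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* Dot product of two 8-element vectors of 16-bit values. */
int32_t dot8(const int16_t a[8], const int16_t b[8]) {
    int32_t lo[2], hi[2], acc[2];
    pmadd4(a, b, lo);           /* first multiply-add: elements 0..3  */
    pmadd4(a + 4, b + 4, hi);   /* second multiply-add: elements 4..7 */
    acc[0] = lo[0] + hi[0];     /* first add: combine the partial sums per lane */
    acc[1] = lo[1] + hi[1];
    return acc[0] + acc[1];     /* second add: fold the two lanes into one sum */
}
```

For `a = {1, ..., 8}` and `b = {8, ..., 1}` this returns 120, the same value a plain eight-iteration loop would produce.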
Compare: Vector Dot Product
- With MMX technology, one third of the number of instructions is needed.
- Most MMX instructions execute in one clock cycle, so the performance improvement will be more dramatic than the simple ratio of instruction counts.
Matrix Multiply
- 3D games: computations that manipulate 3D objects use 4-by-4 matrices that are multiplied with 4-element vectors many times.
- Each vector holds the X, Y, Z, and perspective-correction information for each pixel.
- The 4-by-4 matrix is used to rotate, scale, translate, and update the perspective-correction information for each pixel.
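A minimal sketch of the 4-by-4 matrix times 4-element vector product, written so that each row mirrors one multiply-add followed by one add. The name `mat4_mul_vec4`, the row-major flat layout, and the use of 16-bit integer coordinates are assumptions made for illustration.

```c
#include <stdint.h>

/* Multiply a row-major 4x4 matrix (m[4*row + col]) by a 4-element vector. */
void mat4_mul_vec4(const int16_t m[16], const int16_t v[4], int32_t out[4]) {
    for (int r = 0; r < 4; r++) {
        const int16_t *row = &m[4 * r];
        /* One emulated multiply-add yields two partial sums per row... */
        int32_t lo = (int32_t)row[0] * v[0] + (int32_t)row[1] * v[1];
        int32_t hi = (int32_t)row[2] * v[2] + (int32_t)row[3] * v[3];
        /* ...and one add folds them into the output component. */
        out[r] = lo + hi;
    }
}
```

With a translation matrix whose last column is `(5, 6, 7, 1)` and the vector `(1, 2, 3, 1)`, the result is `(6, 8, 10, 1)`.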
Compare: Matrix Multiply
- MMX required half the instructions.
Image Dissolve Using Alpha Blending
- Dissolve a swan into a flower.
- Result_pixel = Flower_pixel * (alpha/255) + Swan_pixel * [1 - (alpha/255)]
- Assume 640x480 resolution.

Dissolve: Millions of Instructions
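The blending formula above can be sketched in integer arithmetic. The name `dissolve8` is hypothetical, and the choice to round to nearest by adding 127 before dividing by 255 is an assumption of this sketch, not something the slides specify.

```c
#include <stdint.h>

/* Blend eight pixel pairs:
   result = flower * (alpha/255) + swan * (1 - alpha/255). */
void dissolve8(const uint8_t flower[8], const uint8_t swan[8],
               uint8_t alpha, uint8_t out[8]) {
    for (int i = 0; i < 8; i++) {
        /* Widen to 32 bits first: the weighted sum can reach 255 * 255. */
        uint32_t blended = (uint32_t)flower[i] * alpha
                         + (uint32_t)swan[i] * (255u - alpha);
        out[i] = (uint8_t)((blended + 127u) / 255u);
    }
}
```

With `alpha = 255` the output is the flower, with `alpha = 0` it is the swan, and stepping `alpha` between the two produces the dissolve. The widening before the multiply is the same precision concern that pack and unpack address in the MMX version.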
Dissolve
- 1 billion fewer instructions for the 640x480 dissolve.
Conclusion
- MMX appeared in 1997 in Pentium processors (with a bigger cache).
- According to Intel, an MMX microprocessor runs a multimedia application up to 60% faster.
- In addition, it runs other applications about 10% faster.