Intel's MMX. Why MMX?


Intel's MMX
Dr. Richard Enbody, CSE 820

Why MMX?
Make the common case fast. Multimedia and communication applications consume significant computing resources, so providing specific hardware support makes sense.

Goals
- Accelerate multimedia and communications applications.
- Maintain full compatibility with existing operating systems and applications.
- Exploit the inherent parallelism in multimedia and communication algorithms.
- Include new instructions and data types to improve performance.

First Step: Examine Code
A wide range of applications was examined: graphics, MPEG video, music synthesis, speech compression, speech recognition, image processing, games and video conferencing. The most compute-intensive routines were identified and analyzed.

Common Characteristics
- Small integer data types, e.g. 8-bit pixels and 16-bit audio samples
- Small, highly repetitive loops
- Frequent multiply-and-accumulate operations
- Compute-intensive algorithms
- Highly parallel operations

MMX Technology
A set of basic, general-purpose integer instructions:
- Single Instruction, Multiple Data (SIMD)
- 57 new instructions
- Eight 64-bit wide MMX registers
- Four new data types

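Data Types
[Figures: the four MMX data types, which are packed byte (eight 8-bit values), packed word (four 16-bit values), packed doubleword (two 32-bit values) and quadword (one 64-bit value).]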

Example
Pixels are generally 8-bit integers. Pack eight pixels into a 64-bit MMX register. An MMX instruction takes all eight pixels at once from the MMX register, performs the arithmetic or logical operation on all eight elements in parallel, and writes the result into an MMX register.

Compatibility
No new exceptions or states are added. The MMX registers are aliased to the existing floating-point registers. When an MMX instruction writes a register, the exponent field of the corresponding floating-point register (bits 64-78) and the sign bit (bit 79) are set to all ones, making the value in the register a NaN (Not a Number) or infinity when viewed as a floating-point value.
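The packing idea in the example can be sketched in Python by letting an ordinary integer stand in for a 64-bit register. The helper names `pack_bytes`, `unpack_bytes` and `packed_add_bytes` are made up for this illustration and are not MMX mnemonics; on real hardware a single packed-add instruction would do the work of the loop.

```python
def pack_bytes(pixels):
    """Pack eight 8-bit values into one 64-bit integer (lane 0 in the low byte)."""
    assert len(pixels) == 8
    reg = 0
    for lane, value in enumerate(pixels):
        assert 0 <= value <= 0xFF
        reg |= value << (8 * lane)
    return reg


def unpack_bytes(reg):
    """Split a 64-bit integer back into its eight 8-bit lanes."""
    return [(reg >> (8 * lane)) & 0xFF for lane in range(8)]


def packed_add_bytes(a, b):
    """Lane-wise 8-bit add with wrap-around; no carry crosses a lane boundary."""
    lanes = [(x + y) & 0xFF for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(lanes)


pixels = [10, 20, 30, 40, 50, 60, 70, 80]
brighten = [5] * 8  # hypothetical brightness offset applied to every pixel
result = unpack_bytes(packed_add_bytes(pack_bytes(pixels), pack_bytes(brighten)))
print(result)
```

All eight pixels are updated by one call on one 64-bit value, which is the point of the packed data types.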

57 Instructions
- Basic arithmetic: add, subtract, multiply, arithmetic shift and multiply-add
- Comparison
- Conversion: pack and unpack
- Logical
- Shift
- Move: register-to-register
- Load/Store: 64-bit and 32-bit

Packed Add Word with Wrap-Around
Each addition is independent. The rightmost addition overflows and wraps around.

Saturation
If an addition results in overflow or underflow, the result is clamped to the largest or smallest representable value. This is important for pixel calculations, where saturation prevents a wrap-around add from causing a black pixel to suddenly turn white.

No Mode
There is no "saturation mode bit": a new mode bit would require a change to the operating system. Separate instructions are used to generate wrap-around and saturating results.

Packed Add Word with Unsigned Saturation
Each addition is independent. The rightmost addition saturates.
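The difference between the two kinds of add can be sketched as follows. The functions are Python emulations named after the PADDW (wrap-around) and PADDUSW (unsigned saturation) instructions; the lane values are invented for illustration, and each register is modelled as a list of four 16-bit lanes.

```python
MASK16 = 0xFFFF  # largest unsigned 16-bit value


def paddw(a, b):
    """Wrap-around add on four independent 16-bit lanes."""
    return [(x + y) & MASK16 for x, y in zip(a, b)]


def paddusw(a, b):
    """Unsigned saturating add: results above 0xFFFF are clamped to 0xFFFF."""
    return [min(x + y, MASK16) for x, y in zip(a, b)]


a = [0x7000, 0x0003, 0x1234, 0xF000]
b = [0x1000, 0x0004, 0x1111, 0x3000]

wrapped = paddw(a, b)      # last lane overflows and wraps to a small value
saturated = paddusw(a, b)  # last lane is clamped to the maximum instead

print([hex(v) for v in wrapped])
print([hex(v) for v in saturated])
```

Only the last lane differs between the two results. Because the behaviour is chosen by the instruction rather than by a mode bit, both kinds of add can be mixed freely in the same routine.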

Multiply-Accumulate
Multiply-accumulate operations are fundamental to many signal processing algorithms, such as vector dot products, matrix multiplies, FIR and IIR filters, FFTs and DCTs.

Packed Multiply-Add
Multiply four pairs of signed 16-bit words, generating four 32-bit results. Add the two products on the left for one result and the two products on the right for the other result.
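A rough Python emulation of the packed multiply-add follows. It is named after the PMADDWD instruction, which works on signed 16-bit words; the list-based register model and the `wrap32` helper are assumptions made for this sketch.

```python
def wrap32(value):
    """Reduce an integer to a signed 32-bit result, as a 32-bit lane would."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def pmaddwd(a, b):
    """Multiply four pairs of signed 16-bit words, then add adjacent products.

    Returns two signed 32-bit sums: [a0*b0 + a1*b1, a2*b2 + a3*b3].
    """
    assert len(a) == len(b) == 4
    for v in list(a) + list(b):
        assert -32768 <= v <= 32767
    products = [x * y for x, y in zip(a, b)]
    return [wrap32(products[0] + products[1]), wrap32(products[2] + products[3])]


sums = pmaddwd([1, 2, 3, 4], [5, 6, 7, 8])
print(sums)
```

Each call performs four multiplies and two adds, which is why the instruction maps so directly onto dot products and filters.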

Packed Parallel Compare
No new condition code flags are added, and no existing IA condition code flags are affected by this instruction. The result can be used as a mask to select elements from different inputs using a logical operation, eliminating branches.
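As an illustration of mask-based selection, the sketch below emulates a signed greater-than compare on 16-bit lanes (modelled on PCMPGTW) and uses the resulting mask to pick the larger element of each pair without a branch per element. The `select` helper and the sample values are hypothetical.

```python
MASK16 = 0xFFFF


def to_signed16(value):
    """Interpret a 16-bit pattern as a signed two's-complement number."""
    return value - 0x10000 if value & 0x8000 else value


def pcmpgtw(a, b):
    """Per lane: all ones (0xFFFF) where a > b as signed words, else all zeros."""
    return [MASK16 if to_signed16(x) > to_signed16(y) else 0 for x, y in zip(a, b)]


def select(mask, a, b):
    """Take a where the mask is all ones and b where it is all zeros."""
    return [(m & x) | (~m & MASK16 & y) for m, x, y in zip(mask, a, b)]


a = [5, 0xFFFF, 7, 0]        # 0xFFFF is -1 as a signed word
b = [3, 2, 7, 0xFFFC]        # 0xFFFC is -4 as a signed word
mask = pcmpgtw(a, b)
maximum = select(mask, a, b)  # lane-wise signed maximum
print([hex(m) for m in mask], maximum)
```

The compare writes its outcome into the data lanes rather than into flags, so the selection is done entirely with logical operations.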

Pack/Unpack
Important when an algorithm needs higher precision in its intermediate calculations, as in image filtering. For example, image filtering involves a set of intermediate multiply operations between filter coefficients and a set of adjacent image pixels, accumulating all the values together.

[Figure: Pack]
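A small sketch of why widening matters: the weighted sum below overflows 8 bits but fits comfortably in 16. The helper names `widen` and `pack_unsigned_saturate` are invented; they loosely model unpacking bytes against a zero register and packing words back to bytes with unsigned saturation. The 3:1 weighting is an arbitrary example filter.

```python
def widen(byte_lanes):
    """Zero-extend 8-bit lanes so they can be treated as 16-bit lanes."""
    return [b & 0xFF for b in byte_lanes]


def pack_unsigned_saturate(word_lanes):
    """Narrow signed 16-bit lanes to unsigned 8-bit lanes, clamping to 0..255."""
    return [max(0, min(255, w)) for w in word_lanes]


def blend_rows(p, q):
    """Compute (3*p + q) / 4 per lane at 16-bit intermediate precision."""
    acc = [3 * x + y for x, y in zip(widen(p), widen(q))]  # at most 4 * 255 = 1020
    return pack_unsigned_saturate([a >> 2 for a in acc])


def blend_rows_8bit(p, q):
    """Same formula with the sum truncated to 8 bits, to show the error."""
    return [((3 * x + y) & 0xFF) >> 2 for x, y in zip(p, q)]


row_p = [200, 250, 10, 100]
row_q = [100, 255, 20, 0]
print(blend_rows(row_p, row_q))
print(blend_rows_8bit(row_p, row_q))
```

The 8-bit version gives wrong values for the bright pixels because the intermediate sum no longer fits in a byte.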

Conditional Select
The chroma keying example demonstrates how conditional selection using the MMX instruction set removes branch mispredictions, in addition to performing multiple selection operations in parallel. Text overlays on a picture or video background and sprite overlays in games are some of the other operations that would benefit from this technique.

[Figure: Chroma Keying]

Chroma Keying (cont'd)
Take pixels from the picture of the woman on a green background. A compare instruction builds a mask for that data. The mask is a sequence of bytes that are all ones or all zeros, so we now know which pixels are unwanted background and which we want to keep.

Create Mask
Assume pixels alternate green/not_green.

Combine: AND NOT, AND, OR

Branch Removal
Without MMX technology, each pixel is processed separately and requires a conditional branch. Using MMX instructions, eight 8-bit pixels can be processed in parallel and no conditional branches are involved.
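The chroma keying steps (build a mask, then combine with AND NOT, AND and OR) can be sketched like this. The functions emulate byte-wise PCMPEQB, PANDN, PAND and POR on lists of eight lanes; the `GREEN` code and the pixel values are made up, and the foreground alternates green and non-green pixels as assumed on the slide.

```python
GREEN = 0x2C  # hypothetical 8-bit code for the key colour


def pcmpeqb(a, b):
    """Per byte: 0xFF where the lanes are equal, 0x00 otherwise."""
    return [0xFF if x == y else 0x00 for x, y in zip(a, b)]


def pand(a, b):
    return [x & y for x, y in zip(a, b)]


def pandn(a, b):
    """(NOT a) AND b, the operand order used by PANDN."""
    return [(~x & 0xFF) & y for x, y in zip(a, b)]


def por(a, b):
    return [x | y for x, y in zip(a, b)]


def chroma_key(foreground, background):
    mask = pcmpeqb(foreground, [GREEN] * len(foreground))  # ones where green
    keep_foreground = pandn(mask, foreground)  # subject pixels, green zeroed
    keep_background = pand(mask, background)   # new background where green was
    return por(keep_foreground, keep_background)


foreground = [GREEN, 10, GREEN, 20, GREEN, 30, GREEN, 40]
background = [1, 2, 3, 4, 5, 6, 7, 8]
composite = chroma_key(foreground, background)
print(composite)
```

No per-pixel `if` appears in `chroma_key`: the decision is encoded in the mask and applied with three logical operations.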

Vector Dot Product
The vector dot product is one of the most basic algorithms used in signal processing of natural data such as images, audio and video. PMADD performs four multiplies and two adds at a time. Coupled with PADD, eight multiply-accumulate operations can be performed with two PMADD and two PADD instructions.

Vector Dot Product
Assuming precision is sufficient, a dot product on an 8-element vector can be completed using eight MMX instructions: two PMADDs, two PADDs, two shifts (if needed to fix the precision after the multiply), and two loads for one of the vectors (the other vector is loaded by the PMADD instruction, which can take one of its operands from memory).
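One plausible arrangement of the two multiply-adds and two adds is sketched below in Python. The slide's actual instruction sequence is not shown in the text, so the ordering, the function names and the list-based register model are assumptions; 32-bit overflow and the optional precision shifts are ignored.

```python
def pmadd(a, b):
    """Four 16-bit multiplies plus two adds: [a0*b0 + a1*b1, a2*b2 + a3*b3]."""
    return [a[0] * b[0] + a[1] * b[1], a[2] * b[2] + a[3] * b[3]]


def padd(a, b):
    """Lane-wise add of two registers holding two 32-bit sums each."""
    return [x + y for x, y in zip(a, b)]


def dot8(x, y):
    """Dot product of two 8-element vectors of small integers."""
    low = pmadd(x[0:4], y[0:4])    # first multiply-add: elements 0..3
    high = pmadd(x[4:8], y[4:8])   # second multiply-add: elements 4..7
    partial = padd(low, high)      # first add: two partial sums remain
    # Second add: on real hardware the upper lane would be shifted down
    # next to the lower lane before adding.
    return partial[0] + partial[1]


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [8, 7, 6, 5, 4, 3, 2, 1]
print(dot8(x, y))
```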

Compare
With MMX technology, one third of the number of instructions is needed. Most MMX instructions execute in one clock cycle, so the performance improvement will be more dramatic than the simple ratio of instruction counts.

Matrix Multiply
3D games: computations that manipulate 3D objects use 4-by-4 matrices that are multiplied with 4-element vectors many times. Each vector holds the X, Y, Z and perspective-correction information for each pixel. The 4-by-4 matrix is used to rotate, scale, translate and update the perspective-correction information for each pixel.
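A minimal sketch of the 4-by-4 matrix times 4-element vector product, built from an emulated packed multiply-add so that each matrix row costs one multiply-add and one add. Integer coordinates are used because MMX is integer-only; the translation matrix and the point are invented examples, and the function names are hypothetical.

```python
def pmadd(a, b):
    """Four multiplies plus two adds: [a0*b0 + a1*b1, a2*b2 + a3*b3]."""
    return [a[0] * b[0] + a[1] * b[1], a[2] * b[2] + a[3] * b[3]]


def mat_vec4(matrix, vec):
    """Multiply a 4x4 integer matrix by a 4-element integer vector."""
    out = []
    for row in matrix:
        low, high = pmadd(row, vec)  # one packed multiply-add per row
        out.append(low + high)       # combine the two partial sums
    return out


# Translate the point (2, 3, 4) by (10, 20, 30); the fourth component
# carries the perspective term and is left at 1 here.
translate = [
    [1, 0, 0, 10],
    [0, 1, 0, 20],
    [0, 0, 1, 30],
    [0, 0, 0, 1],
]
point = [2, 3, 4, 1]
print(mat_vec4(translate, point))
```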

Compare: Matrix Multiply
MMX required half the instructions.

Image Dissolve Using Alpha Blending
Dissolve a swan into a flower:
Result_pixel = Flower_pixel * (alpha/255) + Swan_pixel * [1 - (alpha/255)]
Assume 640x480 resolution.

[Figure: Dissolve, millions of instructions]
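The blending formula can be written in integer arithmetic, which is the form an integer SIMD unit would need. The sketch below multiplies the formula through by 255 and adds a rounding term; the rounding choice and the names `blend_pixel` and `dissolve` are assumptions, not taken from the slides.

```python
PIXELS_PER_FRAME = 640 * 480  # number of pixels blended for each alpha step


def blend_pixel(flower, swan, alpha):
    """Integer form of flower*(alpha/255) + swan*(1 - alpha/255), rounded."""
    assert 0 <= alpha <= 255
    return (flower * alpha + swan * (255 - alpha) + 127) // 255


def dissolve(flower_img, swan_img, alpha):
    """Blend two equally sized lists of 8-bit pixel values."""
    return [blend_pixel(f, s, alpha) for f, s in zip(flower_img, swan_img)]


flower = [0, 255, 200]
swan = [255, 0, 100]
print(dissolve(flower, swan, 128))
print(PIXELS_PER_FRAME)
```

At alpha = 255 the result is the flower pixel and at alpha = 0 it is the swan pixel, matching the two ends of the dissolve.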

Dissolve
1 billion fewer instructions for the 640x480 dissolve.

Conclusion
MMX appeared in 1997 in Pentium processors (with a bigger cache). According to Intel, an MMX microprocessor runs a multimedia application up to 60% faster. In addition, it runs other applications about 10% faster.