CS802 Parallel Processing Class Notes

MMX Technology
Instructor: Dr. Chang N. Zhang
Winter Semester, 2006

Intel MMX Technology

Chapter 1: Introduction to MMX Technology

1.1 Features of the MMX Technology
- MMX technology accelerates multimedia and communication applications by adding new instructions and defining new 64-bit data types.
- MMX technology introduces new general-purpose instructions that operate in parallel on multiple data elements packed into 64-bit quantities. These instructions accelerate applications with compute-intensive algorithms that perform localized, recurring operations on small native data. Such applications include motion video, combined graphics with video, image processing, audio synthesis, speech synthesis and compression, telephony, video conferencing, 2D graphics, and 3D graphics.
- Single Instruction, Multiple Data (SIMD) technique: MMX technology uses SIMD to speed up software by processing multiple data elements in parallel with a single instruction. It supports parallel operations on byte, word, and doubleword data elements, and on the new quadword (64-bit) integer data type.
- 57 new instructions
- Eight 64-bit wide MMX registers (MM0 to MM7)
- Four new data types

1.2 Advantages of the MMX Technology
- SIMD provides parallelism, greatly increasing performance on the PC platform.
- MMX technology is integrated into Intel Architecture (IA) processors and is fully compatible with existing operating systems.
- Existing IA software runs on MMX technology-enabled systems.
- MMX technology can be used in applications, algorithms, and drivers.

Chapter 2: MMX New Data Types & MMX Registers

2.1 MMX New Data Types
The principal data type of the IA MMX technology is the packed fixed-point integer. The decimal point of the fixed-point values is implicit and is left for the user to control for maximum flexibility. The IA MMX technology defines the following four new 64-bit quantities:
(1) Packed byte: eight bytes packed into one 64-bit quantity
(2) Packed word: four words packed into one 64-bit quantity
(3) Packed doubleword: two doublewords packed into one 64-bit quantity
(4) Quadword: one 64-bit quantity

2.2 MMX Registers
The IA MMX technology provides eight 64-bit, general-purpose registers. The registers are aliased onto the floating-point registers, so the operating system handles MMX state exactly as it handles floating-point state. The MMX registers can hold the packed 64-bit data types. MMX instructions access the registers directly by the names MM0 to MM7. The MMX registers can be used to perform calculations on data. They cannot be used to address memory; addressing is done with the integer registers and the standard IA addressing modes.
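The packed data types of section 2.1 can be pictured as different ways of slicing the same 64 bits. The following is a minimal sketch in portable C, not MMX code: it models a 64-bit quantity as a uint64_t and extracts lanes by shifting. The function names lane_byte, lane_word, and lane_dword are hypothetical, chosen for illustration only.

```c
#include <stdint.h>

/* Lane 0 is the least significant element, matching the mm(7..0),
   mm(15..0), mm(31..0) bit ranges used in these notes. */

/* Byte lane i (0..7) of a packed-byte quantity. */
uint8_t lane_byte(uint64_t q, unsigned i)
{
    return (uint8_t)(q >> (8 * i));
}

/* Word lane i (0..3) of a packed-word quantity. */
uint16_t lane_word(uint64_t q, unsigned i)
{
    return (uint16_t)(q >> (16 * i));
}

/* Doubleword lane i (0..1) of a packed-doubleword quantity. */
uint32_t lane_dword(uint64_t q, unsigned i)
{
    return (uint32_t)(q >> (32 * i));
}
```

For the quantity 0x1122334455667788, byte lane 0 is 0x88, word lane 1 is 0x5566, and doubleword lane 1 is 0x11223344.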

Chapter 3: MMX Instructions (Total 57) Overview

3.1 Types of Instructions
- arithmetic: add, subtract, multiply, arithmetic shift, and multiply-add
- comparison: compare for equality, compare for greater than
- logic: AND, AND NOT, OR, and XOR
- shift: shift left logical, shift right logical, shift right arithmetic
- conversion: pack, unpack
- data transfer: MOVD, MOVQ
- EMMS: empty MMX state

3.2 MMX Instructions: Syntax
A typical MMX instruction name consists of:
- Prefix: P for Packed
- Instruction operation: for example, ADD, CMP, XOR
- Suffix: US for unsigned saturation, S for signed saturation, and B, W, D, or Q for the data type (byte, word, doubleword, quadword)
Example: PADDUSW = Packed Add Unsigned with Saturation, for words

3.3 MMX Instructions: Format
For data transfer instructions:
- destination and source operands can reside in memory, integer registers, or MMX registers
For all other IA MMX instructions:
- destination operand: MMX register
- source operand: MMX register, memory, or immediate operand

3.4 MMX Instructions: Conventions
- The source operand is written on the right.
- The destination operand is written on the left.
  e.g. PSLLW mm, mm/m64
- A memory address refers to the least significant byte of the data.

3.5 MMX Instructions: Conventions
Wrap around: on overflow or underflow the result is truncated and only the lower (least significant) bits are returned. The carry is ignored.
Saturation: on overflow or underflow the result is clipped (saturated) to the data-range limit of the data type.

Data type        Lower limit   Upper limit
signed byte      80H           7FH
signed word      8000H         7FFFH
unsigned byte    00H           FFH
unsigned word    0000H         FFFFH

e.g. for unsigned bytes, E5H + 62H = 147H, which does not fit in a byte:
    E5H + 62H = FFH (saturation)
    E5H + 62H = 47H (wrap around)
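The difference between the two overflow conventions of section 3.5 can be checked with ordinary scalar C. This is a sketch of the arithmetic for a single data element, not of the MMX instructions themselves; the function names are hypothetical.

```c
#include <stdint.h>

/* Wrap around: keep only the low 8 bits of the sum, ignore the carry. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Unsigned saturation: clip the sum to the range 00H..FFH. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return sum > 0xFF ? 0xFF : (uint8_t)sum;
}

/* Signed saturation: clip the sum to the range 80H..7FH (-128..127). */
int8_t add_sat_s8(int8_t a, int8_t b)
{
    int sum = a + b;
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}
```

With the operands from the example, add_sat_u8(0xE5, 0x62) gives FFH and add_wrap_u8(0xE5, 0x62) gives 47H.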

Chapter 4: MMX Instructions

4.1 Arithmetic (PADD, Wrap Around)

PADDB mm, mm/m64
Operation:
    mm(7..0)   <- mm(7..0)   + mm/m64(7..0)
    mm(15..8)  <- mm(15..8)  + mm/m64(15..8)
    ...
    mm(63..56) <- mm(63..56) + mm/m64(63..56)

PADDW mm, mm/m64
Operation:
    mm(15..0)  <- mm(15..0)  + mm/m64(15..0)
    mm(31..16) <- mm(31..16) + mm/m64(31..16)
    ...
    mm(63..48) <- mm(63..48) + mm/m64(63..48)
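The lane-by-lane behaviour of PADDB and PADDW can be emulated in portable C. The sketch below models an MMX register as a uint64_t and is meant only to illustrate the operation tables above; the functions paddb and paddw are illustrative emulations, not calls into any library.

```c
#include <stdint.h>

/* Emulate PADDB: eight independent 8-bit additions.
   A carry out of one byte never reaches the next byte. */
uint64_t paddb(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 8; i++) {
        uint8_t a = (uint8_t)(dst >> (8 * i));
        uint8_t b = (uint8_t)(src >> (8 * i));
        result |= (uint64_t)(uint8_t)(a + b) << (8 * i);
    }
    return result;
}

/* Emulate PADDW: four independent 16-bit additions. */
uint64_t paddw(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint16_t a = (uint16_t)(dst >> (16 * i));
        uint16_t b = (uint16_t)(src >> (16 * i));
        result |= (uint64_t)(uint16_t)(a + b) << (16 * i);
    }
    return result;
}
```

Adding 1 to 0xFF shows the difference in lane width: paddb wraps the low byte to 00 and leaves the next byte untouched, while paddw carries into bit 8 because both bytes belong to the same word.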

PADDD mm, mm/m64
Operation:
    mm(31..0)  <- mm(31..0)  + mm/m64(31..0)
    mm(63..32) <- mm(63..32) + mm/m64(63..32)

4.2 Arithmetic (PADD, Saturation)

PADDSB mm, mm/m64
Operation:
    mm(7..0)   <- SaturateToSignedByte(mm(7..0)   + mm/m64(7..0))
    mm(15..8)  <- SaturateToSignedByte(mm(15..8)  + mm/m64(15..8))
    ...
    mm(63..56) <- SaturateToSignedByte(mm(63..56) + mm/m64(63..56))

PADDSW mm, mm/m64
Operation:
    mm(15..0)  <- SaturateToSignedWord(mm(15..0)  + mm/m64(15..0))
    mm(31..16) <- SaturateToSignedWord(mm(31..16) + mm/m64(31..16))
    ...
    mm(63..48) <- SaturateToSignedWord(mm(63..48) + mm/m64(63..48))
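As an illustration of the PADDSW table, the sketch below emulates the four saturating word additions in portable C. It assumes a two's complement target (so that converting a uint16_t bit pattern to int16_t reinterprets the bits), and the names paddsw and saturate_to_signed_word are illustrative only.

```c
#include <stdint.h>

/* Clip a 32-bit intermediate sum to the signed word range 8000H..7FFFH. */
static int16_t saturate_to_signed_word(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Emulate PADDSW: four independent signed 16-bit additions with saturation. */
uint64_t paddsw(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 4; i++) {
        /* Reinterpret each 16-bit lane as a signed value (two's complement assumed). */
        int16_t a = (int16_t)(uint16_t)(dst >> (16 * i));
        int16_t b = (int16_t)(uint16_t)(src >> (16 * i));
        int16_t s = saturate_to_signed_word((int32_t)a + (int32_t)b);
        result |= (uint64_t)(uint16_t)s << (16 * i);
    }
    return result;
}
```

In one call, a lane holding 7FFFH plus 1 stays at 7FFFH and a lane holding 8000H plus -1 stays at 8000H, while lanes that do not overflow are added normally.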

4.3 Arithmetic

Packed Add Unsigned with Saturation
- PADDUSB mm, mm/m64
- PADDUSW mm, mm/m64

Subtraction
- PSUB[B, W, D] mm, mm/m64 (wrap around)
- PSUBS[B, W] mm, mm/m64 (signed saturation)
- PSUBUS[B, W] mm, mm/m64 (unsigned saturation)

4.4 Arithmetic

Packed Multiply and Add
- PMADDWD mm, mm/m64
Multiply the packed words in the MMX register by the packed words in the MMX register/memory operand. Add the 32-bit products pairwise and store the sums in the MMX register as doublewords.

Packed Multiply High
- PMULHW mm, mm/m64
Multiply the signed packed words in the MMX register by the signed packed words in the MMX register/memory operand, then store the high-order 16 bits of each result in the MMX register.
    mm(15..0)  <- (mm(15..0)  * mm/m64(15..0))(31..16)
    mm(31..16) <- (mm(31..16) * mm/m64(31..16))(31..16)
    mm(47..32) <- (mm(47..32) * mm/m64(47..32))(31..16)
    mm(63..48) <- (mm(63..48) * mm/m64(63..48))(31..16)

Packed Multiply Low
- PMULLW mm, mm/m64
Multiply the signed packed words in the MMX register by the signed packed words in the MMX register/memory operand, then store the low-order 16 bits of each result in the MMX register.
    mm(15..0)  <- (mm(15..0)  * mm/m64(15..0))(15..0)
    mm(31..16) <- (mm(31..16) * mm/m64(31..16))(15..0)
    mm(47..32) <- (mm(47..32) * mm/m64(47..32))(15..0)
    mm(63..48) <- (mm(63..48) * mm/m64(63..48))(15..0)

4.5 Comparison

Packed Compare for Equality [byte, word, doubleword]
- PCMPEQB mm, mm/m64: each byte becomes 0xff if equal, otherwise 0
- PCMPEQW mm, mm/m64: each word becomes 0xffff if equal, otherwise 0
- PCMPEQD mm, mm/m64: each doubleword becomes 0xffffffff if equal, otherwise 0

Packed Compare for Greater Than
- PCMPGT[B, W, D] mm, mm/m64

4.6 Logic

Bit-wise Logical Exclusive OR
- PXOR mm, mm/m64:  mm <- mm XOR mm/m64
Bit-wise Logical AND
- PAND mm, mm/m64:  mm <- mm AND mm/m64
Bit-wise Logical AND NOT
- PANDN mm, mm/m64: mm <- (NOT mm) AND mm/m64
Bit-wise Logical OR
- POR mm, mm/m64:   mm <- mm OR mm/m64

4.7 Shift

Packed shift left logical (shifting in zeros)
- PSLL[W, D, Q] mm, mm/m64
Packed shift right logical (shifting in zeros)
- PSRL[W, D, Q] mm, mm/m64
Packed shift right arithmetic (shifting in sign bits)
- PSRA[W, D] mm, mm/m64
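Returning to the multiply instructions of section 4.4: each 16-bit by 16-bit product is 32 bits wide, and PMULHW, PMULLW, and PMADDWD differ only in what they keep of it. The following portable C sketch emulates the three on a uint64_t; the function names mirror the mnemonics but are illustrative emulations, and two's complement conversion is assumed.

```c
#include <stdint.h>

/* Signed word lane i (0..3) of a 64-bit quantity. */
static int16_t word_lane(uint64_t q, unsigned i)
{
    return (int16_t)(uint16_t)(q >> (16 * i));
}

/* Emulate PMULLW: keep bits 15..0 of each 32-bit product. */
uint64_t pmullw(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 4; i++) {
        int32_t p = (int32_t)word_lane(dst, i) * word_lane(src, i);
        result |= (uint64_t)((uint32_t)p & 0xFFFFu) << (16 * i);
    }
    return result;
}

/* Emulate PMULHW: keep bits 31..16 of each 32-bit product. */
uint64_t pmulhw(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 4; i++) {
        int32_t p = (int32_t)word_lane(dst, i) * word_lane(src, i);
        result |= (uint64_t)(((uint32_t)p >> 16) & 0xFFFFu) << (16 * i);
    }
    return result;
}

/* Emulate PMADDWD: multiply four word pairs, then add adjacent
   products so that each doubleword holds one two-term sum. */
uint64_t pmaddwd(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 2; i++) {
        int32_t lo = (int32_t)word_lane(dst, 2 * i)     * word_lane(src, 2 * i);
        int32_t hi = (int32_t)word_lane(dst, 2 * i + 1) * word_lane(src, 2 * i + 1);
        /* Unsigned addition so that the one overflowing case wraps. */
        uint32_t sum = (uint32_t)lo + (uint32_t)hi;
        result |= (uint64_t)sum << (32 * i);
    }
    return result;
}
```

For example, 1000 * 1000 = 1,000,000 = 000F4240H, so pmullw yields 4240H and pmulhw yields 000FH for that lane. With the words (1, 2, 3, 4) and (5, 6, 7, 8), pmaddwd produces the doublewords 1*5 + 2*6 = 17 and 3*7 + 4*8 = 53, which is why the instruction is a natural building block for dot products.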

4.8 Conversion

Pack with unsigned saturation
- PACKUSWB mm, mm/m64
Pack and saturate the signed words from the MMX register and the MMX register/memory operand into unsigned bytes in the MMX register.
    mm(7..0)   <- SaturateSignedWordToUnsignedByte mm(15..0)
    mm(15..8)  <- SaturateSignedWordToUnsignedByte mm(31..16)
    mm(23..16) <- SaturateSignedWordToUnsignedByte mm(47..32)
    mm(31..24) <- SaturateSignedWordToUnsignedByte mm(63..48)
    mm(39..32) <- SaturateSignedWordToUnsignedByte mm/m64(15..0)
    mm(47..40) <- SaturateSignedWordToUnsignedByte mm/m64(31..16)
    mm(55..48) <- SaturateSignedWordToUnsignedByte mm/m64(47..32)
    mm(63..56) <- SaturateSignedWordToUnsignedByte mm/m64(63..48)

Pack with signed saturation
- PACKSSWB mm, mm/m64
Pack and saturate the signed words from the MMX register and the MMX register/memory operand into signed bytes in the MMX register.
    mm(7..0)   <- SaturateSignedWordToSignedByte mm(15..0)
    mm(15..8)  <- SaturateSignedWordToSignedByte mm(31..16)
    mm(23..16) <- SaturateSignedWordToSignedByte mm(47..32)
    mm(31..24) <- SaturateSignedWordToSignedByte mm(63..48)
    mm(39..32) <- SaturateSignedWordToSignedByte mm/m64(15..0)
    mm(47..40) <- SaturateSignedWordToSignedByte mm/m64(31..16)
    mm(55..48) <- SaturateSignedWordToSignedByte mm/m64(47..32)
    mm(63..56) <- SaturateSignedWordToSignedByte mm/m64(63..48)

Pack with signed saturation
- PACKSSDW mm, mm/m64
Pack and saturate the signed doublewords from the MMX register and the MMX register/memory operand into signed words in the MMX register.
    mm(15..0)  <- SaturateSignedDwordToSignedWord mm(31..0)
    mm(31..16) <- SaturateSignedDwordToSignedWord mm(63..32)
    mm(47..32) <- SaturateSignedDwordToSignedWord mm/m64(31..0)
    mm(63..48) <- SaturateSignedDwordToSignedWord mm/m64(63..32)

Unpack High Packed Data
- PUNPCKH[BW, WD, DQ] mm, mm/m64
Unpack and interleave the high-order data elements of the destination and source operands into the destination operand. The low-order elements are ignored.
E.g. PUNPCKHWD:
    mm(63..48) <- mm/m64(63..48)
    mm(47..32) <- mm(63..48)
    mm(31..16) <- mm/m64(47..32)
    mm(15..0)  <- mm(47..32)

Unpack Low Packed Data
- PUNPCKL[BW, WD, DQ] mm, mm/m64
Unpack and interleave the low-order data elements of the destination and source operands into the destination operand. The high-order elements are ignored.
E.g. PUNPCKLWD:
    mm(63..48) <- mm/m64(31..16)
    mm(47..32) <- mm(31..16)
    mm(31..16) <- mm/m64(15..0)
    mm(15..0)  <- mm(15..0)
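The pack and unpack tables are easier to follow with concrete values. The sketch below emulates PACKUSWB and PUNPCKLWD in portable C on uint64_t values; it is an illustration of the tables above under a two's complement assumption, and the function names are illustrative emulations rather than real library calls.

```c
#include <stdint.h>

/* Saturate a signed word to the unsigned byte range 00H..FFH. */
static uint8_t sat_s16_to_u8(int16_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Emulate PACKUSWB: the four words of dst fill the low four bytes,
   the four words of src fill the high four bytes. */
uint64_t packuswb(uint64_t dst, uint64_t src)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 4; i++) {
        int16_t d = (int16_t)(uint16_t)(dst >> (16 * i));
        int16_t s = (int16_t)(uint16_t)(src >> (16 * i));
        result |= (uint64_t)sat_s16_to_u8(d) << (8 * i);
        result |= (uint64_t)sat_s16_to_u8(s) << (8 * (i + 4));
    }
    return result;
}

/* Emulate PUNPCKLWD: interleave the two low words of dst and src.
   All four outputs are computed from the original operand values. */
uint64_t punpcklwd(uint64_t dst, uint64_t src)
{
    uint64_t d0 = dst & 0xFFFF, d1 = (dst >> 16) & 0xFFFF;
    uint64_t s0 = src & 0xFFFF, s1 = (src >> 16) & 0xFFFF;
    return d0 | (s0 << 16) | (d1 << 32) | (s1 << 48);
}
```

Packing the words 007FH, FFFFH (-1), 0080H, and 0100H (256) gives the bytes 7FH, 00H, 80H, and FFH: negative words clip to 0 and words above 255 clip to FFH. Calling punpcklwd with a zero source is a common way to zero-extend words to doublewords.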

4.9 Data Transfer

Move 32 bits
- MOVD mm, r/m32
Move 32 bits from an integer register/memory to an MMX register.
    mm(63..0) <- ZeroExtend(r/m32)
- MOVD r/m32, mm
Move 32 bits from an MMX register to an integer register/memory.
    r/m32 <- mm(31..0)

Move 64 bits
- MOVQ mm, mm/m64
Move 64 bits from an MMX register/memory to an MMX register.
    mm <- mm/m64
- MOVQ mm/m64, mm
Move 64 bits from an MMX register to an MMX register/memory.
    mm/m64 <- mm

4.10 Instruction Samples
e.g.
    MOVD  MM0, EAX;
    PSLLQ MM0, 32;
    MOVD  MM1, EBX;
    POR   MM0, MM1;
    MOVQ  MM2, MM3;
    PSLLQ MM3, 1;
    PXOR  MM3, MM2;
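The instruction samples of section 4.10 can be traced in ordinary C by treating each MMX register as a uint64_t variable. The sketch below is one reading of the listing, splitting it into two independent fragments (the first four instructions and the last three); that split and the function names are assumptions made for illustration.

```c
#include <stdint.h>

/* MOVD MM0, EAX; PSLLQ MM0, 32; MOVD MM1, EBX; POR MM0, MM1
   Builds a 64-bit value with EAX in the high half and EBX in the low half. */
uint64_t combine_dwords(uint32_t eax, uint32_t ebx)
{
    uint64_t mm0 = eax;   /* MOVD zero-extends to 64 bits */
    mm0 <<= 32;           /* PSLLQ MM0, 32 */
    uint64_t mm1 = ebx;   /* MOVD MM1, EBX */
    mm0 |= mm1;           /* POR MM0, MM1 */
    return mm0;
}

/* MOVQ MM2, MM3; PSLLQ MM3, 1; PXOR MM3, MM2
   Computes (MM3 << 1) XOR MM3; the bit shifted out at the top is lost. */
uint64_t shift_xor(uint64_t mm3)
{
    uint64_t mm2 = mm3;   /* MOVQ MM2, MM3 */
    mm3 <<= 1;            /* PSLLQ MM3, 1 */
    mm3 ^= mm2;           /* PXOR MM3, MM2 */
    return mm3;
}
```

For example, combine_dwords(0x12345678, 0x9ABCDEF0) yields 0x123456789ABCDEF0, and shift_xor(3) yields 6 XOR 3 = 5.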

Chapter 5: MMX Code Optimization

5.1 Code Optimization Guidelines
- Use a current compiler.
- Do not intermix MMX instructions and FP instructions.
- Use the opcode reg, mem instruction format whenever possible.
- Put an EMMS instruction at the end of every MMX code section that transitions to FP code.
- Optimize data cache bandwidth to the MMX registers.

5.2 Accessing Memory
On the Pentium II and III:
- opcode reg, mem (2 micro-ops)
- opcode reg, reg (1 micro-op)
Recommendation: merge loads whenever the same address is used more than once (memory-bound code).
Recommendation: merge loads whenever the same address is used more than twice (code that is not memory-bound).
To merge loads, change
    MOVQ reg, reg and opcode reg, mem
to
    MOVQ reg, mem and opcode reg, reg
to save one micro-op.

Chapter 6: Programming Tools and Examples

6.1 Programming Tools
- MASM 6.11 or above, with the 6.14 patch (install ML614.exe).
- VC++ 6.0 can compile MMX instructions; key functions are written in assembly language, including the MMX instructions.
- Some C/C++ compilers also include the MASM tool.

6.2 Programming Examples

    // Name: cpu_test.c
    // Purpose: to test some MMX instructions
    // Note: uses 32-bit Visual C++ inline assembly (__asm)

    #include <stdio.h>

    // Test whether the CPU is MMX compatible
    int cpu_test(void);

    // Shift x left by 16 bits, append the low 8 bits of y, and return the result
    unsigned int MMX_test(unsigned int x, unsigned int y);

    // Main function for the program
    int main(void)
    {
        unsigned int x = 0x12345678;
        unsigned int y = 0x99999999;

        int found_mmx = cpu_test();
        if (found_mmx == 1)
            printf("This CPU supports MMX technology\n");
        else {
            printf("This CPU does NOT support MMX technology\n");
            return 1;   // do not execute MMX instructions on this CPU
        }

        // Test the MMX instructions
        printf("The original value of x is 0x%x\n", x);
        printf("The original value of y is 0x%x\n", y);
        x = MMX_test(x, y);
        printf("After left shifting 16 bit of x and appending the\n");
        printf("low 8 bit of y, value of x is 0x%x\n", x);
        return 0;
    }

    // Function Name: cpu_test
    // Return: 1 if the CPU supports MMX, otherwise 2
    int cpu_test(void)
    {
        __asm {
            push ebx                // CPUID overwrites EBX
            mov  eax, 1             // request the feature flags
            cpuid
            test edx, 00800000h     // bit 23 of EDX is the MMX flag
            jnz  found
            mov  eax, 2
            jmp  done
        found:
            mov  eax, 1
        done:
            pop  ebx
        }
        /* Return with result in EAX */
    }

    // Function Name: MMX_test
    // Parameters: two unsigned integers x and y
    // Purpose: to test some MMX instructions
    // Return: x shifted left by 16 bits with the low 8 bits of y appended
    unsigned int MMX_test(unsigned int x, unsigned int y)
    {
        __asm {
            mov   eax, x
            mov   ebx, y
            movd  mm0, eax          // mm0 = x, zero-extended to 64 bits
            mov   eax, 0xff
            movd  mm2, eax          // mm2 = mask for the low 8 bits
            psllq mm0, 16           // mm0 = x << 16
            movd  mm1, ebx          // mm1 = y
            pand  mm1, mm2          // mm1 = y & 0xff
            por   mm0, mm1          // mm0 = (x << 16) | (y & 0xff)
            movd  eax, mm0          // eax = low 32 bits of mm0
            emms                    // empty the MMX state before returning
        }
        /* Return with result in EAX */
    }

Results:

    This CPU supports MMX technology
    The original value of x is 0x12345678
    The original value of y is 0x99999999
    After left shifting 16 bit of x and appending the
    low 8 bit of y, value of x is 0x56780099

Appendix: MMX Instructions Sheet
