Accelerating 3D Geometry Transformation with Intel MMX TM Technology

Similar documents
Using MMX Instructions to Perform 3D Geometry Transformations

An Efficient Vector/Matrix Multiply Routine using MMX Technology

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using MMX Instructions to Perform 16-Bit x 31-Bit Multiplication

Using MMX Instructions to Perform Simple Vector Operations

Using MMX Instructions to Compute the L2 Norm Between Two 16-Bit Vectors

CS802 Parallel Processing Class Notes

Using MMX Instructions to Implement a Synthesis Sub-Band Filter for MPEG Audio Decoding

High Performance Computing. Classes of computing SISD. Computation Consists of :

Using MMX Instructions to Compute the L1 Norm Between Two 16-bit Vectors

Optimizing Memory Bandwidth

Using MMX Instructions to Implement a 1/3T Equalizer

AnySP: Anytime Anywhere Anyway Signal Processing

Take Home Final Examination (From noon, May 5, 2004 to noon, May 12, 2004)

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH

Using MMX Instructions to Implement 2D Sprite Overlay

Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement

Intel s MMX. Why MMX?

Using MMX Instructions to implement 2X 8-bit Image Scaling

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

The Skeleton Assembly Line

Using MMX Instructions to Implement Viterbi Decoding

Using MMX Instructions to Compute the AbsoluteDifference in Motion Estimation

CMSC 611: Advanced Computer Architecture

Hardware and Software Optimisation. Tom Spink

Mapping Vector Codes to a Stream Processor (Imagine)

Using MMX Instructions to Implement the G.728 Codebook Search

Using MMX Instructions to Implement a Modem Baseband Canceler

Assembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture Included elements of the IA-64 bit

History of the Intel 80x86

Using Streaming SIMD Extensions in a Fast DCT Algorithm for MPEG Encoding

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

EECE.3170: Microprocessor Systems Design I Summer 2017 Homework 4 Solution

CS220. April 25, 2007

Media Instructions, Coprocessors, and Hardware Accelerators. Overview

SYSTEM OPTIMIZATION OF SOLVING A SET OF LINEAR CONGRUENCIES

Lecture 4. Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy

Baseline V IRAM Trimedia. Cycles ( x 1000 ) N

PERFORMANCE ANALYSIS OF ALTERNATIVE STRUCTURES FOR 16-BIT INTEGER FIR FILTER IMPLEMENTED ON ALTIVEC SIMD PROCESSING UNIT

x = 12 x = 12 1x = 16

printf("this program adds the value 10 to a given integer number.\n\n");

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

Intel SIMD architecture. Computer Organization and Assembly Languages Yung-Yu Chuang

Using MMX Instructions for 3D Bilinear Texture Mapping

Microprocessor Extensions for Wireless Communications

Figure 1: 128-bit registers introduced by SSE. 128 bits. xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7

Using MMX Instructions to Implement a Row Filter Algorithm

Load Effective Address Part I Written By: Vandad Nahavandi Pour Web-site:

Vertex Shader Design I

Ramblr. Making Reassembly Great Again

Characterization of Native Signal Processing Extensions

Parallel FIR Filters. Chapter 5

Interfacing Compiler and Hardware. Computer Systems Architecture. Processor Types And Instruction Sets. What Instructions Should A Processor Offer?

ESTIMATING MULTIMEDIA INSTRUCTION PERFORMANCE BASED ON WORKLOAD CHARACTERIZATION AND MEASUREMENT

16.317: Microprocessor Systems Design I Fall 2014

Chapter 11. Addressing Modes

Power Estimation of UVA CS754 CMP Architecture

Pipelining and Vector Processing

Geometry. Prof. George Wolberg Dept. of Computer Science City College of New York

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *

Optimized ARM-Based Implementation of Low Density Parity Check Code (LDPC) Decoder in China Digital Radio (CDR)

CONCORDIA UNIVERSITY Department of Computer Science and Software Engineering COMP 228/4 Section PP Midterm Exam

SA31675 / CVE

Communicating with People (2.8)

GPU Acceleration of Matrix Algebra. Dr. Ronald C. Young Multipath Corporation. fmslib.com

CHAPTER 1 INTRODUCTION

A Single/Double Precision Floating-Point Reciprocal Unit Design for Multimedia Applications

EECS150 - Digital Design Lecture 09 - Parallelism

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141

AGDC Per-Pixel Shading. Sim Dietrich

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007

Application Note. November 28, Software Solutions Group

High Speed Pipelined Architecture for Adaptive Median Filter

A study on SIMD architecture

Computer Processors. Part 2. Components of a Processor. Execution Unit The ALU. Execution Unit. The Brains of the Box. Processors. Execution Unit (EU)

S U N G - E U I YO O N, K A I S T R E N D E R I N G F R E E LY A VA I L A B L E O N T H E I N T E R N E T

12.1 Chapter Overview Tables Function Computation via Table Look-up. Calculation Via Table Lookups

Computer Graphics. Coordinate Systems and Change of Frames. Based on slides by Dianna Xu, Bryn Mawr College

16.317: Microprocessor Systems Design I Spring 2015

Datapoint 2200 IA-32. main memory. components. implemented by Intel in the Nicholas FitzRoy-Dale

[Sahu* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

Design and Optimization of Geometry Acceleration for Portable 3D Graphics

SOEN228, Winter Revision 1.2 Date: October 25,

Porting and Optimizing Multimedia Codecs for AMD64 architecture on Microsoft Windows

Machine and Assembly Language Principles

Evaluating MMX Technology Using DSP and Multimedia Applications

Registers. Ray Seyfarth. September 8, Bit Intel Assembly Language c 2011 Ray Seyfarth

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Reverse Engineering II: The Basics

Intel Architecture MMX Technology

CSCE 212H, Spring 2008 Lab Assignment 3: Assembly Language Assigned: Feb. 7, Due: Feb. 14, 11:59PM

Implementation of a backprojection algorithm on CELL

Lecture 12. Motivation. Designing for Low Power: Approaches. Architectures for Low Power: Transmeta s Crusoe Processor

CS146 Computer Architecture. Fall Midterm Exam

Current Trends in Computer Graphics Hardware

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

Systems Architecture I

Low-Level Essentials for Understanding Security Problems Aurélien Francillon

RAID 0 (non-redundant) RAID Types 4/25/2011

Transcription:

Accelerating 3D Geometry Transformation with Intel MMX TM Technology ECE 734 Project Report by Pei Qi Yang Wang - 1 -

Content 1. Abstract 2. Introduction 2.1 3 -Dimensional Object Geometry Transformation 2.2 Intel MMX TM Technology 3. Implementation 3.1 Execution before optimization 3.2 Optimization by reversing registers 3.3 Optimization by re-ordering instructions 4. Simulation 4.1 Correctness 4.2 Performance 5. Conclusion 6. References 7. Percentage Contributions - 2 -

Abstract Three-dimensional (3D) graphics geometry transformation involves very heavy computation load running on micro computer system. The appearance of Intel MMX TM Technology increases the processing power dramatically and thereby speeds up the 3D geometry transformation as well. In this project, the implementation of 3D geometry transformation is implemented under MMX framework and the potential optimization at instruction level is examined to further speed up the execution of MMX instructions code. Reversing (renaming) registers and reordering instructions are two mainly typical methods used to pair instructions such that they could be executed within the same cycle in pipelined processor. Finally, the correctness and performance of our approach is tested through two simulations under Matlab and PLX framework respectively. At last, conclusion can be made that MMX TM Technology effectively accelerates 3D geometry transformation by providing multimedia extensions that are tailored to 2D or 3D graphics processing. In addition to that, through hand optimizing the assembly code, the execution time would be reduced by pairing more instructions and executing them in parallel. - 3 -

Introduction 2.1 Three-dimensional Geometry Transformation The three-dimensional objects are usually represented by thousands of polygons. Each vertex of polygons can be represented by a 4-element vector. X Y Z W Fig. 1 Representation of 3D Object X, Y, and Z are coordinates of each polygon in the 3D space. W contains perspective corrective information, which is essentially unused in our project for simplicity. When a 3D object performs a geometry transformation, such as shift, rotation, expansion and so on, we have to recalculate polygon shape which is actually matrix multiplication. V = M V T X a a a a X T 0 1 2 3 Y b T 0 b1 b2 b 3 Y = Z c c c c Z T 0 1 2 3 W d0 d1 d2 d3 1 T O X = a X + a Y + a Z + a 1 T 0 1 2 3 Y = b X + b Y + b Z + b 1 T 0 1 2 3 Z = c X + c Y + c Z + c 1 T 0 1 2 3 W = d X + d Y + d Z + d 1 T 0 1 2 3 Fig. 2 Matrix-based 3D Geometry Transformation Where, V o is the original vertex vector, M is the transformation matrix, and V T is the transformed vertex vector. Here we do not investigate the details about transformation itself, like how to construct transformation matrix M etc. The only concern here is the computation issue involved with transformation. From the above equations, obviously this operation will definitely incur a computation overhead since each vertex pixel requires 16 multiplies and 12 additions. However, there is also a lot of inherent parallelism which could be exploited to speedup the recalculation process. - 4 -

2.2 Intel MMX TM Technology The Intel Architecture MMX TM Technology provides a significant increase in processing power, through the use of its new single-instruction multiple-data (SIMD) extensions. MMX instruction set features 57 SIMD-type instructions and eight 64-bit wide MMX registers. In addition to that, 4 new data types are introduced into MMX instruction set to accelerate the processing by calculating in parallel. Thus, two 32-bit, four 16-bit, or eight 8-bit integer values can be operated on at once. In our project, we mainly investigate the PMADDWD instruction, which is new in MMX instruction and means Packed Multiply and Add, and testify how this instruction speeds up the computation. a3 a2 a1 a0 * * * * b3 b2 b1 b0 a3*b3+a2*b2 a1*b1+a0*b0 Fig. 3 PMADDWD Instruction PMADDWD instruction can perform both multiply and add on four 16-bit values simultaneously. Therefore, it is an excellent match for the four-element multiply/adds operation, such as 3D geometry transformation. Thus, this instruction reduces the previous workload to 4 multipliers and 2 adders for each of vertex pixel. - 5 -

Implementation 3.1 Execution before optimization In the routine executing 3D geometry transformation, the critical section is the multiplication of two matrices: M and V o. The code snippet of the below demonstrates the essence of the execution of transformation using MMX TM instruction set. mov eax,[esp]+ 4 mov ebx,[esp]+ 8 mov ecx,[esp]+12 mov edx,[esp]+16 16bits 16bits 16bits 16bits a0 a1 a2 a3 movq,mm3 movq mm0, 0[eax] NextVect: movq mm3,[ebx] movq mm4,mm3 pmaddwd mm4,mm0 movq mm5,mm4 psrlq mm5,32 paddd mm4,mm5 moved [edx],mm4 Fig. 4 Code Snippet X Y Z 1 movq,mm0 pmaddwd a0*x + a1*y a2*z + a3*1 mm4 a0*x + a1*y movq,mm5 movq,mm4 (a0*x+a1*y) + (a2*z+a3*1) movq,edx Fig. 5 Corresponding Register Status The above code is a demonstration of MMX TM instruction. mm0 ~ mm5 are temporary MMX registers for execution. First, the code pushes in pixel value of a single point x, y, z. and a correction vector 1 together into a 64 bit register mm0. It also takes in a single row of the transformation matrix M and stored it in a 64 bit register named mm3. A copy of mm3 is made and called mm4. mm4 was used to do PMADDWD instruction with mm0. After the instruction is done, the result is stored back into mm4. A copy of mm4 is made and called mm5. mm5 is shifted 32 bit to the right. mm4 and mm5 is add together and the information is stored in mm4. At the end, mm4, the result is stored in temporary pointer edx for further use. - 6 -

Optimization The speed of MMX routine, like other assembly code routines, could benefit greatly from hand optimization. Reversing or renaming registers and reordering instructions are two usually used methods to accelerate the execution of code. The basic idea is to make better use of the super-scalar pipelines, which enables two or more instructions to be executed simultaneously. In other words, we can execute two or more instruction within the same clock cycle and thereby reduce the total execution time eventually. 3.2 Reversing Registers The purpose of reversing registers is to eliminate the data dependency caused by unintended using temporary registers or using them in inappropriate sequence. A typical problem in this regard is shown as the following: 1 movq mm3, [ebx] 2 movq mm4, mm3 3 pmaddwd mm4 mm0 1 movq mm4, [ebx] 2 movq mm3, mm4 3 pmaddwd mm4 mm0 Fig. 6 Before Reversing Fig. 7 After Reversing Fig.6 and Fig.7 will yield the completely same result. In Fig.6, instruction 2 reads from mm3, and then writes to mm4, while instruction 3 reads from mm0 and writes to mm4. Obviously it is impossible to execute instruction 2 and 3 simultaneously since both of them write to mm4, which is a typical W-W data dependency. However in Fig.7, after reversing registers mm3 and mm4, instruction 2 read first from mm4 to mm3 and then instruction 3 writes to mm4. The previous data dependency no longer existed. We are able to pair the instruction 2 and 3 and execute them simultaneously. The cycles needed to execute these instructions are also reduced to 2 from early 3 and the execution time is also decreased consequently. This example clearly demonstrates that how to eliminate data dependency by simply reversing involved registers. 3.3 Re-ordering instructions Reversing registers can only eliminate data dependency locally in assembly code. However when facing the whole segment of code, we need to consider re-ordering the - 7 -

execution of instructions and thereby break the whole dependency chains affiliated to the present sequence in which instructions are executed. The ultimate goal is to get more instruction paired. NextVect: 1 movq mm4, [ebx] 2 movq mm3, mm4 3 pmaddwd mm4, mm0 4 movq mm5, mm4 5 psrlq mm4, 32 6 paddd mm5, mm4 7 movd [edx], mm5 8 movq mm4, mm3 9 pmaddwd mm4, mm1 10 movq mm5, mm4 11 psrlq mm4, 32 12 paddd mm5, mm4 13 movd [edx+2], mm5 14 movq mm4, mm3 15 pmaddwd mm4, mm2 16 movq mm5, mm4 17 psrlq mm4, 32 18 paddd mm5, mm4 19 movd [edx+4], mm5 20 add ebx, 8 21 add edx, 6 22 dec ecx 1 23 jnz NextVect NextVect: 1 movq mm3, [ebx] 2 movq mm4, mm3 3 movq mm5, mm4 4 pmaddwd mm3, mm0 5 pmaddwd mm4, mm1 6 pmaddwd mm5, mm2 7 movq mm6, mm3 8 psrlq mm3, 32 9 paddd mm3, mm6 10 movq mm6, mm4 11 psrlq mm4, 32 12 paddd mm4, mm6 13 movq mm6, mm5 14 psrlq mm5, 32 15 paddd mm5, mm6 16 movd [edx], mm3 17 movd [edx+2], mm4 18 movd [edx+4], mm5 19 add edx, 6 20 add ebx, 8 1-1 21 dec ecx 1 22 jnz NextVect 1-1 Fig. 8 Before Re-ordering Fig.9 After Re-ordering Fig.9 demonstrates that how re-order the sequence of execution shown in Fig.8 and thereby get more instructions paired. A very important issue must be presented at this point that all the optimizations cannot affect or change the correctness of original code. This is the prerequisite for any optimization and we develop a Matlab program to do the same operation and then compare the results from two executions. This concern will be more discussed in next section. - 8 -

Simulation 4.1 Correctness We just mentioned in last section, correctness is our first concern when doing optimization. In our project, we compare the two results. One was simulating the execution of optimized code in instruction form. The other was directly multiplying transformation matrix M and original vertex vector V o. Our optimized code yielded the same result with later and therefore testified the correctness of the optimized code. The only exception is when a very large value is used for coordinates X, Y or Z, the optimized code showed different result due to the truncation applied to keep the 16-bit integer representation. 4.2 Performance Improvement In our simulation, we testified the performance of optimized code measured with execution time and cycle consumption using Matlab and PLX framework respectively. Matlab Test Case Vertices number: 320 Execution Time (sec) Un-Optimized Optimized 0.644 0.6118 Vertices number: 3200 5.6943 5.601 Vertices number: 32000 78.0654 77.0027 Fig. 10 Execution Time Comparison Using Matlab From the above table, we found that the execution time was reduced after doing optimization. Especially this trend became more obvious when the vertices number was increased. - 9 -

PLX Test Case Single set instruction iteration Cycles Un-Optimized Optimized 60 52 Fig. 11 Cycle Count Using PLX Figure 11 shows cycle count comparison between un-optimized result and optimized one. The result shows a 8 cycle reduction in a single instruction iteration. This number will become more significant when multiple iterations executed. Since this number increase linearly, the result of multiple iterations is provided here. - 10 -

Conclusion The new instructions in MMX Technology can effectively accelerate 3D geometry transformation. In our project, we first illustrated how PMADDWD instruction can handle and save computation in hardware level. We then simulated our implementation with Matlab and PLX simulators. By rearranging instruction execution time, we can further reduce the total number of clock cycle, which makes our program more efficient. Both simulation results show clock cycle reduction. As a result, these results proves that we can improve the performance of execution through some optimizations at instruction level, such as reversing register, re-ordering instruction without sacrificing the correctness of code. - 11 -

References [1] W. Ma, C. Yang, Using Intel SIMD Extension for 3D Geometry Processing, 2000 L. Gwennap, Intel s MMX Speeds Multimedia, Vol 10(3), 1996 [2] C. Yang, B. Sano, A. R. Lebeck, Expliting Parallelism in Geometry Processing with General Purpose Processors and Floating-Point SIMD Instructions, Vol 49(9), September 2000 [3] MMX TM Technology Application Notes, Intel, Using MMX TM Instructions to Perform 3D Geometry Transformation [4] H. Nguyan, L. K. John, Exploiting SIMD parallelism in DSP and Multimedia Algorithm Using AltiVec Technology, In ICS99, 1999 [5] K. Akelay, T. Jermoluk, Hight-Performance Polygon Rendering, In Computer Graphics, pages 239-246, 1988-12 -

Percentage Contributions Tasks performed Pei Qi Yang Wang Literature search 50% 50% Method discussion 50% 50% Power point presentation 50% 50% Matlab programs 50% 50% Testing case preparation 50% 50% Report writing 50% 50% Overall 50% 50% Team Member#1: Pei Qi Signature Team Member#2: Yang Wang Signature - 13 -