SWAR: MMX, SSE, SSE 2 Multiplatform Programming

Similar documents
History of the Intel 80x86

SSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals

Instruction Set extensions to X86. Floating Point SIMD instructions

Intel 64 and IA-32 Architectures Software Developer s Manual

Intel s MMX. Why MMX?

Intel 64 and IA-32 Architectures Software Developer s Manual

Instruction Set Progression. from MMX Technology through Streaming SIMD Extensions 2

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

IA-32 Intel Architecture Software Developer s Manual

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

EJEMPLOS DE ARQUITECTURAS

Dan Stafford, Justine Bonnot

OpenCL Vectorising Features. Andreas Beckmann

Chapter 06: Instruction Pipelining and Parallel Processing. Lesson 14: Example of the Pipelined CISC and RISC Processors

IA-32 Intel Architecture Software Developer s Manual

CS Bootcamp x86-64 Autumn 2015

CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions

17. Instruction Sets: Characteristics and Functions

Review of Last Lecture. CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions. Great Idea #4: Parallelism.

Computer System Architecture

CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions

IA-32 Intel Architecture Software Developer s Manual

CS802 Parallel Processing Class Notes

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

COMPUTER ORGANIZATION & ARCHITECTURE

Optimization of Lattice QCD codes for the AMD Opteron processor

Computer Systems A Programmer s Perspective 1 (Beta Draft)

CS241 Computer Organization Spring Introduction to Assembly

A Multiprocessor system generally means that more than one instruction stream is being executed in parallel.

High Performance Computing and Programming 2015 Lab 6 SIMD and Vectorization

COE608: Computer Organization and Architecture

Martin Kruliš, v

UNIT 2 PROCESSORS ORGANIZATION CONT.

Overview Implicit Vectorisation Explicit Vectorisation Data Alignment Summary. Vectorisation. James Briggs. 1 COSMOS DiRAC.

Code optimization techniques

1 Overview of the AMD64 Architecture

Vector Processors. Kavitha Chandrasekar Sreesudhan Ramkumar

Application Note. November 28, Software Solutions Group

CSCI 402: Computer Architectures. Instructions: Language of the Computer (4) Fengguang Song Department of Computer & Information Science IUPUI

The CPU and Memory. How does a computer work? How does a computer interact with data? How are instructions performed? Recall schematic diagram:

Assembly Language - SSE and SSE2. Floating Point Registers and x86_64

Assembly Language - SSE and SSE2. Introduction to Scalar Floating- Point Operations via SSE

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

Computer Labs: Profiling and Optimization

Credits and Disclaimers

Credits and Disclaimers

Assembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture Included elements of the IA-64 bit

Intel Architecture Software Developer s Manual

CMSC Lecture 03. UMBC, CMSC313, Richard Chang

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts

Computer Organization CS 206 T Lec# 2: Instruction Sets

COSC 6385 Computer Architecture. Instruction Set Architectures

Algorithms and Computation in Signal Processing

SIMD Programming CS 240A, 2017

IA-32 Architecture COE 205. Computer Organization and Assembly Language. Computer Engineering Department

Chapter 2. lw $s1,100($s2) $s1 = Memory[$s2+100] sw $s1,100($s2) Memory[$s2+100] = $s1

CS 265. Computer Architecture. Wei Lu, Ph.D., P.Eng.

2. Define Instruction Set Architecture. What are its two main characteristics? Be precise!

Computer Systems Laboratory Sungkyunkwan University

Figure 1: 128-bit registers introduced by SSE. 128 bits. xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7

SIMD Instructions outside and inside Oracle 12c. Laurent Léturgez 2016

Instruction Sets: Characteristics and Functions

CS 261 Fall Mike Lam, Professor. x86-64 Data Structures and Misc. Topics

Instruction Sets: Characteristics and Functions Addressing Modes

Typical Processor Execution Cycle

55:132/22C:160, HPCA Spring 2011

Instruction Set Architecture

Introduction to Machine/Assembler Language

Computer Labs: Profiling and Optimization

Fixed-Point Math and Other Optimizations

ECE232: Hardware Organization and Design

Machine-level Representation of Programs. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Review: Parallelism - The Challenge. Agenda 7/14/2011

Evaluating MMX Technology Using DSP and Multimedia Applications

Instruction Set Architecture (ISA)

Chapter 2. Instructions: Language of the Computer. HW#1: 1.3 all, 1.4 all, 1.6.1, , , , , and Due date: one week.

Systems Architecture I

What Transitioning from 32-bit to 64-bit x86 Computing Means Today

AMD Extensions to the. Instruction Sets Manual

Instruction Set Architectures

Using MMX Technology in Digital Image Processing (Technical Report and Coding Examples) TR-98-13

Topics Power tends to corrupt; absolute power corrupts absolutely. Computer Organization CS Data Representation

Intel SIMD architecture. Computer Organization and Assembly Languages Yung-Yu Chuang

Intel SIMD architecture. Computer Organization and Assembly Languages Yung-Yu Chuang 2006/12/25

Run time environment of a MIPS program

Parallelized Progressive Network Coding with Hardware Acceleration

Improving graphics processing performance using Intel Cilk Plus

Math 230 Assembly Programming (AKA Computer Organization) Spring 2008

Team 1. Common Questions to all Teams. Team 2. Team 3. CO200-Computer Organization and Architecture - Assignment One

CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Data Level Parallelism

Intel X86 Assembler Instruction Set Opcode Table

ECE 486/586. Computer Architecture. Lecture # 8

Moving from 32 to 64 bits while maintaining compatibility. Orlando Ricardo Nunes Rocha

Outline. What Makes a Good ISA? Programmability. Implementability. Programmability Easy to express programs efficiently?

8/16/12. Computer Organization. Architecture. Computer Organization. Computer Basics

Roadmap. Java: Assembly language: OS: Machine code: Computer system:

Assembly Language Programming 64-bit environments

Lecture Topics. Branch Condition Options. Branch Conditions ECE 486/586. Computer Architecture. Lecture # 8. Instruction Set Principles.

Computer System Architecture

Analyzing Cache Bandwidth on the Intel Core 2 Architecture

Transcription:

SWAR: MMX, SSE, SSE 2 Multiplatform Programming Relatore: dott. Matteo Roffilli roffilli@csr.unibo.it 1

What s SWAR? SWAR = SIMD Within A Register SIMD = Single Instruction Multiple Data MMX,SSE,SSE2,Power3DNow = Intel\AMD implementation of SWAR 2

Why SWAR? Enable parallel computing on commercial hardware Boost 2x your application at 0 cost Enjoy yourself learning assembler 3

Why not SWAR? Hard low-level programming Require know out of assembler Not portable code (yet!) 4

Pentium I X86 Instruction Set General Purpose Instruction X87 FPU Instruction System Instruction 5

General Purpose Instruction Data transfer Binary integer arithmetic Decimal arithmetic Logic operations Shift and rotate Bit and byte operations Program control String Flag control Segment register operations 6

X87 FPU Instruction Floating-point Integer Binary-coded decimal (BCD) operands 7

SIMD Instruction SIMD: Single Instruction Multiple Data MMX, SSE and SSE2 instruction provides a group of instructions that perform SIMD operations on packed integer and/or packed floatingpoint data elements contained in the 64-bit MMX, the 128-bit XMM or 128-bit MMX registers. enables increased performance on a wide variety of multimedia and communications applications. 8

What s new in Pentium II Pentium II=Pentium I + MMX MMX : MultiMedia Extensions 57 new instructions Eight 64-bit wide MMX registers Four new data types 9

What s new in Pentium III Pentium III=Pentium II + SSE SSE : Streaming SIMD Extensions 70 new instructions three categories: SIMD-Floating Point New Media Instruction Streaming Memory Instruction 10

The implementation of SSE SSE has 128-bit architectural width Double-cycling the existing 64-bit data paths. Deliver a realized 1.5 2x speedup Only have 10% die size overhead 11

SSE in details 1 Eight 128-bits registers, named XMM0 to XMM7. The principal data type used by SSE instructions is a packed single floating point operand; practically, four 32-bits encoded floats are packed in each of the eight XMM registers SSE instructions set includes some additional integer instructions that operate on classic 64- bits MMX registers (MM0 to MM7) 12

SSE in details 2 1. Arithmetic instructions 2. Logical instructions 3. Compare instructions 4. Shuffle instructions 5. Conversion instructions 6. Data movement instructions 7. State management instructions 8. Additional SIMD integer instructions 9. Cacheability control instructions 13

What s new in Pentium 4 Pentium 4=Pentium III + SSE2 SSE2-Streaming SIMD Extensions 2 144 new instructions for Double Precision FP. Cache and Memory Management. Enable 3-D Graphics. Video Decoding/Encoding/ Speech Recognition 128 bit MMX Registers. Separate Register for Data Movement 14

SIMD-FP Instruction SIMD feature introduce a new register file containing eight 128-bit registers Capable of holding a vector of four IEEE single precision FP data elements Allow four single precision FP operations to be carried out within a single instruction 15

Aligned or not aligned? Why use memory aligned? Fast memory-to-register load Require dedicated malloc function Why use memory not aligned? Slow memory-to-register load Fast integration into existing code Not require dedicated malloc function 16

MMX Repetition helps learning! 54 instructions for INTEGER SSE 70 instructions for FLOAT SSE2 144 instructions for DOUBLE 17

Portable Code Writing At&T vs Intel syntax Linux vs Windows syntax GNU Gcc 2.95 vs MS Visual Studio 6 18

How to write SWAR code? Intel compiler 6 Linux/Windows Direct Assembly Inline Assembly Link portable library (libsimd) 19

Real problem Intel Compiler: is not FREE! Direct Asm: too hard for newmbies! libsimb: under developing yet! Inline Asm: not portable but easy! 20

Linux Inline Assembly 1. asm volatile ( 2. /* load vector a[1-4] in 128bit reg */ 3. "movups %0, %%xmm0\n\t" 4. "movups %1, %%xmm1\n\t" 5. /* multiply */ 6. "mulps %%xmm1,%%xmm0 \n\t" 7. : 8. :"m"(a[0]),"m"(a[4]) 9. ); 21

Windows Inline Assembly 1. asm { 2. /* load vector a[1-4] in 128bit reg */ 3. mov edi, a 4. movups xmm0, [edi] 5. movups xmm1, [edi+16] 6. /* multiply */ 7. mulps xmm0, xmm1 8. } 22

Side by side 1. asm volatile ( 2. movups %0, %%xmm0\n\t 3. "mulps %%xmm1,%%xmm0 \n\t 1. asm { 2. mov edi, a 3. movups xmm0, [edi] 4. mulps xmm0, xmm1 4. : 5. :"m"(a[0]),"m"(a[4]) 6. ); 5. } 23

How to compile? Kernel 2.4.x MS Visual Studio 6 Gcc 2.95 / 3.x Binutils >2.13 Assembly Power Pack 6 24

How to debug? Directly in assembler! 25

Online Resurce Intel tutorial AMD tutorial and source code Sourceforge.net IBM whitepapers MS tutorial SPRITE Linux Day Proceding 26

THANKS FOR YOU ATTENTION! Some questions? Relatore: dott. Matteo Roffilli roffilli@csr.unibo.it 27