SSE and SSE2
Timothy A. Chagnon, 18 September 2007
All images from Intel 64 and IA-32 Architectures Software Developer's Manuals

Overview
- SSE: Streaming SIMD (Single Instruction, Multiple Data) Extensions
  - Successor to the 64-bit MMX integer extensions and AMD's 3DNow!
- SSE: introduced in 1999 with the Pentium III
  - 128-bit packed single-precision floating point
  - 8 128-bit registers xmm0-xmm7, holding 4 floats each
  - Additional 8 on x86-64 (xmm8-xmm15)
- SSE2: introduced with the Pentium 4
  - Added double-precision FP and several integer types
- SSE3 & SSE4: not yet widespread
  - More ops: DSP, permutations, fixed point, dot product...

Data Types
- Instructions end in D, S, I, Q, etc. to designate the operand type
- Conversion instructions: CVTxx2xx
- (SSE and SSE2 data-type diagrams from the manual omitted)

Instruction Basics
- Packed vs. Scalar ops
- Most operate on 2 arguments: OP xmm_dest, xmm/m128_src
  - In-place operation on dest
- Instruction varieties:
  - MOV: loads, stores
  - Arithmetic & Logical
  - Shuffle & Unpack
  - CVT: conversion
  - Cache & Ordering Control

Arithmetic & Logical Instructions
- ADD, SUB, MUL, DIV, SQRT, MAX, MIN, CMP
- Can take a memory reference as the 2nd argument
- All put results into the 1st argument
- CMP can be made to use EFLAGS
- Full instruction names are OP+TYPE...
  - ADDPD: add packed double
  - ANDSS: and scalar single
  - etc.

  MOVAPD xmm0, [eax]
  MOVAPD xmm1, [eax+16]
  MOVAPD xmm2, xmm0
  ADDPD  xmm2, xmm1
  SUBPD  xmm0, xmm1
  MOVAPD [eax], xmm2
  MOVAPD [eax+16], xmm0

Moving Data to/from Memory
- MOVAPD: move aligned
- MOVUPD: move unaligned
- Forms: reg from reg, reg from mem, mem from reg
- Memory should be 16-byte aligned if possible
  - Starts at a 128-bit boundary (in virtual? physical? addressing)
  - Special types and attributes are used in C/C++
- MOVUPD takes much longer than MOVAPD
  - Intel's optimization manual says it's actually faster to do:

      MOVSD    xmm0, [mem]
      MOVSD    xmm1, [mem+8]
      UNPCKLPD xmm0, xmm1

Shuffle & Unpack
- SHUFPD xmm0, xmm1, pattern
- UNPCKHPD xmm0, xmm1
- UNPCKLPD xmm0, xmm1
- Combine parts of 2 vectors, in place
- Value(s) from the 1st argument always go into the low half of the dest.
- Note that the UNPCKs are just a shortcut for SHUF
  - But is it faster or slower?

C/C++ Intrinsics
- Both Intel's ICC and GCC have intrinsic functions to access the SSE instructions
- More or less a 1:1 mapping of instructions
- Instead of in-place, they look like 2-in, 1-out functions
  - Does this indicate that penalties are incurred for unnecessary register copies/loads/stores? Or are these optimized away? ...It appears that way.
- MOV is split up into load, store, and set

  __m128d _mm_add_pd(__m128d a, __m128d b)
    Adds the two DP FP values of a and b.
      r0 := a0 + b0
      r1 := a1 + b1

  __m128d _mm_load_pd(double const *dp)  (uses MOVAPD)
    Loads two DP FP values. The address dp must be 16-byte aligned.
      r0 := dp[0]
      r1 := dp[1]

Example

  #include <stdio.h>
  #include <emmintrin.h>

  int main(void) {
      int i;
      __m128d a, b, c;
      double x0[2] __attribute__((aligned(16))) = {1.2, 3.5};
      double x1[2] __attribute__((aligned(16))) = {-0.7, 2.6};

      a = _mm_load_pd(x0);
      b = _mm_load_pd(x1);
      c = _mm_add_pd(a, b);
      _mm_store_pd(x0, c);

      for (i = 0; i < 2; i++)
          printf("%.2f\n", x0[i]);
      return 0;
  }

Efficient Use
- Use 16-byte aligned memory
- Compiler-driven automated vectorization is limited
- Use struct-of-arrays (SoA) instead of array-of-structs (AoS) for 3D vectors
  - i.e. block across multiple entities and do normal operations

To Investigate
- Denormals and underflow can cause penalties (1500? cycles)
- Are intrinsics efficient? Do they use direct memory refs?
- SHUF vs. UNPCK?
- Cache utilization, ordering instructions...
- x87 FPU and SSE are disjoint and can be mixed...
- MMX registers can be used for shuffle or copy
- Intel Core architecture has more efficient SIMD than NetBurst
- Direct memory references
- Trace cache and reorder buffer effects on loop unrolling
- Non-temporal stores
- Prefetch

SHUF vs. UNPCK
- The Intel Optimization Manual shows that on the NetBurst architecture (Pentium 4, Xeon), UNPCK is sometimes faster
  - Probably because there's no decoding of the immediate field
- gcc 4.x will convert a call to the shuffle intrinsic into an UNPCK instruction if it can
- Core architecture (Core Duo, Pentium M): (latency table from the manual omitted)
- Notes on the manual's tables:
  - Table data represents usage of register arguments; delays from direct memory references depend heavily on cache state
  - DisplayFamily_DisplayModel values represent variation across different processors: Family 0F is the NetBurst architecture, Family 06 is the Intel Core architecture
  - NetBurst models are in the range 00H to 06H; data for 03H applies to 04H and 06H
  - Pentium M models are 09H and 0DH; Core Solo/Duo is 0EH, Core is 0FH

References
- Wikipedia: Streaming SIMD Extensions
  http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
- Intel 64 and IA-32 Architectures Software Developer's Manuals
  http://www.intel.com/products/processor/manuals/index.htm
- Intel C++ Compiler Documentation
  http://www.intel.com/software/products/compilers/docs/clin/main_cls/index.htm
- Apple Developer Connection: SSE Performance Programming
  http://developer.apple.com/hardwaredrivers/ve/sse.html