SIMD Programming. Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna.


Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/


Credits
Marat Dukhan (Georgia Tech) http://www.cc.gatech.edu/grads/m/mdukhan3/
Salvatore Orlando (Univ. Ca' Foscari di Venezia)

Single Instruction Multiple Data
In the SIMD model, the same operation can be applied to multiple data items. This is usually realized through special instructions that work with short, fixed-length arrays. E.g., SSE and ARM NEON can work with 4-element arrays of 32-bit floats.
[Figure: a scalar add produces one result (a = 13.0, b = 12.3, a + b = 25.3), while a single SIMD add sums four pairs of lanes at once: a = {13.0, 7.0, -3.0, 2.0}, b = {12.3, 4.4, 13.2, -2.0}, a + b = {25.3, 11.4, 10.2, 0.0}.]

Programming for SIMD
From high level to low level:
- Compiler auto-vectorization
- Optimized SIMD libraries (ATLAS, FFTW)
- Domain Specific Languages (DSLs) for SIMD programming, e.g., the Intel SPMD Program Compiler
- Compiler-dependent vector data types
- Compiler SIMD intrinsics
- Assembly language

A note of caution
Most (all?) modern processors have SIMD instructions that can provide a considerable performance improvement: up to 4x/8x theoretical speedup for 4-/8-lane SIMD, and superlinear speedup can be observed in some cases. However:
- Compilers are not very smart at auto-vectorization.
- Using SIMD instructions by hand is cumbersome, error-prone and non-portable: each processor has its own set of SIMD instructions, and sometimes different models from the same processor family have different SIMD instruction sets (yes, I'm looking at you, Intel).

Intel SIMD extensions timeline
- MMX [1993]: 8 64-bit vector registers; supports only integer operations.
- SSE [1999]: 8 or 16 128-bit vector registers; supports single-precision floating point operations.
- SSE2 [2000]: 8 or 16 128-bit vector registers; adds double-precision floating point operations. Later extensions: SSE3, SSSE3, SSE4.1, SSE4.2.
- AVX [2008]: supported in the Intel Sandy Bridge processors and later; extends the vector registers to 16 registers of length 256 bits.
- AVX2 [2013]: introduced in the Haswell microarchitecture; extends most integer instructions to the 16 256-bit registers.
- AVX-512 [2015]: proposed in July 2013 and supported since 2015 with Intel's Knights Landing processor.

Intel SSE/AVX
- SSE: 16 128-bit SIMD registers XMM0–XMM15 (in x86-64 mode)
- AVX2: 16 256-bit SIMD registers YMM0–YMM15
- AVX-512: 32 512-bit SIMD registers ZMM0–ZMM31
[Figure: SSE/AVX data types. A 128-bit register (bits 127..0) can hold 16x chars, 8x 16-bit words, 4x 32-bit doublewords, 2x 64-bit quadwords, 1x 128-bit doublequadword, 4x floats or 2x doubles. A 256-bit register (bits 255..0) can hold 8x floats or 4x doubles.]
Source: https://software.intel.com/sites/default/files/m/d/4/1/d/8/intro_to_intel_avx.pdf

ARM NEON
The NEON register bank consists of 32 64-bit registers. The NEON unit can view the same register bank as:
- 16 128-bit quadword registers Q0–Q15
- 32 64-bit doubleword registers D0–D31
[Figure: a 128-bit NEON register (bits 127..0) can hold 16x 8-bit ints, 8x 16-bit ints, 4x 32-bit ints, 2x 64-bit ints, or 4x floats.]
See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/index.html

Is it worth the effort? In some cases, yes.
Example: BML cellular automaton (N = 1024, ρ = 0.3, steps = 1024). CPU: Xeon E3-1220 @ 3.10GHz, 4 cores (no HT), GCC 4.8.4, 16GB RAM; GPU: NVIDIA Quadro K620 (384 CUDA cores), nvcc V8.0.61.
Speedup with respect to serial (higher is better):

    Serial                       1
    OpenMP (4 cores)             2.76
    OpenMP + halo (4 cores)      5.16
    CUDA (no shmem)              8.42
    CUDA (shmem + optim)        34.53
    SIMD x16 (1 core)           66.52
    SIMD x16 (4 cores)         180.56

Superlinear speedup?
Scalar loop: every iteration pays the loop Init/Test overhead for one unit of work (do_work is a placeholder for the loop body):

    for (i=0; i<n; i++) {
        do_work(i);
    }

Unrolled loop: one Init/Test per four units of work, plus code to handle leftovers:

    for (i=0; i<n-3; i+=4) {
        do_work(i); do_work(i+1); do_work(i+2); do_work(i+3);
    }
    /* handle leftovers ... */

SIMD loop: one Init/Test per SIMD instruction that processes four elements at once; since the loop overhead is amortized over four lanes, the measured speedup can exceed the lane count:

    for (i=0; i<n-3; i+=4) {
        SIMD_do_work(i, ..., i+3);
    }
    /* handle leftovers ... */

Checking for SIMD support on Linux
cat /proc/cpuinfo
On Intel:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts
On ARM:
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32

Vectorization opportunities

    float vsum(float *v, int n)
    {
        float s = 0.0;
        int i;
        for (i=0; i<n; i++)
            s += v[i];
        return s;
    }

[Figure, shown step by step over several slides: the 16-element array v = {3, -1, 2, 0, -4, 7, -5, 11, 2, 8, 21, 3, -3, 7, 11, 12} is summed four lanes at a time. A vector accumulator s starts at {0, 0, 0, 0}; each step adds the next four elements of v lane-wise, producing {3, -1, 2, 0}, then {-1, 6, -3, 11}, then {1, 14, 18, 14}, then {-2, 21, 29, 26}; a final horizontal sum of the four lanes yields 74.]

Vectorization opportunities
In our example, care must be taken if the array length is not a multiple of the SIMD vector length. Two options:
- Padding: add dummy elements at the beginning/end to make the array length a multiple of the SIMD vector length. Not always feasible: it requires modifying the input, and choosing the values of the extra elements can be tricky and is problem-dependent.
- Handle the leftovers with scalar operations: possible (small?) performance hit, and redundant code that is error-prone if done by hand.

Not really SIMD (it assumes 4-lane SIMD), but it exposes the parallelism:

    float vsum(float *v, int n)
    {
        float vs[4] = {0.0, 0.0, 0.0, 0.0};
        float s = 0.0;
        int i;
        for (i=0; i<n-4; i += 4) {
            vs[0] += v[i  ];
            vs[1] += v[i+1];
            vs[2] += v[i+2];
            vs[3] += v[i+3];    /* i.e., vs[0:3] += v[i:i+3] */
        }
        s = vs[0] + vs[1] + vs[2] + vs[3];
        /* Handle leftover */
        for ( ; i<n; i++) {
            s += v[i];
        }
        return s;
    }

Auto-vectorization

Compiler auto-vectorization
To enable auto-vectorization with GCC:
    -O2 -ftree-vectorize                            (you might need to turn on optimization to enable auto-vectorization)
    -fopt-info-vec-optimized -fopt-info-vec-missed  (debugging output to see what gets vectorized and what does not)
    -march=native                                   (autodetect and use the SIMD instructions available on your platform)
Warning: if you enable platform-specific optimizations, your binary will not run on less capable processors!
To see which flags are enabled with -march=native:
    gcc -march=native -Q --help=target

Selection of the target architecture, the hard way
On Intel:
    -msse      emits SSE instructions
    -msse2     emits SIMD instructions up to SSE2
    ...
    -msse4.2   emits SIMD instructions up to SSE4.2
    -mavx      emits SIMD instructions up to AVX (incl. SSEx)
    -mavx2     emits SIMD instructions up to AVX2
On ARM (Raspberry Pi 2/3):
    -march=armv7-a -mfpu=neon -mvectorize-with-neon-quad
On ARM (NVIDIA Jetson TK1 dev. board):
    -march=armv7 -mfpu=neon -mvectorize-with-neon-quad

Auto-vectorization
Let's try GCC's auto-vectorization capabilities:

    gcc -march=native -O2 -ftree-vectorize -fopt-info-vec-optimized \
        -fopt-info-vec-missed simd-vsum-auto.c -o simd-vsum-auto
    simd-vsum-auto.c:45:5: note: step unknown.
    simd-vsum-auto.c:45:5: note: reduction: unsafe fp math optimization: s_10 = _9 + s_14;
    simd-vsum-auto.c:45:5: note: not vectorized: unsupported use in stmt.
    simd-vsum-auto.c:54:5: note: Unknown misalignment, is_packed = 0
    simd-vsum-auto.c:54:5: note: virtual phi. skip.
    simd-vsum-auto.c:54:5: note: loop vectorized

Auto-vectorization
GCC was unable to vectorize this loop:

    float vsum(float *v, int n)
    {
        float s = 0.0;
        int i;
        for (i=0; i<n; i++)
            s += v[i];
        return s;
    }

The output gives us a hint:
    simd-vsum-auto.c:45:5: note: reduction: unsafe fp math optimization: s_10 = _9 + s_14;

Auto-vectorization
In this case, GCC is smarter than we are: the familiar rules of infinite-precision math do not hold for finite-precision math. FP addition is not associative, i.e., (a + b) + c can produce a different result than a + (b + c). Therefore, GCC by default does not apply optimizations that violate FP safety. We can tell GCC to optimize anyway with -funsafe-math-optimizations.

Bingo!

    gcc -funsafe-math-optimizations -march=native -O2 -ftree-vectorize \
        -fopt-info-vec-optimized -fopt-info-vec-missed simd-vsum-auto.c -o simd-vsum-auto
    simd-vsum-auto.c:45:5: note: loop vectorized
    simd-vsum-auto.c:54:5: note: loop vectorized

Auto-vectorization
You can look at the generated assembly output with:

    gcc -S -c -funsafe-math-optimizations -march=native -O2 -ftree-vectorize \
        simd-vsum-auto.c -o simd-vsum-auto.s

The xxxPS instructions are those dealing with Packed Single-precision SIMD registers:

    .L4:
        movq    %rcx, %r8
        addq    $1, %rcx
        salq    $5, %r8
        vaddps  (%rdi,%r8), %ymm0, %ymm0
        cmpl    %ecx, %edx
        ja      .L4
        vhaddps %ymm0, %ymm0, %ymm0
        vhaddps %ymm0, %ymm0, %ymm1
        vperm2f128 $1, %ymm1, %ymm1, %ymm0
        vaddps  %ymm1, %ymm0, %ymm0
        cmpl    %esi, %eax
        je      .L17

Auto-Vectorization: the Good, the Bad, the Ugly

Vector data types

Vector data types
Some compilers support vector data types. Vector data types are (as the name suggests) small vectors of some numeric type, typically char, int, float or double. They are non-portable, compiler-specific extensions. Ordinary arithmetic operations (sum, product...) can be applied to vector data types; the compiler emits the appropriate SIMD instructions for the target architecture if available, and equivalent scalar code otherwise.

Defining vector data types

    /* v4i is a vector of elements of type int; variables of
       type v4i occupy 16 bytes of contiguous memory */
    typedef int v4i __attribute__((vector_size(16)));

    /* v4f is a vector of elements of type float; variables of
       type v4f occupy 16 bytes of contiguous memory */
    typedef float v4f __attribute__((vector_size(16)));

The length (num. of elements) of a v4f is:

    #define VLEN (sizeof(v4f)/sizeof(float))

It is also possible to define vector types of arbitrary length:

    /* v8f is a vector of elements of type float; variables of
       type v8f occupy 32 bytes of contiguous memory */
    typedef float v8f __attribute__((vector_size(32)));

If the target architecture does not support SIMD registers of the specified length, the compiler takes care of that, e.g., by using multiple SIMD instructions on shorter vectors.

Usage

    /* simd-vsum-vector.c */
    typedef float v4f __attribute__((vector_size(16)));
    #define VLEN (sizeof(v4f)/sizeof(float))

    float vsum(float *v, int n)
    {
        v4f vs = {0.0f, 0.0f, 0.0f, 0.0f};
        v4f *vv = (v4f*)v;
        int i;
        float s = 0.0f;
        /* the (int) cast avoids unsigned wrap-around when n < VLEN-1 */
        for (i=0; i<n-(int)VLEN+1; i += VLEN) {
            vs += *vv;   /* variables of type v4f support the usual arithmetic operators */
            vv++;
        }
        s = vs[0] + vs[1] + vs[2] + vs[3];   /* ...and can be indexed like standard arrays */
        for ( ; i<n; i++) {
            s += v[i];
        }
        return s;
    }

Vector data types
GCC allows the following operators on vector data types: +, -, *, /, unary minus, ^, |, &, ~, %
It is also possible to use a binary vector operation where one operand is a scalar:

    typedef int v4i __attribute__((vector_size (16)));
    v4i a, b, c;
    long x;
    a = b + 1;    /* OK: a = b + {1,1,1,1};            */
    a = 2 * b;    /* OK: a = {2,2,2,2} * b;            */
    a = 0;        /* Error: conversion not allowed     */
    a = x + a;    /* Error: cannot convert long to int */

https://gcc.gnu.org/onlinedocs/gcc/vector-extensions.html

Vector data types
Vector comparison is supported with the standard comparison operators: ==, !=, <, <=, >, >=. Vectors are compared element-wise: the result element is 0 where the comparison is false and -1 where it is true.

    typedef int v4i __attribute__((vector_size (16)));
    v4i a = {1, 2, 3, 4};
    v4i b = {3, 2, 1, 4};
    v4i c;
    c = (a > b);     /* Result: {0, 0, -1, 0}  */
    c = (a == b);    /* Result: {0, -1, 0, -1} */

https://gcc.gnu.org/onlinedocs/gcc/vector-extensions.html

Note on memory alignment
Some versions of GCC emit assembly code for dereferencing a pointer to a vector data type that only works if the memory address is 16B aligned (more details later on). malloc() may or may not return a pointer that is properly aligned. From the man page:
    "The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably aligned for any built-in type."
A 16-byte vector is not a built-in type, so this guarantee is not enough.

Ensuring proper alignment
For data on the stack:

    /* __BIGGEST_ALIGNMENT__ is 16 for SSE, 32 for AVX; it is therefore the
       preferred choice, as it is automatically defined to suit the target */
    float v[1024] __attribute__((aligned(__BIGGEST_ALIGNMENT__)));

For data on the heap:

    #define _XOPEN_SOURCE 600   /* do this at the very beginning, before including
                                   anything else; better yet, compile with the
                                   -D_XOPEN_SOURCE=600 flag to define the symbol
                                   compilation-wide */
    #include <stdlib.h>

    float *v;
    posix_memalign((void **)&v,          /* result pointer                */
                   __BIGGEST_ALIGNMENT__, /* alignment                     */
                   1024);                 /* number of bytes to allocate   */

SIMDizing branches
Branches (if-then-else) are difficult to SIMDize, since the SIMD programming model assumes that the same operation is applied to all elements of a SIMD register. How can we SIMDize the following code fragment?

    int a[4] = { 12, -7, 2, 3 };
    int i;
    for (i=0; i<4; i++) {
        if ( a[i] > 0 ) {
            a[i] = 2;
        } else {
            a[i] = 1;
        }
    }

    v4i a = { 12, -7, 2, 3 };
    v4i vtrue  = {2, 2, 2, 2};
    v4i vfalse = {1, 1, 1, 1};
    v4i mask = (a > 0);    /* mask = {-1, 0, -1, -1} */
    a = (vtrue & mask) | (vfalse & ~mask);

[Figure: vtrue & mask = {2, 0, 2, 2}; vfalse & ~mask = {0, 1, 0, 0}; their bitwise OR gives the result {2, 1, 2, 2}.]

Data layout: SoA vs AoS
Array of Structures (AoS):

    typedef struct {
        float x, y, z, t;
    } point3d;
    #define NP 1024
    point3d particles[NP];

Memory layout: x y z t | x y z t | x y z t ...

Structure of Arrays (SoA):

    #define NP 1024
    float particles_x[NP];
    float particles_y[NP];
    float particles_z[NP];
    float particles_t[NP];

Memory layout: x[0] x[1] x[2] x[3] ...; each coordinate is contiguous in memory.

Programming with SIMD intrinsics

SIMD intrinsics
The compiler exposes all low-level SIMD operations through C functions: each function maps to the corresponding low-level SIMD instruction. In other words, you are programming in assembly. SIMD intrinsics are platform-dependent, since different processors have different instruction sets. However:
- SIMD intrinsics are machine-dependent but compiler-independent.
- Vector data types are machine-independent but compiler-dependent.

Vector sum with SSE intrinsics

    /* simd-vsum-intrinsics.c */
    #include <x86intrin.h>

    float vsum(float *v, int n)
    {
        __m128 vv, vs;
        float s = 0.0f;
        int i;
        vs = _mm_setzero_ps();           /* vs = {0.0f, 0.0f, 0.0f, 0.0f} */
        for (i=0; i<n-4+1; i += 4) {
            vv = _mm_loadu_ps(&v[i]);    /* Load Unaligned Packed Single-precision */
            vs = _mm_add_ps(vv, vs);
        }
        s = vs[0] + vs[1] + vs[2] + vs[3];   /* variables of type __m128 can be
                                                indexed like vectors (GCC) */
        for ( ; i<n; i++) {
            s += v[i];
        }
        return s;
    }

SSE intrinsics
#include <x86intrin.h> provides three new data types:
- __m128: four single-precision floats
- __m128d: two double-precision floats
- __m128i: integers; can be viewed as 2x 64-bit, 4x 32-bit, 8x 16-bit or 16x 8-bit integers

SSE memory operations (u = unaligned)
- __m128 _mm_loadu_ps(float *aptr): load 4 floats starting from memory address aptr.
- __m128 _mm_load_ps(float *aptr): load 4 floats starting from memory address aptr; aptr must be a multiple of 16.
- _mm_storeu_ps(float *aptr, __m128 v): store 4 floats from v to memory address aptr.
- _mm_store_ps(float *aptr, __m128 v): store 4 floats from v to memory address aptr; aptr must be a multiple of 16.
There are other intrinsics for load/store of doubles/ints.

How values are stored in memory
An SSE SIMD register holding {a0, a1, a2, a3} is stored in memory as a0, a1, a2, a3 at increasing memory addresses: a0 at x, a1 at x+4, a2 at x+8, a3 at x+12. In the register, a0 occupies the least significant lane (bits 31..0) and a3 the most significant (bits 127..96).

Some SSE arithmetic operations
Do operation <op> on a and b, write result to c: __m128 c = _mm_<op>_ps(a, b);
- Addition: c = a + b           c = _mm_add_ps(a, b);
- Subtraction: c = a - b        c = _mm_sub_ps(a, b);
- Multiplication: c = a * b     c = _mm_mul_ps(a, b);
- Division: c = a / b           c = _mm_div_ps(a, b);
- Minimum: c = fmin(a, b)       c = _mm_min_ps(a, b);
- Maximum: c = fmax(a, b)       c = _mm_max_ps(a, b);
- Square root: c = sqrt(a)      c = _mm_sqrt_ps(a);

Load constants
- v = _mm_set_ps(a, b, c, d): v = {a, b, c, d}, with d in the lowest lane
- v = _mm_set1_ps(a): v = {a, a, a, a}
- v = _mm_setzero_ps(): v = {0.0, 0.0, 0.0, 0.0}
- float f = _mm_cvtss_f32(v): extract the lowest lane of v (d in the first example)

Resources
Intel Intrinsics Guide https://software.intel.com/sites/landingpage/intrinsicsguide/