Dan Stafford, Justine Bonnot

- Background
- Applications
- Timeline
- MMX
- 3DNow!
- Streaming SIMD Extension: SSE, SSE2, SSE3 and SSSE3, SSE4
- Advanced Vector Extension: AVX, AVX2, AVX-512
- Compiling with x86 Vector Processing Extensions
- Vector Processing Today

Background
- Exploits data-level parallelism
- Reduces stalls from branches
- Equivalent to loop unrolling
- Figure: scalar processing vs. vector processing
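To make the loop-unrolling comparison concrete, here is a minimal sketch in C that adds two float arrays, first one element at a time and then four elements at a time with SSE intrinsics. The function names `add_scalar` and `add_sse` are hypothetical, chosen for illustration, and the vector version assumes the length is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar version: one addition (and one loop branch) per element. */
void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* SIMD version: one instruction adds four floats, so the loop runs
   n/4 times. Assumes n is a multiple of 4 (hypothetical restriction
   to keep the sketch short). */
void add_sse(const float *a, const float *b, float *out, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats, any alignment */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
}
```

Both functions produce the same results; the vector version simply executes a quarter of the loop iterations, which is why it behaves like a four-way unrolled loop.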

Scalar vs. vector processing
- Scalar processor: SISD (Single Instruction, Single Data); scalar registers
- (Full) vector processor: SIMD (Single Instruction, Multiple Data); vector registers
- Vector processing extension: SIMD; scalar registers used as a vector inside the register, divided into separate components
- Figure: in SISD, one instruction operates on one data item and produces one result; in SIMD, one instruction operates on multiple data items and produces multiple results

Applications
- Multimedia processing
- Compression
- Graphics
- Image processing
- Simulations
- Engineering tools (CAD)
- Cryptography
- Etc.

Timeline
- MMX: Intel, 1997
- 3DNow!: AMD, 1998
- SSE: Intel, 1999
- AVX: Intel and AMD, 2008

MMX
- Matrix Math Extensions
- Launched by Intel in 1997, first in the Pentium with MMX and then in the Pentium II
- 8 64-bit integer registers
- Aliased with the x87 floating point registers
- Register layout: each 64-bit register holds 8 bytes, 4 words, or 2 double words

3DNow!
- Extension of MMX introduced by AMD in 1998 with the K6-2
- Registers shared with MMX and the x87 FPU
- 21 new instructions, mostly single-precision floating point
- Discontinued after 2010
- Register layout: each 64-bit register holds 8 bytes, 4 words, 2 double words, or 2 single-precision values

SSE
- Introduced by Intel in 1999 with the Pentium III
- Pentium III = Pentium II + SSE
- Intel's answer to AMD's 3DNow!
- Katmai New Instructions (KNI)
- 70 new instructions
  - Single-precision floating point
  - A few additional integer instructions
- 8 new 128-bit registers
- Register layout: each 128-bit register holds 4 single-precision values

SSE2
- Willamette New Instructions
- Intel Pentium 4, launched in late 2000
- 144 new instructions
- Double-precision (64-bit) support
- Extends the MMX integer instructions to use the 128-bit SSE registers
- Replaces MMX
- Register layout: each 128-bit register holds 8 words, 4 double words, 4 single-precision values, or 2 double-precision values
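As an illustration of the two SSE2 additions listed above (double precision, and MMX-style integer operations on the 128-bit registers), here is a small sketch in C. The function names `add2_f64` and `adds8_i16` are hypothetical; the intrinsics come from `<emmintrin.h>`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Adds two pairs of doubles with one instruction (2 x 64-bit lanes). */
void add2_f64(const double *a, const double *b, double *out) {
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}

/* Adds eight 16-bit words with signed saturation (8 x 16-bit lanes).
   Results that would overflow clamp to INT16_MAX or INT16_MIN, which
   is the behavior usually wanted for audio and pixel data. */
void adds8_i16(const int16_t *a, const int16_t *b, int16_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epi16(va, vb));
}
```

The same 128-bit register is interpreted as two doubles in the first function and as eight words in the second; the instruction, not the register, determines the lane width.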

SSE3 and SSSE3
- SSE3
  - Prescott New Instructions (PNI), 2004
  - 13 new instructions
  - DSP and 3D focused
  - Can operate horizontally (across the lanes of one register) as well as vertically (between matching lanes of two registers)
- SSSE3
  - Supplemental SSE3
  - Merom New Instructions (MNI), 2006
  - 16 new instructions
  - Byte permutations
  - Fixed-point multiplication with rounding
  - Within-word accumulate
- Register layout: each 128-bit register holds 8 words, 4 double words, 4 single-precision values, or 2 double-precision values
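The horizontal vs. vertical distinction can be sketched as follows. This is an illustrative example, not taken from the slides: the function names are hypothetical, and the `target("sse3")` attribute is a GCC/Clang feature used here so the file compiles without `-msse3` on the command line.

```c
#include <pmmintrin.h>  /* SSE3 intrinsics (also pulls in SSE and SSE2) */

/* Vertical add: lane i of the result is a[i] + b[i]. */
void add_vertical(const float *a, const float *b, float *out) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Horizontal add (SSE3 HADDPS): adjacent lanes inside each input are
   summed, giving { a0+a1, a2+a3, b0+b1, b2+b3 }. */
__attribute__((target("sse3")))
void add_horizontal(const float *a, const float *b, float *out) {
    _mm_storeu_ps(out, _mm_hadd_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

For a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the vertical add yields {11, 22, 33, 44} while the horizontal add yields {3, 7, 30, 70}.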

SSE4
- SSE4.1
  - Penryn New Instructions (PNI), 2007
  - Sum of absolute differences
  - Dot products
  - Floating point rounding
  - Blending
  - Packed operations
- SSE4.2
  - Nehalem processors, 2008
  - STTNI: String and Text New Instructions
  - CRC32
- Register layout: each 128-bit register holds 8 words, 4 double words, 4 single-precision values, or 2 double-precision values

AVX
- Proposed by Intel and AMD in March 2008
- Intel Sandy Bridge processor, 2011
- AMD Bulldozer processor, 2011
- VEX coding prefixes
- 3-operand instructions
- 16 256-bit registers
- Extension supported on legacy SSE instructions
  - SSE instructions still only use 128-bit registers
- Register layout: each 256-bit register holds 8 double words or single-precision values, or 4 double-precision values
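A minimal sketch of using the 256-bit registers from C, assuming GCC or Clang on x86. The function names are hypothetical. The `target("avx")` attribute and `__builtin_cpu_supports` are compiler-specific features used here so that the AVX path is only taken on processors (and operating systems) that support it.

```c
#include <immintrin.h>  /* AVX intrinsics */
#include <stddef.h>

/* AVX path: eight single-precision additions per instruction. The
   VEX-encoded VADDPS takes three operands, so neither input register
   is overwritten by the result. */
__attribute__((target("avx")))
static void add_avx(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++) {  /* scalar tail for the leftover elements */
        out[i] = a[i] + b[i];
    }
}

/* Fallback path for processors without AVX. */
static void add_plain(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* Chooses the implementation at run time. */
void add_dispatch(const float *a, const float *b, float *out, size_t n) {
    if (__builtin_cpu_supports("avx")) {
        add_avx(a, b, out, n);
    } else {
        add_plain(a, b, out, n);
    }
}
```

Run-time dispatch is one common design choice; compiling the whole program with `-mavx` is simpler but produces a binary that will not run on older processors.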

AVX2
- Haswell New Instructions
- Intel Haswell processor, 2013
- Additions
  - Extends the AVX and SSE integer instructions to 256 bits
  - General-purpose bit manipulation and multiply
  - Fused Multiply Add (FMA3): d = round(a × b + c), with a single rounding step
  - Gather-Scatter: vector equivalent of register indirect addressing
  - Permutations
  - Vector shifts
- Register layout: each 256-bit register holds 8 double words or single-precision values, or 4 double-precision values
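The formula d = round(a × b + c) means the product is not rounded before the addition. The sketch below, with hypothetical function names, contrasts a fused multiply-add with a separate multiply and add. It assumes GCC or Clang on x86; the fused path must only be called when `cpu_has_fma()` returns nonzero.

```c
#include <immintrin.h>  /* FMA intrinsics */

/* Fused: a*b + c with one rounding at the end (VFMADD...SD). */
__attribute__((target("fma")))
double fused_madd(double a, double b, double c) {
    __m128d r = _mm_fmadd_sd(_mm_set_sd(a), _mm_set_sd(b), _mm_set_sd(c));
    return _mm_cvtsd_f64(r);
}

/* Separate: the product is rounded to a double, then the sum is
   rounded again. The volatile store keeps the compiler from fusing
   the two operations on its own. */
double separate_madd(double a, double b, double c) {
    volatile double p = a * b;
    return p + c;
}

/* Nonzero if the processor supports FMA3. */
int cpu_has_fma(void) {
    return __builtin_cpu_supports("fma") != 0;
}
```

With a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the exact product is 1 - 2^-54, which rounds to 1.0 as a double. The separate version therefore returns 0, while the fused version keeps the full product and returns -2^-54.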

AVX-512
- Intel Knights Landing processor
  - 2nd-generation Xeon Phi processors
  - Scheduled for 2016
- Uses the EVEX (Enhanced VEX) encoding prefix
- 32 512-bit registers
- Up to 4-operand instructions
- 8 new opmask registers (k0-k7), of which 7 can be used as write masks
- Explicit rounding control
- Compressed displacement addressing mode
- Register layout: each 512-bit register holds 16 double words or single-precision values, or 8 double-precision values
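The opmask registers let an instruction update only selected lanes. The following is a minimal sketch under stated assumptions: GCC or Clang on x86, hypothetical function names, and a scalar fallback so the code still runs on processors without AVX-512F.

```c
#include <immintrin.h>  /* AVX-512 intrinsics */
#include <stdint.h>

/* AVX-512 path: lanes whose mask bit is 1 receive a[i] + b[i]; lanes
   whose mask bit is 0 keep the value from the first argument (here a). */
__attribute__((target("avx512f")))
static void masked_add_avx512(const float *a, const float *b, float *out,
                              uint16_t mask) {
    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    __m512 r = _mm512_mask_add_ps(va, (__mmask16)mask, va, vb);
    _mm512_storeu_ps(out, r);
}

/* Scalar fallback with the same semantics. */
static void masked_add_scalar(const float *a, const float *b, float *out,
                              uint16_t mask) {
    for (int i = 0; i < 16; i++) {
        out[i] = ((mask >> i) & 1) ? a[i] + b[i] : a[i];
    }
}

/* Processes exactly 16 floats, picking the implementation at run time. */
void masked_add16(const float *a, const float *b, float *out, uint16_t mask) {
    if (__builtin_cpu_supports("avx512f")) {
        masked_add_avx512(a, b, out, mask);
    } else {
        masked_add_scalar(a, b, out, mask);
    }
}
```

With a mask of 0x00FF, only the lower eight lanes are added; the upper eight are copied from `a` unchanged.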

- Cannot be used by all applications
- Saves time by unrolling loops
- Loads a whole array segment with a single instruction instead of executing several loads

- Most compilers do not support vector processing
  - The program has to be written by hand
- Problems can arise from memory alignment
- The data to process has to be known in advance

- Memory has to be carefully aligned
- Newer compilers support compiling from high-level languages
  - Intel Compiler Suite 11.1: AVX
  - GCC 4.9: AVX-512
  - GCC flags: -msse, -mavx, -mavx512f, etc.
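The alignment requirement can be sketched in C as follows. The function names are hypothetical. `_mm_load_ps` requires a 16-byte-aligned address and faults otherwise, while `_mm_loadu_ps` accepts any address, so the sketch checks the pointer once and picks the matching load. A file like this would typically be built with something like `gcc -O2 -msse2 file.c`; on x86-64 targets SSE2 is enabled by default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Nonzero if p is aligned to a 16-byte boundary. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}

/* Sums n floats, four at a time. Assumes n is a multiple of 4
   (hypothetical restriction to keep the sketch short). */
float sum_sse(const float *x, size_t n) {
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    int aligned = is_aligned16(x);
    for (size_t i = 0; i < n; i += 4) {
        /* Aligned load only when the address allows it. */
        __m128 v = aligned ? _mm_load_ps(x + i) : _mm_loadu_ps(x + i);
        acc = _mm_add_ps(acc, v);
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Declaring a buffer with `__attribute__((aligned(16)))` (GCC/Clang) is one way to guarantee that the aligned path is taken.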

Vector Processing Today
- Where are vector processors today? Gone
  - High bandwidth
  - Custom designed and costly
- Supercomputers now use multiple CPU and GPU cores
  - Cheaper
  - Lower bandwidth
- National Energy Research Scientific Computing Center
  - Cori will have Knights Landing Xeon Phis with AVX-512