Dan Stafford, Justine Bonnot


1 Dan Stafford, Justine Bonnot

2 Background
- Applications
- Timeline
- MMX
- 3DNow!
- Streaming SIMD Extensions: SSE, SSE2, SSE3 and SSSE3, SSE4
- Advanced Vector Extensions: AVX, AVX2, AVX-512
- Compiling with x86 Vector Processing Extensions
- Vector Processing Today

3 Background
- Exploits data-level parallelism
- Reduces stalls from branches
- Equivalent to loop unrolling
- [Figure: scalar processing vs. vector processing]
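The "equivalent to loop unrolling" point can be sketched in plain C. This is a minimal illustration rather than code from the slides, and the function names `add_scalar` and `add_unrolled4` are hypothetical. The unrolled body performs four independent additions per iteration, which is the work a 4-lane vector add could do in a single instruction.

```c
#include <stddef.h>

/* Scalar version: one addition and one loop branch per element. */
void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* Unrolled by four: one loop branch per four elements. The four
 * additions do not depend on each other, so a SIMD unit could
 * execute them as one vector instruction. */
void add_unrolled4(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i + 0] = a[i + 0] + b[i + 0];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++) { /* leftover elements when n is not a multiple of 4 */
        out[i] = a[i] + b[i];
    }
}
```

Both functions produce the same output; the second simply takes a quarter as many loop branches.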


4 Background
- Scalar processor: SISD instructions; data in scalar registers
- (Full) vector processor: SIMD instructions; data in vector registers
- Vector processing extension: SIMD instructions; data in scalar registers, with the vector held inside a register that is divided into separate components
- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- [Figure: instruction, data, and results flow for a (full) vector processor and for a vector processing extension]

5 Applications
- Multimedia processing
- Compression
- Graphics
- Image processing
- Simulations
- Engineering tools
- CAD
- Cryptography
- Etc.

6 Timeline
- MMX: Intel, 1997
- 3DNow!: AMD, 1998
- SSE: Intel, 1999
- AVX: Intel and AMD, 2008

7 MMX
- Matrix Math Extensions
- Launched by Intel in 1997
- Pentium II
- Eight 64-bit integer registers
- Aliased with the x87 floating-point registers
- [Figure: 64-bit register divided into 8 bytes, 4 words, or 2 double words]

8 3DNow!
- Extension of MMX by AMD in 1998
- AMD K6-2
- Registers shared with MMX and the x87 FPU
- 21 single-precision floating-point instructions
- Discontinued in later AMD processors
- [Figure: 64-bit register divided into 8 bytes, 4 words, 2 double words, or 2 single-precision values]

9 SSE
- Introduced by Intel in 1999 with the Pentium III
- Pentium III = Pentium II + SSE
- Intel's answer to AMD's 3DNow!
- Katmai New Instructions (KNI)
- 70 new instructions
  - Single-precision floating point
  - A few additional integer instructions
- Eight new 128-bit registers
- [Figure: 128-bit register divided into 4 single-precision values]
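As a minimal sketch of what the 128-bit registers allow, the following adds two float arrays four lanes at a time with SSE intrinsics from `<xmmintrin.h>`. The function name `add_ps_sse` is hypothetical, and the scalar fallback is an assumption added only so the sketch also builds where SSE is unavailable.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Adds two float arrays four lanes at a time in 128-bit XMM registers. */
void add_ps_sse(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i); /* load 4 floats, no alignment needed */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* 4 additions at once */
    }
    for (; i < n; i++) { /* scalar tail for the last n % 4 elements */
        out[i] = a[i] + b[i];
    }
}
#else
/* Portable fallback so the sketch still builds on targets without SSE. */
void add_ps_sse(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
#endif
```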

10 SSE2
- Willamette New Instructions
- Intel Pentium 4
- New instructions
- Double-precision (64-bit) support
- Extends MMX to use SSE registers
- Replaces MMX
- [Figure: 128-bit register divided into 8 words, 4 double words, 4 single-precision values, or 2 double-precision values]
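The two SSE2 additions named above, double precision and integer operations in the SSE registers, might look like this with intrinsics from `<emmintrin.h>`. The function names `add_pd_x2` and `add_epi16_x8` are hypothetical, and the scalar fallback is an assumption for targets without SSE2.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */

/* Two double-precision additions in one 128-bit register. */
void add_pd_x2(const double a[2], const double b[2], double out[2]) {
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}

/* Eight 16-bit integer additions in one 128-bit register; a 64-bit
 * MMX register holds only four such words. Wraps on overflow. */
void add_epi16_x8(const int16_t a[8], const int16_t b[8], int16_t out[8]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
}
#else
/* Portable fallback for targets without SSE2. */
void add_pd_x2(const double a[2], const double b[2], double out[2]) {
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
}

void add_epi16_x8(const int16_t a[8], const int16_t b[8], int16_t out[8]) {
    for (int i = 0; i < 8; i++) {
        out[i] = (int16_t)(a[i] + b[i]);
    }
}
#endif
```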

11 SSE3 and SSSE3
- SSE3
  - Prescott New Instructions (PNI)
  - New instructions
  - DSP and 3D focused
  - Can iterate horizontally (within a register) as well as vertically (across registers) in an instruction
- SSSE3
  - Supplemental SSE3
  - Merom New Instructions (MNI)
  - New instructions
  - Byte permutations
  - Fixed-point multiplication with rounding
  - Within-word accumulate
- [Figure: 128-bit register divided into 8 words, 4 double words, 4 single-precision values, or 2 double-precision values]
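The horizontal versus vertical distinction can be sketched with the SSE3 horizontal add. A vertical add combines lane i of one register with lane i of another; a horizontal add combines neighbouring lanes of the same register. The function name `hadd4` is hypothetical, and the fallback is an assumption that spells out the same lane arithmetic when the file is not compiled with SSE3 enabled.

```c
#if defined(__SSE3__)
#include <pmmintrin.h> /* SSE3 intrinsics */

/* Horizontal add: out = { a0+a1, a2+a3, b0+b1, b2+b3 }. */
void hadd4(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_hadd_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
#else
/* Fallback with the same lane semantics, written out in scalar code.
 * Assumes out does not alias a or b. */
void hadd4(const float a[4], const float b[4], float out[4]) {
    out[0] = a[0] + a[1];
    out[1] = a[2] + a[3];
    out[2] = b[0] + b[1];
    out[3] = b[2] + b[3];
}
#endif
```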

12 SSE4
- SSE4.1
  - Penryn New Instructions (PNI), 2007
  - Sum of absolute differences
  - Dot products
  - Floating-point rounding
  - Blending
  - Packed operations
- SSE4.2
  - Nehalem processors, 2008
  - STTNI: String and Text New Instructions
  - CRC
- [Figure: 128-bit register divided into 8 words, 4 double words, 4 single-precision values, or 2 double-precision values]
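As one example of the SSE4.1 additions, a 4-element dot product can be written with the `_mm_dp_ps` intrinsic. This is a sketch under assumptions: the function name `dot4` is hypothetical, and the scalar fallback is used whenever the file is not compiled with SSE4.1 enabled.

```c
#if defined(__SSE4_1__)
#include <smmintrin.h> /* SSE4.1 intrinsics */

/* Dot product of two 4-float vectors in a single instruction. */
float dot4(const float a[4], const float b[4]) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    /* Mask 0xF1: multiply all four lanes (high nibble 0xF) and
     * write the sum to lane 0 only (low nibble 0x1). */
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
}
#else
/* Scalar fallback computing the same sum of products. */
float dot4(const float a[4], const float b[4]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
#endif
```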

13 AVX
- Proposed by Intel and AMD, March 2008
- Intel Sandy Bridge processor
- AMD Bulldozer processor
- VEX coding prefixes
  - 3-operand instructions
- 256-bit registers
- VEX encoding is also supported on legacy SSE instructions
- SSE instructions still only use 128-bit registers
- [Figure: 256-bit register divided into double-word or single-precision values, or into double-precision values]
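A minimal sketch of the wider registers and the 3-operand form, using AVX intrinsics from `<immintrin.h>`. The function name `mul_ps_avx` is hypothetical, and the scalar fallback is an assumption used when the file is not compiled with AVX enabled.

```c
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h> /* AVX intrinsics */

/* Multiplies two float arrays eight lanes at a time in 256-bit registers. */
void mul_ps_avx(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        /* The VEX-encoded multiply takes a separate destination operand,
         * so neither source register has to be overwritten. */
        __m256 prod = _mm256_mul_ps(va, vb);
        _mm256_storeu_ps(out + i, prod);
    }
    for (; i < n; i++) { /* scalar tail for the last n % 8 elements */
        out[i] = a[i] * b[i];
    }
}
#else
/* Scalar fallback for builds without AVX enabled. */
void mul_ps_avx(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] * b[i];
    }
}
#endif
```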

14 AVX2
- Haswell New Instructions
- Intel Haswell processor, 2013
- Additions
  - Extends AVX and SSE integer instructions to 256 bits
  - General-purpose bit manipulation and multiply
  - Fused multiply-add (FMA3): d = round(a × b + c)
  - Gather: vector equivalent of register-indirect addressing (scatter arrived later, with AVX-512)
  - Permutations
  - Vector shifts
- [Figure: 256-bit register divided into double-word or single-precision values, or into double-precision values]
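Two of the Haswell-era additions can be sketched with intrinsics: fused multiply-add, which computes a × b + c with a single rounding, and gather, which loads lanes from indexed addresses. The function names `fmadd_pd_x4` and `gather_ps_x8` are hypothetical, and the scalar fallback is an assumption used when the file is not compiled with AVX2 and FMA enabled.

```c
#include <stdint.h>

#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h> /* AVX2 and FMA intrinsics */

/* d[i] = a[i] * b[i] + c[i] for four doubles, rounded once per lane. */
void fmadd_pd_x4(const double a[4], const double b[4],
                 const double c[4], double d[4]) {
    __m256d r = _mm256_fmadd_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b),
                                _mm256_loadu_pd(c));
    _mm256_storeu_pd(d, r);
}

/* out[i] = table[idx[i]] for eight floats: indexed loads in one instruction. */
void gather_ps_x8(const float *table, const int32_t idx[8], float out[8]) {
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    /* Scale 4: indices count floats, addresses count bytes. */
    _mm256_storeu_ps(out, _mm256_i32gather_ps(table, vidx, 4));
}
#else
/* Scalar fallback. Unlike a true FMA, the multiply and the add may
 * each round separately here. */
void fmadd_pd_x4(const double a[4], const double b[4],
                 const double c[4], double d[4]) {
    for (int i = 0; i < 4; i++) {
        d[i] = a[i] * b[i] + c[i];
    }
}

void gather_ps_x8(const float *table, const int32_t idx[8], float out[8]) {
    for (int i = 0; i < 8; i++) {
        out[i] = table[idx[i]];
    }
}
#endif
```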

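15 AVX-512
- Intel Knights Landing processor
  - 2nd-generation Xeon Phi processors
  - Scheduled for 2016
- Supports Enhanced Vector Extension (EVEX) encoding
- 512-bit registers
- Up to 4-operand instructions
- 8 opmask registers (k0 to k7), of which 7 can be used as write masks
- Explicit rounding control
- Compressed displacement addressing mode
- [Figure: 512-bit register divided into double-word or single-precision values, or into double-precision values]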
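The opmask registers let an instruction update only selected lanes. A minimal sketch, assuming the hypothetical function name `masked_add16` and a 16-bit mask in which bit i controls lane i; the scalar fallback is an assumption used when the file is not compiled with AVX-512F enabled.

```c
#include <stdint.h>

#if defined(__AVX512F__)
#include <immintrin.h> /* AVX-512F intrinsics */

/* For each lane i with mask bit i set: out[i] = a[i] + b[i].
 * Lanes with a clear bit keep their previous value in out. */
void masked_add16(const float a[16], const float b[16],
                  float out[16], uint16_t mask) {
    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    __m512 keep = _mm512_loadu_ps(out); /* pass-through values */
    __m512 r = _mm512_mask_add_ps(keep, (__mmask16)mask, va, vb);
    _mm512_storeu_ps(out, r);
}
#else
/* Scalar fallback with the same per-lane masking behaviour. */
void masked_add16(const float a[16], const float b[16],
                  float out[16], uint16_t mask) {
    for (int i = 0; i < 16; i++) {
        if ((mask >> i) & 1u) {
            out[i] = a[i] + b[i];
        }
    }
}
#endif
```

Masking of this kind replaces a per-element branch with a single predicated instruction.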

16 Compiling with x86 Vector Processing Extensions
- Cannot be used by every application
- Unrolls loops, which saves time
- Loads a whole array at once instead of executing several loads

17 Compiling with x86 Vector Processing Extensions
- Most compilers do not support vector processing
- Programs have to be vectorized by hand
- Problems can occur with memory alignment
- The data to process has to be known in advance

18 Compiling with x86 Vector Processing Extensions
- Memory has to be carefully aligned
- Newer compilers support compiling from high-level languages
  - Intel Compiler Suite: AVX
  - GCC 4.9: AVX-512
  - -m[sse, avx, avx512f, etc.]
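The alignment requirement can be made concrete with an aligned load. This is a sketch under assumptions: the names `is_aligned` and `sum_aligned` are hypothetical, the buffer is expected to be 16-byte aligned (for example declared with C11 `_Alignas(16)`), and the scalar fallback is used where SSE is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is aligned to `alignment` bytes (a power of two). */
int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>

/* Sums n floats. Requires a 16-byte-aligned pointer and n % 4 == 0,
 * because _mm_load_ps faults on a misaligned address. */
float sum_aligned(const float *a, size_t n) {
    assert(is_aligned(a, 16));
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        acc = _mm_add_ps(acc, _mm_load_ps(a + i)); /* aligned load */
    }
    _Alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc); /* aligned store of the 4 partial sums */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
#else
/* Scalar fallback; the alignment requirement does not apply here. */
float sum_aligned(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}
#endif
```

With GCC, an extension is enabled per compilation through the `-m` options listed above, for example `gcc -O3 -mavx2 file.c`, where `file.c` stands for whatever source file is being built.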

19 Vector Processing Today
- Where are vector processors today?
  - Gone
  - High bandwidth
  - Custom designed and costly
- Supercomputers now use multiple CPU and GPU cores
  - Cheaper
  - Lower bandwidth
- National Energy Research Scientific Computing Center: Cori
  - Will have Knights Landing Xeon Phis with AVX-512
