Dan Stafford, Justine Bonnot

Outline
- Background
- Applications
- Timeline: MMX, 3DNow!, Streaming SIMD Extensions (SSE, SSE2, SSE3 and SSSE3, SSE4), Advanced Vector Extensions (AVX, AVX2, AVX-512)
- Compiling with x86 Vector Processing Extensions
- Vector Processing Today

Background
- Exploits data-level parallelism
- Reduces stalls from branches
- Equivalent to loop unrolling
[Figure: scalar processing vs. vector processing]

Scalar processor, vector processor, and vector processing extension
- Scalar processor: SISD (Single Instruction, Single Data); scalar registers
- (Full) vector processor: SIMD (Single Instruction, Multiple Data); vector registers
- Vector processing extension: SIMD; scalar registers, with the vector held inside a register that is divided into separate components
[Figure: in SISD, one instruction operates on one data item and produces one result; in SIMD, one instruction operates on multiple data items and produces multiple results.]
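
To make the SISD/SIMD contrast concrete, here is a minimal sketch in C. It assumes an x86 compiler that provides the SSE intrinsics header `<xmmintrin.h>`; the function names `add_scalar` and `add_simd` are made up for this illustration and do not come from the slides. On other targets the vector loop is skipped and only the scalar tail runs.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Scalar (SISD) version: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* SIMD version: four additions per instruction where SSE is available. */
void add_simd(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0;
#if HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);    /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; i++)                      /* leftover elements */
        out[i] = a[i] + b[i];
}
```

Both functions produce the same results; the vector version does the work of four scalar iterations per instruction, which is why the technique is often compared to loop unrolling.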

Applications
- Multimedia processing
- Compression
- Graphics
- Image processing
- Simulations
- Engineering tools (CAD)
- Cryptography
- Etc.

Timeline
- MMX: Intel, 1997
- 3DNow!: AMD, 1998
- SSE: Intel, 1999
- AVX: Intel and AMD, 2008

MMX
- Matrix Math Extensions
- Launched by Intel in 1997 (first on the Pentium MMX, then on the Pentium II)
- 8 64-bit integer registers
- Aliased with the x87 floating-point registers
- Register layout (bits 0 to 63): 8 bytes, 4 words, or 2 double words

3DNow!
- Extension of MMX by AMD in 1998
- K6-2 processor, 1998
- Registers shared with MMX and the x87 FPU
- 21 single-precision floating-point instructions
- Discontinued after 2010
- Register layout (bits 0 to 63): 8 bytes, 4 words, 2 double words, or 2 single-precision floats

SSE (Streaming SIMD Extensions)
- Introduced by Intel in 1999 with the Pentium III
- Pentium III = Pentium II + SSE
- Intel's answer to AMD's 3DNow!
- Katmai New Instructions (KNI)
- 70 new instructions
- Single-precision floating point
- A few additional integer instructions
- 8 new 128-bit registers
- Register layout (bits 0 to 127): 4 single-precision floats

SSE2
- Willamette New Instructions
- Intel Pentium 4 (launched in late 2000)
- 144 new instructions
- Double-precision (64-bit) support
- Extends MMX to use the SSE registers
- Replaces MMX
- Register layout (bits 0 to 127): 8 words, 4 double words, 4 single-precision floats, or 2 double-precision floats
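
As an illustration of MMX-style integer operations on the 128-bit SSE registers, the sketch below brightens 8-bit pixels with a saturating add, 16 bytes per instruction. It assumes the SSE2 intrinsics header `<emmintrin.h>`; the function name `brighten_u8` and the image-brightening scenario are assumptions chosen for the example, not something taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics: __m128i, _mm_adds_epu8, ... */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Add `delta` to every pixel, clamping at 255 instead of wrapping around. */
void brighten_u8(uint8_t *pixels, size_t n, uint8_t delta)
{
    assert(pixels != NULL || n == 0);
    size_t i = 0;
#if HAVE_SSE2
    const __m128i vdelta = _mm_set1_epi8((char)delta);   /* 16 copies of delta */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(pixels + i));
        v = _mm_adds_epu8(v, vdelta);                    /* unsigned saturating add */
        _mm_storeu_si128((__m128i *)(pixels + i), v);
    }
#endif
    for (; i < n; i++) {                                 /* scalar tail, same semantics */
        unsigned sum = (unsigned)pixels[i] + delta;
        pixels[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Saturating arithmetic is typical of multimedia code: a wrapped sum would turn a nearly white pixel dark, while a clamped one stays white.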

SSE3 and SSSE3
- SSE3: Prescott New Instructions (PNI), 2004
  - 13 new instructions
  - DSP and 3D focused
  - Iterate horizontally vs. vertically in an instruction
- SSSE3 (Supplemental SSE3): Merom New Instructions (MNI), 2006
  - 16 new instructions
  - Byte permutations
  - Fixed-point multiplication with rounding
  - Within-word accumulate
- Register layout (bits 0 to 127): 8 words, 4 double words, 4 single-precision floats, or 2 double-precision floats
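
A "horizontal" operation combines neighboring elements within one register, where a "vertical" one combines matching elements of two registers. The sketch below sums the four floats of one register with the SSE3 horizontal add. It assumes the SSE3 intrinsics header `<pmmintrin.h>` and a build with SSE3 enabled (for example `-msse3` on GCC); otherwise it falls back to plain C. The function name `sum4` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE3__)
#include <pmmintrin.h>   /* SSE3 intrinsics: _mm_hadd_ps */
#define HAVE_SSE3 1
#else
#define HAVE_SSE3 0
#endif

/* Return v[0] + v[1] + v[2] + v[3]. */
float sum4(const float v[4])
{
    assert(v != NULL);
#if HAVE_SSE3
    __m128 x = _mm_loadu_ps(v);   /* [v0,    v1,    v2,    v3   ] */
    x = _mm_hadd_ps(x, x);        /* [v0+v1, v2+v3, v0+v1, v2+v3] */
    x = _mm_hadd_ps(x, x);        /* every lane now holds the total */
    return _mm_cvtss_f32(x);      /* extract the lowest lane */
#else
    return (v[0] + v[1]) + (v[2] + v[3]);   /* same pairing as the SSE3 path */
#endif
}
```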

SSE4
- SSE4.1: Penryn New Instructions (PNI), 2007
  - Sum of absolute differences
  - Dot products
  - Floating-point rounding
  - Blending
  - Packed operations
- SSE4.2: Nehalem processors, 2008
  - STTNI (String and Text New Instructions)
  - CRC32
- Register layout (bits 0 to 127): 8 words, 4 double words, 4 single-precision floats, or 2 double-precision floats
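
The CRC32 instruction can be reached from C through the `_mm_crc32_u8` intrinsic in `<nmmintrin.h>`. It implements the CRC-32C (Castagnoli) polynomial, which differs from the CRC-32 used by zlib. The sketch below is a byte-at-a-time checksum that uses the instruction when the build enables SSE4.2 (for example `-msse4.2` on GCC) and a portable bitwise loop otherwise. The function name `crc32c` and the fallback are assumptions added for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE4_2__)
#include <nmmintrin.h>   /* SSE4.2 intrinsics, including _mm_crc32_u8 */
#endif

/* CRC-32C of `len` bytes at `data`, with the usual initial and final inversion. */
uint32_t crc32c(const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
#if defined(__SSE4_2__)
        crc = _mm_crc32_u8(crc, p[i]);      /* one CRC32 instruction per byte */
#else
        crc ^= p[i];                        /* portable bitwise fallback */
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
#endif
    }
    return crc ^ 0xFFFFFFFFu;
}
```

A production version would typically consume 4 or 8 bytes per instruction with `_mm_crc32_u32` or `_mm_crc32_u64`; the one-byte form keeps the sketch short.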

AVX (Advanced Vector Extensions)
- Proposed by Intel and AMD in March 2008
- Intel Sandy Bridge processors, 2011
- AMD Bulldozer processors, 2011
- VEX coding prefixes
- 3-operand instructions
- 16 256-bit registers
- The VEX extension is also supported on legacy SSE instructions
- SSE instructions still use only 128-bit registers
- Register layout (bits 0 to 255): 8 double words or single-precision floats, or 4 double-precision floats

AVX2
- Haswell New Instructions
- Intel Haswell processors, 2013
- Additions:
  - Extends the AVX and SSE integer instructions to 256 bits
  - General-purpose bit manipulation and multiply
  - Fused multiply-add (FMA3): d = round(a × b + c)
  - Gather-scatter: the vector equivalent of register-indirect addressing (AVX2 itself provides gather only; scatter arrives with AVX-512)
  - Permutations
  - Vector shifts
- Register layout (bits 0 to 255): 8 double words or single-precision floats, or 4 double-precision floats
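
Two of the Haswell-era additions, fused multiply-add and gather, can be sketched with intrinsics from `<immintrin.h>`. This assumes a build with both AVX2 and FMA enabled (for example `-mavx2 -mfma` on GCC); without them the code falls back to scalar loops. The function names `fmadd4` and `gather8` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>   /* AVX2 and FMA intrinsics */
#define HAVE_AVX2_FMA 1
#else
#define HAVE_AVX2_FMA 0
#endif

/* d[i] = a[i] * b[i] + c[i] for four doubles. */
void fmadd4(const double *a, const double *b, const double *c, double *d)
{
    assert(a != NULL && b != NULL && c != NULL && d != NULL);
#if HAVE_AVX2_FMA
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    __m256d vc = _mm256_loadu_pd(c);
    _mm256_storeu_pd(d, _mm256_fmadd_pd(va, vb, vc));   /* rounds once */
#else
    for (int i = 0; i < 4; i++)
        d[i] = a[i] * b[i] + c[i];                      /* fallback: may round twice */
#endif
}

/* out[i] = table[idx[i]] for eight 32-bit indices. */
void gather8(const float *table, const int32_t *idx, float *out)
{
    assert(table != NULL && idx != NULL && out != NULL);
#if HAVE_AVX2_FMA
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    __m256 v = _mm256_i32gather_ps(table, vidx, 4);     /* scale 4 = sizeof(float) */
    _mm256_storeu_ps(out, v);
#else
    for (int i = 0; i < 8; i++)
        out[i] = table[idx[i]];
#endif
}
```

The gather is the vector counterpart of an indexed load: each lane supplies its own index, so the eight values may come from non-contiguous addresses.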

AVX-512
- Intel Knights Landing processors
- 2nd-generation Xeon Phi processors
- Scheduled for 2016
- Supports the Enhanced Vector Extension (EVEX) encoding
- 32 512-bit registers
- Up to 4-operand instructions
- 8 new opmask registers (k0 to k7; k0 cannot be selected as a write mask, which leaves 7 usable masks)
- Explicit rounding control
- Compressed displacement addressing mode
- Register layout (bits 0 to 511): 16 double words or single-precision floats, or 8 double-precision floats
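
The opmask registers let an instruction update only selected lanes. The following sketch shows a masked 16-lane float addition using `_mm512_mask_add_ps` from `<immintrin.h>`. It assumes a build with AVX-512F enabled (for example `-mavx512f` on GCC) and otherwise falls back to an equivalent scalar loop. The function name `masked_add16` is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX512F__)
#include <immintrin.h>   /* AVX-512F intrinsics: __m512, __mmask16, ... */
#endif

/* out[i] = a[i] + b[i] where bit i of `mask` is set; otherwise out[i] = a[i]. */
void masked_add16(const float *a, const float *b, float *out, uint16_t mask)
{
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__AVX512F__)
    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    /* Lanes whose mask bit is 0 are copied from the first argument (va). */
    __m512 r = _mm512_mask_add_ps(va, (__mmask16)mask, va, vb);
    _mm512_storeu_ps(out, r);
#else
    for (int i = 0; i < 16; i++)
        out[i] = ((mask >> i) & 1u) ? a[i] + b[i] : a[i];
#endif
}
```

Masking of this kind lets a loop with an `if` inside be vectorized without branching: the condition is evaluated into a mask, and the operation is applied only where the mask is set.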

Compiling with x86 Vector Processing Extensions
- Not every application can use them
- Loops are unrolled, which saves time
- A single vector load replaces several separate loads

- Historically, most compilers did not support vector processing
- The program had to be vectorized by hand
- Problems can arise with memory alignment
- The data to be processed has to be known in advance

- Memory has to be carefully aligned
- Newer compilers support vectorization when compiling from high-level languages
  - Intel Compiler Suite 11.1: AVX
  - GCC 4.9: AVX-512
  - Flags: -m[sse, avx, avx512f, etc.]
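
A loop of the kind a vectorizing compiler can handle on its own might look like the sketch below. It is written in plain C11 with no intrinsics: the `restrict` qualifiers tell the compiler the arrays do not overlap, and `aligned_alloc` provides 64-byte-aligned buffers. The names `saxpy` and `alloc_floats_aligned` are hypothetical, and whether a given compiler actually vectorizes the loop depends on its version and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* y[i] = alpha * x[i] + y[i]; independent iterations, no aliasing. */
void saxpy(size_t n, float alpha, const float *restrict x, float *restrict y)
{
    assert((x != NULL && y != NULL) || n == 0);
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

/* Allocate room for n floats on a 64-byte boundary; release with free(). */
float *alloc_floats_aligned(size_t n)
{
    size_t bytes = n * sizeof(float);
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = (bytes + 63u) / 64u * 64u;
    return aligned_alloc(64, rounded);
}
```

With GCC, building with something like `gcc -O3 -mavx2 -fopt-info-vec file.c` asks for AVX2 code and a report of which loops were vectorized; the exact flags to use depend on the target processor.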

Vector Processing Today
Where are vector processors today?
- Gone: they offered high bandwidth but were custom designed and costly
- Supercomputers now use multiple CPU and GPU cores: cheaper, with lower bandwidth
- National Energy Research Scientific Computing Center: Cori will have Knights Landing Xeon Phis with AVX-512