Dan Stafford, Justine Bonnot

2 Background; Applications; Timeline: MMX, 3DNow!, Streaming SIMD Extensions (SSE, SSE2, SSE3 and SSSE3, SSE4), Advanced Vector Extensions (AVX, AVX2, AVX-512); Compiling with x86 Vector Processing Extensions; Vector Processing Today

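3 Vector processing exploits data-level parallelism, reduces stalls from branches (fewer loop iterations mean fewer branch instructions), and is equivalent to loop unrolling. Figure: scalar processing compared with vector processing.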
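
The loop-unrolling equivalence can be sketched in portable C. This is an illustration only, not code from the slides: the function names add_scalar and add_unrolled4 are hypothetical, and the unroll factor of four is an assumption chosen to match four single-precision values in a 128-bit register.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar processing: one addition and one loop branch per element. */
void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    assert(n == 0 || (a && b && out));
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Unrolled by four (hypothetical helper): each iteration handles four
   independent elements, the same work a single 128-bit packed add does
   on four single-precision lanes. */
void add_unrolled4(const float *a, const float *b, float *out, size_t n)
{
    assert(n == 0 || (a && b && out));
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n is not a multiple of 4 */
        out[i] = a[i] + b[i];
}
```

Both functions produce identical results; the unrolled form takes one loop branch per four elements instead of one per element, which is where the reduction in branch stalls comes from.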

4 Processor types by instruction and data streams:
Scalar processor: SISD (Single Instruction, Single Data); scalar registers.
(Full) vector processor: SIMD (Single Instruction, Multiple Data); vector registers.
Vector processing extension: SIMD; scalar registers with a vector inside each register, divided into separate components.
Figure: in SISD one instruction operates on one data item to produce one result; in SIMD one instruction operates on multiple data items to produce multiple results.

5 Multimedia processing, compression, graphics, image processing, simulations, engineering tools, CAD, cryptography, etc.

6 MMX (Intel, 1997); 3DNow! (AMD, 1998); SSE (Intel, 1999); AVX (Intel and AMD, 2008)

7 MMX (Matrix Math Extensions): launched by Intel in 1997, first on the Pentium with MMX Technology and then on the Pentium II. Eight 64-bit integer registers, aliased with the x87 floating-point registers. Each register (bits 0 to 63) holds eight packed bytes, four packed words, or two packed double words.
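
The idea of several small integers packed into one 64-bit register can be sketched without any MMX instructions. The code below is a portable model under stated assumptions, not MMX code: packed_add_u8 and get_lane are hypothetical names, and the bit-masking trick stands in for what a packed byte add with wraparound (PADDB in MMX) does in hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Add eight unsigned bytes packed into two 64-bit words, lane by lane,
   with wraparound inside each lane and no carry between lanes. */
uint64_t packed_add_u8(uint64_t x, uint64_t y)
{
    const uint64_t high = 0x8080808080808080ULL;   /* top bit of every byte */
    /* Add the low 7 bits of each lane; the sum fits in 8 bits, so no
       carry can leak into the neighbouring lane. */
    uint64_t low_sum = (x & ~high) + (y & ~high);
    /* Fold the top bits back in with XOR, which adds them and drops
       the carry out of the lane. */
    return low_sum ^ ((x ^ y) & high);
}

/* Extract byte lane 0..7 (lane 0 is the least significant byte). */
uint8_t get_lane(uint64_t v, unsigned lane)
{
    assert(lane < 8);
    return (uint8_t)(v >> (8 * lane));
}
```

For example, adding lanes 0xFF and 0x01 yields 0x00 in that lane and leaves the lane above it untouched, which is the behaviour that distinguishes a packed add from an ordinary 64-bit add.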

8 3DNow!: an extension of MMX introduced by AMD in 1998 with the K6-2. Registers are shared with MMX and the x87 FPU. 21 single-precision floating-point instructions. Later discontinued. Each 64-bit register holds eight bytes, four words, two double words, or two single-precision values.

9 SSE (Streaming SIMD Extensions): introduced by Intel in 1999 with the Pentium III (Pentium III = Pentium II + SSE) as Intel's answer to AMD's 3DNow!. Originally named Katmai New Instructions (KNI). 70 new instructions: single-precision floating point plus a few additional integer instructions. Eight new 128-bit registers, each holding four single-precision values.

10 SSE2 (Willamette New Instructions): introduced with the Intel Pentium 4. Adds double-precision (64-bit) floating-point support and extends the MMX integer instructions to use the SSE registers, replacing MMX. Each 128-bit register holds eight words, four double words, four single-precision values, or two double-precision values.

11 SSE3 (Prescott New Instructions, PNI): new instructions with a DSP and 3D focus; instructions can operate horizontally (across the elements of one register) as well as vertically (element by element between registers). SSSE3 (Supplemental SSE3; Merom New Instructions, MNI): byte permutations, fixed-point multiplication with rounding, and within-word accumulate. Register layout is unchanged: each 128-bit register holds eight words, four double words, four single-precision values, or two double-precision values.

12 SSE4.1 (Penryn New Instructions, 2007): sum of absolute differences, dot products, floating-point rounding, blending, packed operations. SSE4.2 (Nehalem processors, 2008): STTNI (String and Text New Instructions) and CRC. Register layout is unchanged: each 128-bit register holds eight words, four double words, four single-precision values, or two double-precision values.

13 AVX (Advanced Vector Extensions): proposed by Intel and AMD in March 2008; implemented in the Intel Sandy Bridge and AMD Bulldozer processors. VEX coding prefix; 3-operand instructions; 256-bit registers. The VEX encoding is also supported on legacy SSE instructions, but SSE instructions still use only 128-bit registers. Each 256-bit register holds eight double words or single-precision values, or four double-precision values.

14 AVX2 (Haswell New Instructions): Intel Haswell processor, 2013. Additions: extends the AVX and SSE integer instructions to 256 bits; general-purpose bit manipulation and multiply; fused multiply-add (FMA3), d = round(a × b + c), with a single rounding step; gather, the vector equivalent of register-indirect addressing (scatter stores arrive with AVX-512); permutations; vector shifts. Each 256-bit register holds eight double words or single-precision values, or four double-precision values.
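
The description of gather as register-indirect addressing applied to every lane can be modelled with a scalar loop. This is a sketch of the semantics only, assuming 32-bit indices and single-precision elements; gather_f32 is a hypothetical name, and a real gather instruction performs all lane loads as one instruction rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a vector gather: lane i of the result is loaded from
   base[index[i]], so every lane uses its own address. */
void gather_f32(const float *base, const int32_t *index, float *out, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++) {
        assert(index[i] >= 0);      /* this sketch only accepts non-negative indices */
        out[i] = base[index[i]];
    }
}
```

With base = {0, 10, 20, 30, ...} and index = {7, 0, 3, 3}, the result lanes are base[7], base[0], base[3], base[3]; indices may repeat and need not be contiguous, which is what an ordinary vector load cannot express.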

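15 AVX-512: Intel Knights Landing processor (2nd-generation Xeon Phi), scheduled for 2016. Uses the Enhanced Vector Extension (EVEX) encoding; 512-bit registers; up to 4-operand instructions; 7 new opmask registers usable as write masks (k1 to k7; k0 encodes "no masking"); explicit rounding control; compressed displacement addressing mode. Each 512-bit register holds sixteen double words or single-precision values, or eight double-precision values.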
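
What an opmask register does can be modelled in portable C. The sketch below assumes sixteen 32-bit lanes (one 512-bit register) and merge masking, where lanes whose mask bit is clear keep their previous destination value; masked_add_i32 is a hypothetical name, not an intrinsic.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16   /* sixteen 32-bit lanes fill one 512-bit register */

/* Scalar model of a merge-masked vector add: bit i of mask decides
   whether lane i is written; unselected lanes are left unchanged. */
void masked_add_i32(int32_t *dst, const int32_t *a, const int32_t *b, uint16_t mask)
{
    assert(dst && a && b);
    for (int lane = 0; lane < LANES; lane++) {
        if (mask & (1u << lane))
            dst[lane] = a[lane] + b[lane];
    }
}
```

Calling it with mask 0x0005 updates only lanes 0 and 2. Masking of this kind lets a loop with a per-element condition be vectorized without branching on each element.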

16 Vector extensions cannot be used by all applications. Where they do apply, loops are unrolled to save time, and a single vector load of an array replaces several scalar loads.

17 Most compilers do not support vector processing, so the program has to be written by hand. Problems can arise with memory alignment. The data to process has to be known in advance.

18 Memory has to be carefully aligned. Newer compilers support compiling from high-level languages: the Intel Compiler Suite supports AVX, and GCC 4.9 supports AVX-512. GCC selects the target extension with -m flags: -msse, -mavx, -mavx512f, etc.
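
The alignment requirement can be met from C with the standard C11 aligned_alloc function. The sketch below is one possible approach, not taken from the slides: the helper names and the 32-byte alignment (one 256-bit AVX register) are assumptions, and a 512-bit target would use 64 bytes instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 32   /* bytes; assumed width of one 256-bit register */

/* Returns nonzero when p sits on a multiple of alignment bytes. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}

/* Allocate count floats on a VEC_ALIGN boundary (hypothetical helper).
   The size is rounded up because C11 aligned_alloc expects a multiple
   of the alignment. Release the buffer with free(). */
float *alloc_aligned_floats(size_t count)
{
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    return aligned_alloc(VEC_ALIGN, rounded);
}
```

A file using such buffers would then be built with one of the flags listed on the slide, for example gcc -O3 -mavx file.c, so that the compiler is allowed to emit the corresponding vector instructions.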

19 Where are vector processors today? Gone: they offered high bandwidth but were custom designed and costly. Supercomputers now use multiple CPU and GPU cores, which are cheaper but have lower bandwidth. Cori, at the National Energy Research Scientific Computing Center, will have Knights Landing Xeon Phis with AVX-512.


More information

Compiling for Scalable Computing Systems the Merit of SIMD. Ayal Zaks Intel Corporation Acknowledgements: too many to list

Compiling for Scalable Computing Systems the Merit of SIMD. Ayal Zaks Intel Corporation Acknowledgements: too many to list Compiling for Scalable Computing Systems the Merit of SIMD Ayal Zaks Intel Corporation Acknowledgements: too many to list Lo F IRST, Thanks for the Technion For Inspiration and Recognition of Science and

More information

Architectures of Flynn s taxonomy -- A Comparison of Methods

Architectures of Flynn s taxonomy -- A Comparison of Methods Architectures of Flynn s taxonomy -- A Comparison of Methods Neha K. Shinde Student, Department of Electronic Engineering, J D College of Engineering and Management, RTM Nagpur University, Maharashtra,

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

William Stallings Computer Organization and Architecture 8 th Edition. Micro-programmed Control

William Stallings Computer Organization and Architecture 8 th Edition. Micro-programmed Control William Stallings Computer Organization and Architecture 8 th Edition Chapter 16 Micro-programmed Control Presenters: Andres Borroto Juan Fernandez Laura Verdaguer Control Unit Organization Micro-programmed

More information

Scientific computing with non-standard floating point types

Scientific computing with non-standard floating point types University of Dublin, Trinity College Masters Thesis Scientific computing with non-standard floating point types Author: Vlăduţ Mădălin Druţa Supervisor: Dr. David Gregg A thesis submitted in partial fulfilment

More information

Programmazione Avanzata

Programmazione Avanzata Programmazione Avanzata Vittorio Ruggiero (v.ruggiero@cineca.it) Roma, Marzo 2017 Pipeline Outline CPU: internal parallelism? CPU are entirely parallel pipelining superscalar execution units SIMD MMX,

More information

MAQAO Hands-on exercises LRZ Cluster

MAQAO Hands-on exercises LRZ Cluster MAQAO Hands-on exercises LRZ Cluster LProf: lightweight generic profiler LProf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Copy handson material > cp /home/hpc/a2c06/lu23bud/lrz-vihpstw21/tools/maqao/maqao_handson_lrz.tar.xz

More information

Figure 1: 128-bit registers introduced by SSE. 128 bits. xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7

Figure 1: 128-bit registers introduced by SSE. 128 bits. xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 SE205 - TD1 Énoncé General Instructions You can download all source files from: https://se205.wp.mines-telecom.fr/td1/ SIMD-like Data-Level Parallelism Modern processors often come with instruction set

More information

Kirill Rogozhin. Intel

Kirill Rogozhin. Intel Kirill Rogozhin Intel From Old HPC principle to modern performance model Old HPC principles: 1. Balance principle (e.g. Kung 1986) hw and software parameters altogether 2. Compute Density, intensity, machine

More information

VECTORISATION. Adrian

VECTORISATION. Adrian VECTORISATION Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Vectorisation Same operation on multiple data items Wide registers SIMD needed to approach FLOP peak performance, but your code must be

More information

Parallel Processing SIMD, Vector and GPU s

Parallel Processing SIMD, Vector and GPU s Parallel Processing SIMD, Vector and GPU s EECS4201 Fall 2016 York University 1 Introduction Vector and array processors Chaining GPU 2 Flynn s taxonomy SISD: Single instruction operating on Single Data

More information

The von Neumann Architecture. IT 3123 Hardware and Software Concepts. The Instruction Cycle. Registers. LMC Executes a Store.

The von Neumann Architecture. IT 3123 Hardware and Software Concepts. The Instruction Cycle. Registers. LMC Executes a Store. IT 3123 Hardware and Software Concepts February 11 and Memory II Copyright 2005 by Bob Brown The von Neumann Architecture 00 01 02 03 PC IR Control Unit Command Memory ALU 96 97 98 99 Notice: This session

More information

MAQAO Hands-on exercises FROGGY Cluster

MAQAO Hands-on exercises FROGGY Cluster MAQAO Hands-on exercises FROGGY Cluster LProf: lightweight generic profiler LProf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Copy handson material > cp /home/projects/pr-vi-hps-tw18/tutorial/maqao.tar.bz2

More information

Introduction to the Xeon Phi programming model. Fabio AFFINITO, CINECA

Introduction to the Xeon Phi programming model. Fabio AFFINITO, CINECA Introduction to the Xeon Phi programming model Fabio AFFINITO, CINECA What is a Xeon Phi? MIC = Many Integrated Core architecture by Intel Other names: KNF, KNC, Xeon Phi... Not a CPU (but somewhat similar

More information

Advanced Computer Architecture Lab 4 SIMD

Advanced Computer Architecture Lab 4 SIMD Advanced Computer Architecture Lab 4 SIMD Moncef Mechri 1 Introduction The purpose of this lab assignment is to give some experience in using SIMD instructions on x86. We will

More information

A study on SIMD architecture

A study on SIMD architecture A study on SIMD architecture Gürkan Solmaz, Rouhollah Rahmatizadeh and Mohammad Ahmadian Department of Electrical Engineering and Computer Science University of Central Florida Email: {gsolmaz,rrahmati,mohammad}@knights.ucf.edu

More information

Vectorized implementations of post-quantum crypto

Vectorized implementations of post-quantum crypto Vectorized implementations of post-quantum crypto Peter Schwabe January 12, 2015 DIMACS Workshop on the Mathematics of Post-Quantum Cryptography The multicore revolution Until early years 2000 each new

More information

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com History of GPUs

More information

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory

More information