Lecture 16: SSE vector processing (SIMD Multimedia Extensions)

Improving performance with SSE

We've seen how we can apply multithreading to speed up the cardiac simulator. But there is another kind of parallelism available to us: SSE.

Hardware Control Mechanisms

Flynn's classification (1966): how do the processors issue instructions?

SIMD (Single Instruction, Multiple Data): a single control unit broadcasts one global instruction stream to the processing elements (PEs), which execute it in lock-step.

MIMD (Multiple Instruction, Multiple Data): clusters and servers; each processor pairs its own control unit with a PE, and the processors execute their instruction streams independently.

[Figure: SIMD as one control unit driving an array of PEs over an interconnect; MIMD as PE+CU pairs connected by an interconnect.]

SIMD (Single Instruction, Multiple Data)

Operate on regular arrays of data.

Two landmark SIMD designs:
- ILLIAC IV (1960s)
- Connection Machine 1 and 2 (1980s)
Vector computer: Cray-1 (1976)

Intel and others support SIMD for multimedia and graphics:
- SSE (Streaming SIMD Extensions), AltiVec
- Operations defined on vectors
- GPUs, Cell Broadband Engine (Sony PlayStation)

Reduced performance on data-dependent or irregular computations:

forall i = 0 : N-1
    p[i] = a[i] * b[i]

forall i = 0 : n-1
    x[i] = y[i] + z[K[i]]
end forall

forall i = 0 : n-1
    if (x[i] < 0) then
        y[i] = -x[i]
    else
        y[i] = x[i]
    end if
end forall

[Figure: element-wise vector multiply, p = a * b]

Are SIMD processors general purpose?
A. Yes
B. No

What kind of parallelism does multithreading provide?
A. MIMD
B. SIMD

Streaming SIMD Extensions

SIMD instruction set on short vectors. SSE3 on Bang, but most will need only SSE2.
See https://goo.gl/diokkj and https://software.intel.com/sites/landingpage/intrinsicsguide

Bang: 8 x 128-bit vector registers (newer CPUs have 16).

for i = 0:N-1 { p[i] = a[i] * b[i]; }

[Figure: 128-bit registers a, b, p with a lane-wise multiply p = a * b. A 128-bit register holds 2 doubles, or 4 floats, ints, etc.]

SSE Architectural support

SSE2, SSE3, SSE4, AVX.

Vector operations on short vectors: add, subtract, ...; 128-bit load/store.

SSE2+: 16 XMM registers (128 bits). These are in addition to the conventional registers and are treated specially.

Shuffling (handles conditionals).

Data transfer: load/store.

See the Intel intrinsics guide: software.intel.com/sites/landingpage/intrinsicsguide

May need to invoke compiler options, depending on the level of optimization.

C++ intrinsics

C++ functions and data types that map directly onto one or more machine instructions. Supported by all major compilers.

The interface provides 128-bit data types and operations on those data types:
__m128 (floats), __m128d (doubles)

Data movement and initialization:
_mm_load_pd (aligned load)
_mm_store_pd
_mm_loadu_pd (unaligned load)
Data may need to be aligned.

__m128d vec1, vec2, vec3;
for (i = 0; i < N; i += 2) {
    vec1 = _mm_load_pd(&b[i]);
    vec2 = _mm_load_pd(&c[i]);
    vec3 = _mm_div_pd(vec1, vec2);
    vec3 = _mm_sqrt_pd(vec3);
    _mm_store_pd(&a[i], vec3);
}

How do we vectorize?

Original code:
double a[N], b[N], c[N];
for (i = 0; i < N; i++)
    a[i] = sqrt(b[i] / c[i]);

Identify vector operations, reduce the loop bound:
for (i = 0; i < N; i += 2)
    a[i:i+1] = vec_sqrt(b[i:i+1] / c[i:i+1]);

The vector instructions:
__m128d vec1, vec2, vec3;
for (i = 0; i < N; i += 2) {
    vec1 = _mm_load_pd(&b[i]);
    vec2 = _mm_load_pd(&c[i]);
    vec3 = _mm_div_pd(vec1, vec2);
    vec3 = _mm_sqrt_pd(vec3);
    _mm_store_pd(&a[i], vec3);
}

Performance

Scalar code:
double *a, *b, *c;
for (i = 0; i < n; i++)
    a[i] = sqrt(b[i] / c[i]);

Vectorized code ($PUB/Examples/SSE/Vec):
double *a, *b, *c;
__m128d vec1, vec2, vec3;
for (i = 0; i < n; i += 2) {
    vec1 = _mm_load_pd(&b[i]);
    vec2 = _mm_load_pd(&c[i]);
    vec3 = _mm_div_pd(vec1, vec2);
    vec3 = _mm_sqrt_pd(vec3);
    _mm_store_pd(&a[i], vec3);
}

Without SSE vectorization: 0.777 sec.
With SSE vectorization: 0.454 sec.
Speedup due to vectorization: 1.7x

The assembler code

Scalar source:
double *a, *b, *c;
for (i = 0; i < n; i++)
    a[i] = sqrt(b[i] / c[i]);

Vectorized source:
double *a, *b, *c;
__m128d vec1, vec2, vec3;
for (i = 0; i < n; i += 2) {
    vec1 = _mm_load_pd(&b[i]);
    vec2 = _mm_load_pd(&c[i]);
    vec3 = _mm_div_pd(vec1, vec2);
    vec3 = _mm_sqrt_pd(vec3);
    _mm_store_pd(&a[i], vec3);
}

Inner loop generated for the scalar source:
.L12:
    movsd   xmm0, QWORD PTR [r12+rbx]
    divsd   xmm0, QWORD PTR [r13+0+rbx]
    sqrtsd  xmm1, xmm0
    ucomisd xmm1, xmm1               # checks for illegal sqrt (NaN)
    jp      .L30
    movsd   QWORD PTR [rbp+0+rbx], xmm1
    add     rbx, 8                   # ivtmp.135
    cmp     rbx, 16384
    jne     .L12

What prevents vectorization: interrupted flow out of the loop

for (i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    maxval = (a[i] > maxval ? a[i] : maxval);
    if (maxval > 1000.0) break;
}
Loop not vectorized/parallelized: multiple exits.

This loop will vectorize:
for (i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    maxval = (a[i] > maxval ? a[i] : maxval);
}

SSE2 cheat sheet (load and store)

xmm: one operand is a 128-bit SSE2 register
mem/xmm: the other operand is in memory or an SSE2 register
{SS} Scalar Single-precision FP: one 32-bit operand in a 128-bit register
{PS} Packed Single-precision FP: four 32-bit operands in a 128-bit register
{SD} Scalar Double-precision FP: one 64-bit operand in a 128-bit register
{PD} Packed Double-precision FP: two 64-bit operands in a 128-bit register
{A} the 128-bit operand is aligned in memory
{U} the 128-bit operand is unaligned in memory
{H} move the high half of the 128-bit operand
{L} move the low half of the 128-bit operand

(Slide credit: Krste Asanovic & Randy H. Katz)