Parallel Processing: SIMD, Vector and GPUs


EECS4201 Computer Architecture, Fall 2017, York University

Introduction
Vector and array processors
Chaining
GPU

Flynn's Taxonomy
SISD: Single Instruction operating on Single Data
SIMD: Single Instruction operating on Multiple Data
MISD: Multiple Instructions operating on Single Data
MIMD: Multiple Instructions operating on Multiple Data

SIMD
SIMD architectures can exploit significant data-level parallelism for:
- matrix-oriented scientific computing
- media-oriented image and sound processing
SIMD is more energy-efficient than MIMD: it only needs to fetch one instruction per data operation, which makes SIMD attractive for personal mobile devices. SIMD also allows the programmer to continue to think sequentially.

Vector vs. Array Processors
Array processors: the same instruction operates on many data elements at the same time (in space).
Vector processors: the same instruction operates on many data elements in a pipelined fashion. (What is the difference between this and a regular pipelined processor?)

Vector Processors
The Cray-1 was the first commercially successful vector processor. Vector units can afford very deep pipelines, since there are no dependences between the elements of a vector.

Cray-1

VMIPS
Example architecture: VMIPS, loosely based on the Cray-1.
Vector registers: each register holds a 64-element vector with 64 bits/element; the register file has 16 read ports and 8 write ports.
Vector functional units: fully pipelined; data and control hazards are detected.
Vector load-store unit: fully pipelined; one word per clock cycle after an initial latency.
Scalar registers: 32 general-purpose registers and 32 floating-point registers.

VMIPS (block diagram)

VMIPS Instructions
ADDVV.D V1,V2,V3   ;add two vectors
ADDVS.D V1,V2,F0   ;add vector to a scalar
LV V1,R1           ;vector load from address R1
SV R1,V1           ;vector store at address R1
MULVV.D V1,V2,V3   ;vector multiply
DIVVV.D V1,V2,V3   ;vector divide (element by element)
LVWS V1,(R1,R2)    ;load vector from address R1, stride=R2
LVI V1,(R1+V2)     ;load V1 with the elements at R1+V2(i)
CVI V1,R1          ;create an index vector in V1: 0, R1, 2R1, 3R1, ...
S--VV.D V1,V2      ;compare elements of V1,V2 (EQ, NE, GT, ...); set 0 or 1 in the vector-mask register
MVTM VM,F0         ;move the contents of F0 to the vector-mask register
MTC1 VLR,R1        ;move R1 to the vector-length register

Vector Processing
ADDV V3, V1, V2
After an initial latency (the depth of the pipeline) we get one result per cycle. We could compute the same thing with a simple scalar loop; what is the difference?

Vector Execution Time
Execution time depends on three factors: the length of the operand vectors, structural hazards, and data dependences. VMIPS functional units consume one element per clock cycle, so execution time is approximately the vector length.
Convoy: a set of vector instructions that could potentially execute together (no structural hazards; a convoy can contain more than one instruction).
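The elementwise semantics of these vector instructions can be sketched in software. A minimal model (not from the slides; the function names `addvv`, `addvs`, and `cvi` simply mirror the mnemonics above, and vector registers are modeled as Python lists):

```python
# Sketch of VMIPS-style vector instruction semantics on Python lists.
MVL = 64  # assumed maximum vector length (64 elements per vector register)

def addvv(v1, v2):
    """ADDVV.D: elementwise add of two vector registers."""
    return [a + b for a, b in zip(v1, v2)]

def addvs(v1, f0):
    """ADDVS.D: add a scalar to every element of a vector register."""
    return [a + f0 for a in v1]

def cvi(stride, n=MVL):
    """CVI: create an index vector 0, stride, 2*stride, ..."""
    return [i * stride for i in range(n)]

x = list(range(8))
print(addvv(x, x))       # doubles every element
print(addvs(x, 10.0))    # adds 10.0 to every element
print(cvi(4, n=4))       # [0, 4, 8, 12]
```

The point of the model: one "instruction" (function call) specifies work on many elements, with no per-element control flow visible to the programmer.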

Chimes
Sequences with read-after-write dependence hazards can be placed in the same convoy via chaining.
Chaining: allows a vector operation to start as soon as the individual elements of its vector source operand become available.
Chime: the unit of time to execute one convoy. m convoys execute in m chimes; for a vector length of n, that requires approximately m x n clock cycles.

Example
LV V1,Rx          ;load vector X
MULVS.D V2,V1,F0  ;vector-scalar multiply
LV V3,Ry          ;load vector Y
ADDVV.D V4,V2,V3  ;add two vectors
SV Ry,V4          ;store the sum
Convoys: 4 convoys => 4 x 64 = 256 clocks (or 4 clocks per result)

Example
Consider the following loop:
for (i = 0; i < 50; i++)
    c[i] = (a[i] + b[i]) / 2;
The next slides walk through a sequence of improvements, from in-order execution with one memory bank to a chained vector processor with multiple banks.
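The chime arithmetic above can be checked mechanically. A minimal sketch, where the 4-convoy grouping and the 64-element vector length are taken from the slide:

```python
# Approximate vector execution time in chimes:
# m convoys on n-element vectors take about m chimes = m * n cycles.

def chime_cycles(num_convoys, vector_length):
    return num_convoys * vector_length

n = 64          # VMIPS vector registers hold 64 elements
m = 4           # the slide groups the 5 instructions into 4 convoys
total = chime_cycles(m, n)
print(total)        # 256 clocks
print(total // n)   # 4 clocks per result
```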

Assembly Code (scalar)
Initialize registers R0, R1, R2, R3.
LOOP: LD   R4, 0(R1)    ;11 cycles
      LD   R5, 0(R2)    ;11
      ADD  R6, R4, R5   ;4
      SR   R6, R6, 1    ;1  shift right (divide by 2)
      ST   R6, 0(R3)    ;11
      ADDI R1, R1, 4    ;1
      ADDI R2, R2, 4    ;1
      ADDI R3, R3, 4    ;1
      ADDI R0, R0, -1   ;1
      BNEZ R0, LOOP     ;2
(R0 holds the trip count 50; the loop decrements it and branches while it is nonzero.)
Total: 44 cycles per iteration x 50 iterations = 2200 cycles

Vector Code
The loop is vectorizable.
Initialize registers (including vector length and stride): 5+5 = 10 dynamic instructions
LV    V1, R1      ;11+50-1
LV    V2, R2      ;11+50-1
ADDVV V3, V1, V2  ;4+50-1
SRV   V3, V3, 1   ;1+50-1  shift right
SV    V3, R4      ;11+50-1
Total = 293 cycles

Vector Code: Chaining
Chaining: there is no need to wait until the whole vector register is loaded; a dependent operation can start as soon as the first element is ready. How long does the previous case take with chaining?
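As a rough check of the cycle counts above (a sketch; the per-instruction latencies are the ones annotated on the slides):

```python
# Scalar loop: per-iteration latencies from the slide, 50 iterations.
scalar_latencies = [11, 11, 4, 1, 11, 1, 1, 1, 1, 2]  # LD, LD, ADD, SR, ST, 4x ADDI, branch
scalar_total = sum(scalar_latencies) * 50
print(scalar_total)   # 44 * 50 = 2200 cycles

# Vector version (no chaining, single memory port): each instruction costs
# startup + vector_length - 1 cycles, plus 10 cycles of register setup.
n = 50
startups = [11, 11, 4, 1, 11]  # LV, LV, ADDVV, shift, SV
vector_total = 10 + sum(s + n - 1 for s in startups)
print(vector_total)   # 293 cycles
print(round(scalar_total / vector_total, 1))  # ~7.5x faster, even without chaining
```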

Chaining
[Diagram: LV V1 from memory feeds ADDVV V3,V1,V2, which feeds MULVV V5,V3,V4; as elements of V1 arrive from the load unit, the ADD consumes them, and the MUL consumes the ADD's results.]

Vector Code: Chaining, 1 Memory Port
Vector chaining is data forwarding from one vector functional unit to another. With chaining, the ADDVV and the shift overlap with the loads that feed them. Under the strict assumption that each memory bank has a single port (a memory-bandwidth bottleneck), the two LVs cannot be pipelined with each other. WHY? Likewise, the LV and the SV cannot be pipelined. WHY?
Total: 182 cycles.

Vector Code Performance: Multiple Memory Ports
With chaining and 2 load ports plus 1 store port in each bank: 79 cycles, a 19x performance improvement!

Vector Code
What about chaining and 2 memory banks?

Vector Length
In the previous example the vector length was less than the vector register length. What if it is larger (say, an operation on a vector of 1000 elements)? Loop, with each iteration operating on a 64-element vector (and adjust the vector length in the last iteration).
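The chunking described above can be sketched in software. A minimal model (the name `MVL` for the maximum vector length is an assumption; the odd-sized remainder is handled first, as on the next slide):

```python
# Strip-mining sketch: cover an n-element loop with chunks of at most MVL
# elements, doing the odd-size remainder first.
MVL = 64  # assumed maximum vector length

def strip_mine(n):
    """Return the (start, length) of each strip covering range(n)."""
    strips = []
    low = 0
    vl = n % MVL or MVL  # odd-size piece first; a full strip if n divides evenly
    while low < n:
        strips.append((low, vl))
        low += vl
        vl = MVL         # every later strip runs at maximum length
    return strips

print(strip_mine(200))   # [(0, 8), (8, 64), (72, 64), (136, 64)]
```

For a 1000-element vector this yields one 40-element strip followed by fifteen 64-element strips.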

Vector Stripmining
What if the vector length is not known at compile time? Use the Vector Length Register (VLR), and use strip mining for vectors longer than the maximum vector length (MVL):

low = 0;
VL = (n % MVL); /*find odd-size piece using modulo op % */
for (j = 0; j <= (n/MVL); j=j+1) { /*outer loop*/
    for (i = low; i < (low+VL); i=i+1) /*runs for length VL*/
        Y[i] = a * X[i] + Y[i]; /*main operation*/
    low = low + VL; /*start of next vector*/
    VL = MVL; /*reset the length to maximum vector length*/
}

Effect of Memory
The load/store unit is more complicated than the arithmetic functional units; start-up time is the time for the first word to arrive into a register. The memory system must be designed to support high bandwidth for vector loads and stores:
- spread accesses across multiple banks
- control bank addresses independently
- load or store non-sequential words
- support multiple vector processors sharing the same memory

Example
The Cray T932 has 32 processors, each capable of generating 4 loads and 2 stores per clock cycle. The processor cycle time is 2.167 ns and the SRAM cycle time is 15 ns. How many memory banks do we need to allow the system to run at full memory bandwidth?
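The T932 question works out as follows (a sketch of the standard calculation; all numbers are from the slide):

```python
import math

processors = 32
refs_per_cycle = 4 + 2   # 4 loads + 2 stores per processor per cycle
cpu_cycle_ns = 2.167
sram_cycle_ns = 15.0

# Each bank is busy for ceil(15 / 2.167) = 7 processor cycles per access,
# and the machine issues 32 * 6 = 192 references every cycle, so sustaining
# full bandwidth needs 192 * 7 banks.
busy_cycles = math.ceil(sram_cycle_ns / cpu_cycle_ns)
banks = processors * refs_per_cycle * busy_cycles
print(busy_cycles, banks)   # 7 1344
```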

Example
8 memory banks, a bank busy time of 6 cycles, and a total memory latency of 12 cycles. What is the difference between a 64-element vector load with a stride of 1 and one with a stride of 32?

Stride
Consider:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
We must vectorize the multiplication of rows of B with columns of D, which requires a non-unit stride. A bank conflict (stall) occurs when the same bank is hit again before its busy time has elapsed, that is, when

    #banks / GCD(stride, #banks) < bank busy time

(equivalently, LCM(stride, #banks) / stride < bank busy time).

Strides
Address-to-bank mappings (each column is one bank; entries are memory addresses, one row per address within a bank):

4 banks, sequential:    3 banks, sequential:    3 banks, modulo:
 0  1  2  3              0  1  2                 0 16  8
 4  5  6  7              3  4  5                 9  1 17
 8  9 10 11              6  7  8                18 10  2
12 13 14 15              9 10 11                 3 19 11
16 17 18 19             12 13 14                12  4 20
20 21 22 23             15 16 17                21 13  5
24 25 26 27             18 19 20                 6 22 14
28 29 30 31             21 22 23                15  7 23
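The stride question above can be answered with the conflict condition. A sketch (the 76- and 391-cycle totals follow the usual accounting for this kind of example and are assumptions of this simple model, not figures stated on the slide):

```python
from math import gcd

def stalls(stride, banks, busy):
    """With one access per cycle, the same bank is revisited every
    banks // gcd(stride, banks) accesses; it must be ready again by then."""
    return banks // gcd(stride, banks) < busy

BANKS, BUSY, LATENCY, N = 8, 6, 12, 64

print(stalls(1, BANKS, BUSY))    # False: 8 cycles between revisits >= busy time 6
print(stalls(32, BANKS, BUSY))   # True: every access hits the same bank

# Stride 1: startup latency, then one element per cycle.
print(LATENCY + N)                   # 76 cycles (~1.2 per element)
# Stride 32: after the first element, each access waits out the busy time.
print(LATENCY + 1 + (N - 1) * BUSY)  # 391 cycles (~6.1 per element)
```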

Strides
MOD can be calculated very efficiently if the prime number of banks is 1 less than a power of 2. Division is still a problem, but if we change the mapping so that

    address in a bank = address MOD number of words in a bank

then, since the number of words in a bank is usually a power of 2, we get a very efficient implementation. In the preceding example, the first case is the usual 4 banks, then 3 banks with sequential interleaving, then 3 banks with modulo interleaving; notice the conflict-free access to both the rows and the columns of a 4-by-4 matrix in the modulo case.

Vector Mask Register
What if we have a conditional IF statement inside the loop? In a scalar architecture that introduces a control dependence. Vector mask control: a mask register is used to conditionally execute under a Boolean condition. When the vector mask register is enabled, any vector instruction executed operates only on the vector elements whose corresponding entries in the vector mask register (VMR) are 1; the rest of the elements are unaffected. Clearing the vector mask register sets it to all 1s, and operations are then performed on all elements. Masked execution does not save execution time for the masked-off elements.

Vector Mask Register
Consider:
for (i = 0; i < 64; i=i+1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
Use the vector mask register to disable elements:
LV      V1,Rx    ;load vector X into V1
LV      V2,Ry    ;load vector Y
L.D     F0,#0    ;load FP zero into F0
SNEVS.D V1,F0    ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV      Rx,V1    ;store the result in X
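The masked subtract above behaves like the following elementwise model (a sketch; `vm` plays the role of the vector-mask register, and the function names mirror the mnemonics):

```python
# Sketch of vector-mask semantics: SNEVS.D builds the mask, and SUBVV.D
# under mask updates only the elements whose mask bit is 1.

def snevs(v, scalar):
    """Set mask bit i to 1 where v[i] != scalar (like SNEVS.D)."""
    return [1 if e != scalar else 0 for e in v]

def subvv_masked(v1, v2, vm):
    """Subtract under mask: masked-off elements are left unchanged."""
    return [a - b if m else a for a, b, m in zip(v1, v2, vm)]

x = [3.0, 0.0, 5.0, 0.0]
y = [1.0, 1.0, 1.0, 1.0]
vm = snevs(x, 0.0)
print(vm)                       # [1, 0, 1, 0]
print(subvv_masked(x, y, vm))   # [2.0, 0.0, 4.0, 0.0]
```

Note that, as the slide says, the masked-off elements still occupy pipeline slots in hardware; the mask changes what is written back, not how long the instruction runs.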

Vector Mask Register
for (i = 0; i < 50; i++)
    if (a[i] != 0)
        b[i] = a[i] * b[i];

LV  V0, 0(A)
LV  V1, 0(B)
Set MASK = (V0 != 0)
MULVV V1, V0, V1
SV  V1, 0(B)

Scatter-Gather
Consider:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
Use index vectors:
LV      Vk, Rk       ;load K
LVI     Va, (Ra+Vk)  ;load A[K[]] (gather)
LV      Vm, Rm       ;load M
LVI     Vc, (Rc+Vm)  ;load C[M[]] (gather)
ADDVV.D Va, Va, Vc   ;add them
SVI     (Ra+Vk), Va  ;store A[K[]] (scatter)

Scatter-Gather
a[j] = a[d[j]]
LVI Va, (Ra+Vk)  ;load A[K[]], where Ra is the starting address of A and Vk holds the index vector loaded from D
k: 0 2 5 7
A in memory: 15.1 12.3 4.2 3.67 11.2 15.0 5.9 2.3
Va: 15.1 4.2 15.0 2.3
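The gather and scatter above behave like indexed list accesses. A sketch (`lvi` and `svi` are hypothetical helpers modeling the LVI/SVI semantics; the data is the 8-element array from the slide):

```python
def lvi(base, index_vector):
    """Gather: load base[index_vector[i]] for each i (like LVI)."""
    return [base[k] for k in index_vector]

def svi(base, index_vector, values):
    """Scatter: store values[i] into base[index_vector[i]] (like SVI)."""
    for k, v in zip(index_vector, values):
        base[k] = v

a = [15.1, 12.3, 4.2, 3.67, 11.2, 15.0, 5.9, 2.3]
k = [0, 2, 5, 7]
print(lvi(a, k))   # [15.1, 4.2, 15.0, 2.3], matching the slide

# A[K[i]] = A[K[i]] + C[M[i]] from the slide, with assumed C and M:
c = [1.0] * 8
m = [0, 1, 2, 3]
svi(a, k, [x + y for x, y in zip(lvi(a, k), lvi(c, m))])
print(a[0])   # ~16.1
```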

Multiple Lanes
Operations are interleaved across multiple lanes. This allows multiple hardware lanes with no changes to the machine code.

Not Quite SIMD
Intel extensions MMX, SSE and AVX; PowerPC AltiVec; ARM Advanced SIMD. There is no vector length register: the element count depends on the instruction, and the same register can be treated as 16 8-bit numbers, 8 16-bit numbers, and so on.
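The "same register, different element widths" idea can be illustrated with Python's struct module (a sketch; real MMX/SSE/AVX instructions perform this reinterpretation in hardware, selected by the opcode rather than by a length register):

```python
import struct

# One 128-bit "register" as 16 raw bytes.
reg = bytes(range(16))

# Reinterpret the same 128 bits at different element widths (little-endian),
# the way an SSE-style instruction set does depending on the instruction:
as_8bit = struct.unpack('<16B', reg)   # 16 x 8-bit lanes
as_16bit = struct.unpack('<8H', reg)   # 8 x 16-bit lanes
as_32bit = struct.unpack('<4I', reg)   # 4 x 32-bit lanes

print(len(as_8bit), len(as_16bit), len(as_32bit))   # 16 8 4
```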