COSC 6385 Computer Architecture - Vector Processors


COSC 6385 Computer Architecture - Vector Processors, Spring 2011

Chapter F of the 4th edition (Chapter G of the 3rd edition), available on the CD attached to the book. Anybody having problems finding it should contact me.

Vector processors were big in the 70s and 80s and are still used today:
- Vector machines: Earth Simulator, NEC SX9, Cray X1
- Graphics cards
- MMX, SSE, SSE2, SSE3 are to some extent vector units

Main concepts

Vector processors abstract operations on vectors, e.g. they replace the loop

    for (i=0; i<n; i++) {
        a[i] = b[i] + c[i];
    }

by a = b + c, i.e. a single vector instruction such as

    ADDV.D V10, V8, V6

Some languages offer high-level support for these operations (e.g. Fortran90 or newer).

Main concepts (II)

Advantages of vector instructions:
- A single instruction specifies a great deal of work
- Since loop iterations must not contain data dependences on other iterations, there is no need to check for data hazards between loop iterations; only one check is required between two vector instructions
- Loop branches are eliminated
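The first slide mentions MMX/SSE as vector units; as an illustration (my own sketch, not from the course material), the same a = b + c loop written with SSE intrinsics in C, processing four floats per instruction (assuming, for brevity, that n is a multiple of 4):

    #include <xmmintrin.h>

    void vec_add(float *a, const float *b, const float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 vb = _mm_loadu_ps(&b[i]);          /* load 4 elements of b */
            __m128 vc = _mm_loadu_ps(&c[i]);          /* load 4 elements of c */
            _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc)); /* a = b + c, 4 at a time */
        }
    }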

Basic vector architecture

A modern vector processor contains:
- Regular, pipelined scalar units and regular scalar registers
- Vector units (the inventors of pipelining!)
- Vector registers: each can hold a fixed number of entries (e.g. 64)
- Vector load-store units

Comparison MIPS code vs. vector code

Example: Y = aX + Y for 64 elements

    L.D    F0, a        /* load scalar a */
    DADDIU R4, Rx, #512 /* last address (64 elements x 8 bytes) */
L:  L.D    F2, 0(Rx)    /* load X(i) */
    MUL.D  F2, F2, F0   /* calc. a times X(i) */
    L.D    F4, 0(Ry)    /* load Y(i) */
    ADD.D  F4, F4, F2   /* a X(i) + Y(i) */
    S.D    F4, 0(Ry)    /* store Y(i) */
    DADDIU Rx, Rx, #8   /* increment X */
    DADDIU Ry, Ry, #8   /* increment Y */
    DSUBU  R20, R4, Rx  /* compute bound */
    BNEZ   R20, L

Comparison MIPS code vs. vector code (II)

Example: Y = aX + Y for 64 elements

    L.D     F0, a      /* load scalar a */
    LV      V1, 0(Rx)  /* load vector X */
    MULVS.D V2, V1, F0 /* vector-scalar mult */
    LV      V3, 0(Ry)  /* load vector Y */
    ADDV.D  V4, V2, V3 /* vector add */
    SV      V4, 0(Ry)  /* store vector Y */

Vector execution time

- Convoy: set of vector instructions that could potentially begin execution in one clock cycle. A convoy must not contain structural or data hazards (similar to VLIW). Initial assumption: a convoy must complete before another convoy can start execution.
- Chime: time unit to execute one convoy, independent of the vector length. A sequence consisting of m convoys executes in m chimes; for vector length n it takes approximately m x n clock cycles.

Example

    LV      V1, 0(Rx)  /* load vector X */
    MULVS.D V2, V1, F0 /* vector-scalar mult */
    LV      V3, 0(Ry)  /* load vector Y */
    ADDV.D  V4, V2, V3 /* vector add */
    SV      V4, 0(Ry)  /* store vector Y */

Convoys of the above code sequence:
1. LV
2. MULVS.D LV
3. ADDV.D
4. SV

4 convoys => 4 chimes to execute (for n = 64 elements, roughly 4 x 64 = 256 cycles).

Overhead

Start-up overhead of a pipeline: how many cycles does it take to fill the pipeline before the first result is available?

    Unit        Start-up [cycles]
    Load/store  12
    Multiply     7
    Add          6

    Convoy       Starting time   First result    Last result
    LV           0               12              11 + n
    MULVS.D LV   12 + n          12 + n + 12     23 + 2n
    ADDV.D       24 + 2n         24 + 2n + 6     29 + 3n
    SV           30 + 3n         30 + 3n + 12    41 + 4n
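The table can be reproduced with a few lines of C (my own sketch, not from the slides). Note that for the second convoy the load/store start-up of 12 cycles dominates the multiply's 7 cycles:

    #include <stdio.h>

    int main(void)
    {
        const char *name[] = { "LV", "MULVS.D LV", "ADDV.D", "SV" };
        int startup[]      = { 12, 12, 6, 12 }; /* slowest unit in each convoy */
        int n = 64;                             /* vector length */
        int start = 0;
        for (int i = 0; i < 4; i++) {
            int first = start + startup[i];          /* pipeline filled        */
            int last  = start + startup[i] + n - 1;  /* one result per cycle   */
            printf("%-11s start=%3d first=%3d last=%3d\n",
                   name[i], start, first, last);
            start = last + 1;  /* next convoy begins after the last result */
        }
        return 0;
    }

For n = 64 this prints exactly the values of the table above (e.g. LV: start 0, first result 12, last result 75 = 11 + n).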

Pipelining Metrics (I)

    T_c    clock time, time to finish one segment/sub-operation
    m      number of stages of the pipeline
    n      length of the vector
    S      start-up time in clocks, time after which the first result is available; S = m
    N_1/2  length of the loop needed to achieve half of the maximum speed

Assuming a simple loop such as:

    for (i=0; i<n; i++) {
        a[i] = b[i] + c[i];
    }

Pipelining Metrics (II)

    op        number of operations per loop iteration
    op_total  total number of operations for the loop, op_total = op * n

Speed of the loop:

    F = op_total / time = (op * n) / (T_c * (m + (n - 1))) = op / (T_c * ((m - 1)/n + 1))

For n -> infinity we get

    F_max = op / T_c

Pipelining Metrics (III)

Because of the definition of N_1/2 and F_max = op / T_c, we now have

    op / (T_c * ((m - 1)/N_1/2 + 1)) = (1/2) * op / T_c
    =>  (m - 1)/N_1/2 + 1 = 2
    =>  N_1/2 = m - 1

The length of the loop required to achieve half of the theoretical peak performance of a pipeline is thus essentially equal to the number of segments (stages) of the pipeline.

Pipelining Metrics (IV)

More generally, N_alpha is defined through

    op / (T_c * ((m - 1)/N_alpha + 1)) = alpha * op / T_c

and leads to

    N_alpha = (m - 1) / (1/alpha - 1)

E.g. for alpha = 3/4 you get N_3/4 = 3(m - 1), i.e. roughly 3m: the closer you would like to get to the maximum performance of your pipeline, the larger the iteration count of your loop has to be.
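A quick numeric check of these formulas (my own sketch), for a hypothetical pipeline with m = 6 stages, T_c = 1 and op = 1:

    #include <stdio.h>

    int main(void)
    {
        double Tc = 1.0, op = 1.0;
        int m = 6;
        double Fmax  = op / Tc;  /* asymptotic speed            */
        int    Nhalf = m - 1;    /* N_1/2 = m - 1               */
        for (int n = 1; n <= 32; n *= 2) {
            double F = op * n / (Tc * (m + n - 1));  /* loop speed */
            printf("n=%2d  F=%.3f  (F_max=%.3f)\n", n, F, Fmax);
        }
        printf("F(N_1/2=%d) = %.3f = F_max/2\n",
               Nhalf, op * Nhalf / (Tc * (m + Nhalf - 1)));
        return 0;
    }

The last line prints 0.500, confirming that a loop of length N_1/2 = 5 runs at exactly half the peak speed of this 6-stage pipeline.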

Vector length control

What happens if the loop length does not match the length of the vector registers?
- A vector-length register (VLR) contains the number of elements used within a vector register
- Strip mining: split a large loop into loops of length less than or equal to the maximum vector length (MVL)

Vector length control (II)

    low = 0;
    VL = (n % MVL);                 /* odd-sized first piece */
    for (j = 0; j <= n/MVL; j++) {
        for (i = low; i < low + VL; i++) {
            Y[i] = a * X[i] + Y[i];
        }
        low += VL;
        VL = MVL;                   /* all further pieces have full length */
    }

E.g. for n = 200 and MVL = 64, the loop processes pieces of 8, 64, 64, and 64 elements.

Vector stride

Memory on vector machines is typically organized in multiple banks:
- Allows for independent management of different memory addresses
- Memory bank access time is an order of magnitude larger than the CPU clock cycle

Example: assume 8 memory banks and 6 cycles of memory bank time to deliver a data item. Multiple data requests are overlapped by the hardware:

    Cycle  Bank 1  Bank 2  Bank 3  Bank 4  Bank 5  Bank 6  Bank 7  Bank 8
    0      X(0)
    1      busy    X(1)
    2      busy    busy    X(2)
    3      busy    busy    busy    X(3)
    4      busy    busy    busy    busy    X(4)
    5      busy    busy    busy    busy    busy    X(5)
    6              busy    busy    busy    busy    busy    X(6)
    7                      busy    busy    busy    busy    busy    X(7)
    8      X(8)            busy    busy    busy    busy    busy
    9      busy    X(9)            busy    busy    busy    busy
    10     busy    busy    X(10)           busy    busy    busy
    11     busy    busy    busy    X(11)           busy    busy
    12     busy    busy    busy    busy    X(12)           busy
    13     busy    busy    busy    busy    busy    X(13)
    14             busy    busy    busy    busy    busy    X(14)

Vector stride (II)

What happens if the code does not access subsequent elements of the vector?

    for (i=0; i<n; i+=2) {
        a[i] = b[i] + c[i];
    }

- The vector load compacts the data items in the vector register (gather); there is no effect on the execution of the loop itself
- You might however use only a subset of the memory banks -> longer load time
- Worst case: the stride is a multiple of the number of memory banks (a sketch illustrating this follows after the summary below)

Summary 1: Vector processors

- Support for operations on vectors using special instructions
- Each iteration has to be data independent of the other iterations
- Multiple vector instructions are organized into convoys in the absence of data and structural hazards
- Each convoy can be executed in one chime
- So far: the assumption that a convoy has to finish before another convoy can start
- Start-up cost of a pipeline
- Strip-mining costs in case the loop iteration count does not match the length of the vector registers
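A small sketch (my own, not from the slides) of why the stride matters: with 8 banks and elements interleaved across them, a stride that shares a factor with the bank count touches fewer banks, serializing the accesses:

    #include <stdio.h>

    #define NBANKS 8

    int banks_used(int stride, int nelem)
    {
        int used[NBANKS] = { 0 }, count = 0;
        for (int i = 0; i < nelem; i++) {
            int bank = (i * stride) % NBANKS;   /* element i lands in this bank */
            if (!used[bank]) { used[bank] = 1; count++; }
        }
        return count;
    }

    int main(void)
    {
        for (int stride = 1; stride <= 8; stride++)
            printf("stride %d -> %d of %d banks\n",
                   stride, banks_used(stride, 64), NBANKS);
        return 0;
    }

Stride 1 uses all 8 banks, stride 2 only 4, and stride 8 (the worst case, a multiple of the bank count) hammers a single bank.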

Enhancing Vector Performance

Five techniques to further improve the performance:
- Chaining
- Conditional execution
- Support for sparse matrices
- Multiple lanes
- Reducing start-up costs by pipelining

Chaining

Example:

    MULV.D V1, V2, V3
    ADDV.D V4, V1, V5

The second instruction has a data dependence on the first instruction: two convoys are required. However, once the element V1(i) has been calculated, the second instruction could already calculate V4(i):
- No need to wait until all elements of V1 are available
- Works similarly to forwarding in pipelining
- This technique is called chaining

Chaining (II)

Recent implementations use flexible chaining: the vector register file has to be accessible by multiple vector units simultaneously. Chaining allows operations to proceed in parallel on separate elements of vectors:
- Operations can be scheduled in the same convoy
- Reduces the number of chimes
- Does not reduce the start-up overhead

Chaining (III)

Example: chained and unchained versions of the ADDV.D and MULV.D shown previously, for 64 elements.
- Start-up latency of the FP MUL vector unit: 7 cycles
- Start-up latency of the FP ADD vector unit: 6 cycles
- NOTE: different results than in the book in fig. G.10

Unchained version: 7 + 63 + 6 + 63 = 139 cycles
Chained version: 7 + 6 + 63 = 76 cycles

[Timing diagram: unchained, the ADDV (6 + 63 cycles) starts only after the MULV (7 + 63 cycles) has completed; chained, the ADDV starts as soon as the first MULV result is available.]
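To make the accounting concrete, a tiny sketch (my own, following the slide's counting of start-up plus n - 1 cycles per unit) that reproduces both cycle counts for arbitrary vector lengths:

    #include <stdio.h>

    int main(void)
    {
        int n = 64, mul_start = 7, add_start = 6;
        int unchained = (mul_start + n - 1) + (add_start + n - 1); /* 139 */
        int chained   = mul_start + add_start + (n - 1);           /*  76 */
        printf("unchained: %d cycles, chained: %d cycles\n", unchained, chained);
        return 0;
    }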

Conditional execution

Consider the following loop:

    for (i=0; i<N; i++) {
        if ( A[i] != 0 ) {
            A[i] = A[i] - B[i];
        }
    }

This loop can usually not be vectorized because of the conditional statement.
Vector-mask control: a boolean vector of length MVL controls, per element of the vector, whether an instruction is executed or not.

Conditional execution (II)

    LV      V1, Ra     /* load vector A into V1 */
    LV      V2, Rb     /* load vector B into V2 */
    L.D     F0, #0     /* set F0 to zero */
    SNEVS.D V1, F0     /* set VM(i)=1 if V1(i) != F0 */
    SUBV.D  V1, V1, V2 /* subtract under vector mask */
    CVM                /* reset vector mask to all 1s */
    SV      V1, Ra     /* store V1 */
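A scalar sketch (my own, not from the slides) of what vector-mask control computes; VM stands in for the vector-mask register, and masked-off elements of A are left untouched (assuming n <= 1024 for the fixed-size array):

    void masked_sub(double *A, const double *B, int n)
    {
        char VM[1024];                 /* vector-mask register stand-in */
        for (int i = 0; i < n; i++)
            VM[i] = (A[i] != 0.0);     /* SNEVS.D: VM(i)=1 where A(i) != 0 */
        for (int i = 0; i < n; i++)
            if (VM[i])
                A[i] = A[i] - B[i];    /* SUBV.D executed under the mask */
    }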

Support for sparse matrices

Access to the non-zero elements of a sparse matrix is often described by

    A(K(i)) = A(K(i)) + C(M(i))

where K(i) and M(i) describe which elements of A and C are non-zero. The number of non-zero elements has to match; their locations do not have to.
- Gather operation: take an index vector and fetch the corresponding elements using a base address; maps a non-contiguous representation to a contiguous one
- Scatter operation: the inverse of the gather operation

Support for sparse matrices (II)

    LV     Vk, Rk       /* load index vector K into Vk */
    LVI    Va, (Ra+Vk)  /* load vector indexed: A(K(i)) */
    LV     Vm, Rm       /* load index vector M into Vm */
    LVI    Vc, (Rc+Vm)  /* load vector indexed: C(M(i)) */
    ADDV.D Va, Va, Vc   /* add the gathered elements */
    SVI    Va, (Ra+Vk)  /* store vector indexed: A(K(i)) */

Note: the compiler needs an explicit hint that each element of K points to a distinct element of A. A hardware alternative is a hash table keeping track of the acquired addresses: a new vector iteration (convoy) is started as soon as an address appears a second time.
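A scalar sketch (my own, not from the slides) of the gather/scatter semantics above: LVI gathers A(K(i)) into a dense register, SVI scatters it back. It assumes n <= 64 (one vector register's worth) and, as the compiler hint requires, distinct entries in K:

    void sparse_update(double *A, const double *C,
                       const int *K, const int *M, int n)
    {
        double Va[64], Vc[64];  /* "vector registers", MVL = 64 assumed */
        for (int i = 0; i < n; i++) Va[i] = A[K[i]];  /* gather A(K(i)) */
        for (int i = 0; i < n; i++) Vc[i] = C[M[i]];  /* gather C(M(i)) */
        for (int i = 0; i < n; i++) Va[i] += Vc[i];   /* ADDV.D         */
        for (int i = 0; i < n; i++) A[K[i]] = Va[i];  /* scatter back   */
    }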

Multiple Lanes

Further performance improvements are possible if multiple functional units (lanes) can be used for the same vector operation (a small sketch of the lane interleaving follows at the end of this section).

[Figure: the elements A[1]..A[9] and B[1]..B[9] of a vector add C(i) = A(i) + B(i), first queued up for a single pipelined adder, then spread across four lanes.]

Pipelined instruction start-up

The start of one vector instruction can overlap with another vector instruction:
- Theoretically, the first instruction of a new vector operation could start in the cycle after the last instruction of the previous operation
- Practically, there is a dead time between two instructions in order to simplify the logic of the pipeline, e.g. 4 cycles before the next vector instruction can start
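The promised sketch (my own) of the element-to-lane interleaving in the figure: with 4 lanes, lane j handles elements j, j+4, j+8, ...; conceptually the per-lane loops below run in parallel:

    #define NLANES 4

    void lane_add(const double *A, const double *B, double *C, int n)
    {
        for (int lane = 0; lane < NLANES; lane++)   /* one loop per lane */
            for (int i = lane; i < n; i += NLANES)
                C[i] = A[i] + B[i];                 /* element i -> lane i % NLANES */
    }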

Effectiveness of Compiler Vectorization