CMPE 655 Multiple Processor Systems. SIMD/Vector Machines. Daniel Terrance Stephen Charles Rajkumar Ramadoss

2 SIMD Machines - Introduction
SIMD machines are computers with an array of multiple processing elements (PEs). The same operation is performed in parallel on each element of a data structure (data-level parallelism). All PEs are synchronized to a single program counter (PC) and respond to a single instruction in a given cycle. Each PE has its own address registers. Application: image processing.
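The lockstep behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not a description of any real machine: the names `ProcessingElement` and `broadcast` are hypothetical, and the PEs are visited one after another although real hardware would update them in the same cycle.

```python
class ProcessingElement:
    """A PE holds its own local data but has no program counter of its own."""

    def __init__(self, value):
        self.register = value

    def execute(self, op, operand):
        # Each PE applies the instruction it is handed to its local data.
        self.register = op(self.register, operand)


def broadcast(pes, op, operand):
    """Model of the single control unit issuing one instruction to all PEs."""
    for pe in pes:  # conceptually simultaneous; sequential in this sketch
        pe.execute(op, operand)


# Image-processing flavored example: brighten every pixel by the same amount.
pixels = [10, 20, 30, 40]
pes = [ProcessingElement(p) for p in pixels]
broadcast(pes, lambda x, k: x + k, 5)
result = [pe.register for pe in pes]
print(result)
```

One instruction (`x + k`) is issued once, and every PE applies it to a different data element.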

3 SIMD Pros & Cons
Pros: Useful in applications where the same operation has to be performed on an array of data (e.g., a loop operation). The cost of the control unit is amortized over dozens of functional units. Instruction bandwidth and space are reduced.
Cons: Large register files increase power consumption and chip size. Currently, using SIMD instructions requires manual programming effort. Not useful in applications where different operations need to be performed on different elements (e.g., switch-case statements).

4 SIMD Interpretations
Two interpretations of SIMD architecture are very popular today.
SIMD multimedia extensions in the x86 ISA: SIMD instructions added to the x86 architecture for graphics and digital signal processing. Ex: Intel's Streaming SIMD Extensions (SSE), AMD's 3DNow!
Vector architecture: has a large set of registers to store data elements and uses pipelined execution units to operate on the data elements sequentially.

5 Vector Architecture In Detail
A vector processor contains a vector unit and an ordinary scalar unit. The functional units in the vector unit operate with a latency of several clock cycles. They can be pipelined more deeply than scalar units because a multi-cycle latency permits a short clock cycle time and is compatible with vector operations that run for a long period of time.

6 Vector Architecture Types
Vector-register architecture: operations are between vector registers (except load and store operations). Ex: Cray-1, Cray-2, Fujitsu VP200 through VP5000, Hitachi S820, and Convex C-1 through C-4.
Memory-memory architecture: operations are memory to memory. Ex: TI ASC and CDC STAR-100.

7 Vector Register Processor

8 Vector Registers
Each vector register holds a single vector. VMIPS has 8 vector registers, and each register holds 64 elements of 64 bits each. Each vector register has 2 read ports and 1 write port in VMIPS, for a total of 16 read ports and 8 write ports. Real machines try to reduce the cost of the vector-register file by exploiting the regular access patterns within a vector instruction. Ex: the Cray-1 has only one port per register.

9 Vector Functional Units
The functional units are fully pipelined and can start a new operation on every clock cycle. VMIPS has five functional units. Scalar operations may use the vector functional units or a dedicated set.

10 Vector Load/Store Unit (LSU) & Scalar Registers
LSU: loads or stores a vector from or to memory. The LSU in VMIPS is fully pipelined and also handles scalar loads and stores.
Scalar registers: provide data as one input to the vector functional units and compute addresses to pass to the vector LSU. The MIPS architecture has 32 general-purpose registers and 32 floating-point registers.

11 Reduction of Instruction Bandwidth
Example: MIPS code compared with vector MIPS (VMIPS) code.
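As a rough illustration of why vector code needs less instruction bandwidth, the Python sketch below computes Y = a * X + Y in two ways and counts how many arithmetic operations are issued in each style. It is not the MIPS or VMIPS code from the slide. The counting model is an assumption made here for illustration: only arithmetic is counted, and loads, stores, and loop overhead are ignored.

```python
def daxpy_scalar(a, x, y):
    """Scalar style: one multiply and one add are issued per element."""
    issued = 0
    out = list(y)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
        issued += 2
    return out, issued


def daxpy_vector(a, x, y):
    """Vector style: each operation is issued once for the whole vector."""
    issued = 0
    ax = [a * xi for xi in x]                # one vector-scalar multiply
    issued += 1
    out = [p + yi for p, yi in zip(ax, y)]   # one vector-vector add
    issued += 1
    return out, issued


x = list(range(64))
y = [1] * 64
scalar_result, scalar_issued = daxpy_scalar(2, x, y)
vector_result, vector_issued = daxpy_vector(2, x, y)
print(scalar_issued, vector_issued)
```

Both versions produce the same result, but the scalar count grows with the vector length while the vector count stays fixed.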

12 Vector Processors - Issues
What if the size of the vector does not match the size of the vector registers? A vector-length register is used when the vector is shorter than the vector register. Strip mining is used when the vector is longer than the vector register.
What if non-adjacent elements have to be fetched from memory? Vector stride.

13 Strip Mining
Strip mining generates code in which each vector operation handles at most MVL elements, with one shorter pass for the leftover data.
      low = 1
      VL = (n mod MVL)                 /*find the odd-size piece*/
      do 1 j = 0,(n / MVL)             /*outer loop*/
         do 10 i = low, low + VL - 1   /*runs for length VL*/
            Y(i) = a * X(i) + Y(i)     /*main operation*/
   10    continue
         low = low + VL                /*start of next vector*/
         VL = MVL                      /*reset the length to max*/
    1 continue
MVL is the maximum vector length (the size of a vector register).
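The same strip-mining scheme can be written as a short Python sketch. The function name `daxpy_strip_mined` and the returned list of strip lengths are additions made here for illustration; indices are 0-based rather than 1-based as in the Fortran loop.

```python
MVL = 64  # maximum vector length, assumed here to be 64 elements


def daxpy_strip_mined(a, x, y, mvl=MVL):
    """Compute a * x + y in strips of at most mvl elements."""
    n = len(x)
    out = list(y)
    strips = []          # length of each vector operation, for inspection
    low = 0
    vl = n % mvl         # the odd-size piece is handled first
    for _ in range(n // mvl + 1):
        # One "vector operation" over elements low .. low + vl - 1.
        xs = x[low:low + vl]
        ys = y[low:low + vl]
        out[low:low + vl] = [a * xi + yi for xi, yi in zip(xs, ys)]
        strips.append(vl)
        low += vl        # start of the next strip
        vl = mvl         # every later strip uses the full length
    return out, strips


x = list(range(200))
y = [1] * 200
result, strips = daxpy_strip_mined(2, x, y)
print(strips)
```

For n = 200 and MVL = 64 the first strip covers the 8 leftover elements and the remaining three strips are full length.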

14 Vector Stride
The stride is the distance in memory that separates the elements to be gathered into a single vector register. Once loaded, the elements behave as if they were stored in logically adjacent locations. Strides are used to good effect in matrix multiplication, where the array stored in memory is linearized.
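A minimal sketch of a strided load, assuming a 3 x 4 matrix stored in row-major order in a flat Python list. The helper name `load_strided` is hypothetical and stands in for a strided vector load instruction.

```python
def load_strided(memory, base, stride, vl):
    """Load vl elements starting at base, stepping by stride each time."""
    return [memory[base + i * stride] for i in range(vl)]


rows, cols = 3, 4
# Row-major linearization: element (r, c) lives at index r * cols + c.
memory = list(range(rows * cols))

# A row is contiguous in memory, so its stride is 1.
row1 = load_strided(memory, base=1 * cols, stride=1, vl=cols)

# A column is spread out: consecutive elements are cols apart.
col2 = load_strided(memory, base=2, stride=cols, vl=rows)

print(row1, col2)
```

The column elements are not adjacent in memory, yet after the load they sit side by side in the result just like the row elements.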

15 More Issues...
Conditional statements in loops: vector-mask control.
Sparse matrices: scatter-gather operations.

16 Vector Mask Control
Basic idea: convert control dependencies into data dependencies.
Method: a Boolean vector register, the vector-mask register, controls execution. A vector operation is performed on an element only when the corresponding entry in the vector-mask register is 1. Resetting the vector-mask register sets all entries to 1, so that subsequent vector instructions operate on every element of the vector.

17 Vector Mask Control
Example: a loop with a conditional statement. The vectorized code sets the vector-mask register from a comparison and then performs the operation only on the elements whose mask entry is 1.
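A sketch of vector-mask control in Python, assuming the loop being vectorized is "if X(i) is nonzero then X(i) = X(i) - Y(i)". The loop itself and the helper names `set_mask` and `masked_sub` are chosen here for illustration and are not taken from the slide.

```python
def set_mask(x):
    """Build the vector mask: 1 where the condition holds, 0 elsewhere."""
    return [1 if xi != 0 else 0 for xi in x]


def masked_sub(x, y, mask):
    """Subtract y from x only where the mask is 1; other elements are kept."""
    return [xi - yi if m else xi for xi, yi, m in zip(x, y, mask)]


x = [5, 0, 7, 0]
y = [1, 2, 3, 4]

mask = set_mask(x)              # the branch becomes data: a vector of 0s and 1s
result = masked_sub(x, y, mask)
print(mask, result)
```

The conditional disappears from the loop body: the comparison produces data (the mask), and the subtraction is issued once for the whole vector.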

18 Scatter and Gather
A sparse matrix can be held in a normal representation (which includes the zeros) and in a dense representation (which excludes them). Scatter and gather are the operations that move data between the two. The gather operation produces the dense vector, on which the arithmetic operations are performed. The scatter operation stores the result vector back in the normal form.

19 Scatter and Gather
The VMIPS instructions LVI (load vector indexed) and SVI (store vector indexed) support the gather and scatter operations, respectively. The addition of two sparse vectors is implemented by gathering the operands, adding them, and scattering the result.
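The gather, operate, scatter sequence can be sketched in Python as follows. This is not VMIPS code: `gather` and `scatter` are hypothetical helpers standing in for indexed load and store, and the index vectors `k` and `m` (positions of the nonzero elements) are made-up example data.

```python
def gather(memory, index):
    """Indexed load: collect memory[i] for each i in the index vector."""
    return [memory[i] for i in index]


def scatter(memory, index, values):
    """Indexed store: write each value back to its position in memory."""
    for i, v in zip(index, values):
        memory[i] = v


a = [0, 10, 0, 0, 20, 0, 30, 0]   # sparse vector in normal form
c = [0, 0, 1, 0, 0, 2, 0, 3]      # second sparse vector in normal form
k = [1, 4, 6]                     # positions of the nonzeros of a
m = [2, 5, 7]                     # positions of the nonzeros of c

va = gather(a, k)                          # dense form of a
vc = gather(c, m)                          # dense form of c
vsum = [p + q for p, q in zip(va, vc)]     # arithmetic on dense vectors
scatter(a, k, vsum)                        # result back into normal form
print(a)
```

Only three element additions are performed, even though the vectors in normal form have eight entries each.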

20 Multiple Lanes
Vector performance is improved by implementing parallel pipelines.

21 Multiple Lanes
The ISA of a vector processor is designed so that the nth element of one vector can take part in an operation only with the nth element of another vector. This lets the parallel pipelined units be organized as multiple parallel lanes. The source and destination operands of every operation are available within the corresponding lane, which removes the need for inter-lane communication during arithmetic. Inter-lane communication is still required to access main memory.

22 Structure of a vector unit with four lanes.
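The lane organization can be modeled with a short Python sketch. It assumes that elements are interleaved across lanes, so that element i is held by lane i mod 4; that mapping and the function names are assumptions made here for illustration.

```python
def split_into_lanes(vector, num_lanes):
    """Lane j holds elements j, j + num_lanes, j + 2 * num_lanes, ..."""
    return [vector[lane::num_lanes] for lane in range(num_lanes)]


def add_with_lanes(x, y, num_lanes=4):
    """Element-wise add of equal-length vectors, one lane at a time."""
    x_lanes = split_into_lanes(x, num_lanes)
    y_lanes = split_into_lanes(y, num_lanes)
    out = [0] * len(x)
    for lane in range(num_lanes):
        # Both operands of every addition are already in this lane,
        # so no data crosses between lanes.
        lane_result = [p + q for p, q in zip(x_lanes[lane], y_lanes[lane])]
        out[lane::num_lanes] = lane_result
    return out


x = list(range(8))
y = [10] * 8
lanes = split_into_lanes(x, 4)
total = add_with_lanes(x, y)
print(lanes, total)
```

With four lanes, four element pairs can be processed per cycle, and each lane only ever touches its own portion of the vector registers.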

23 Some Vector Machines
Cray supercomputers, Fujitsu VP series supercomputers, Hitachi S series supercomputers, Convex supercomputers.

24 Architecture of Cray-1A Supercomputer

25 Organization of Cray T3D Supercomputer

26 Cray T3D Node
Every routing switch connects two processing elements (PEs) to the network. The Alpha processor in every PE was designed by Digital Equipment Corporation (DEC), which is now part of HP.

27 Cray T3D - Synchronization Mechanisms
Barrier synchronization: used when all the processes must reach a common point before proceeding further.
Eureka synchronization: used in search tasks, where the search is terminated as soon as one of the processes has found the element being searched for.
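Both mechanisms can be imitated in software with Python's standard `threading` module. This is only an analogy to the ideas above, not a model of the T3D hardware: `threading.Barrier` plays the role of barrier synchronization, a `threading.Event` plays the role of the eureka signal, and the function name `eureka_search` is hypothetical.

```python
import threading


def eureka_search(data, wanted, num_workers=4):
    """Search data in parallel; stop all workers once one finds wanted."""
    found = threading.Event()                  # the "eureka" signal
    barrier = threading.Barrier(num_workers)   # common starting point
    lock = threading.Lock()
    result = {}

    def worker(wid):
        barrier.wait()                 # nobody starts until everyone is ready
        for i in range(wid, len(data), num_workers):
            if found.is_set():         # another worker already succeeded
                return
            if data[i] == wanted:
                with lock:
                    result.setdefault("index", i)
                found.set()            # tell the others to stop searching
                return

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result.get("index")         # None if the element is absent


print(eureka_search(list(range(100)), 42))
```

Each worker checks the event before every comparison, so the remaining workers give up shortly after the first success instead of scanning their whole partition.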

28 Performance on Cray T3D

29 Characteristics of Several Vector Machines
Table columns: Processor, Clock Rate (MHz), Vector Registers, Elements per Register, Vector Load/Store Units, Lanes.
Table rows: Cray, Cray, Fujitsu VP100/VP200, Hitachi S810/S820, Convex C, Cray SV, VMIPS.
Surviving table entries: (VP100), 2 (VP200); Loads, 1 Store; 1 (S810), 2 (S820); (64 bit), 2 (32 bit).

30 Conclusion
Vector processing finds application in modern GPUs and other image-processing environments. Vector processors tend to be more expensive than scalar processors. Although vector processors are not widely popular today, they still represent a milestone in supercomputing. They will continue to have a future, but they are unlikely to attain the popularity of scalar microprocessors.

