
Vector Processors

A vector processor is a pipelined processor with special instructions designed to keep the (floating-point) execution unit pipeline(s) full. These special instructions are vector instructions.

Terminology:

    Scalar: a single quantity (number).
    Vector: an ordered series of scalar quantities; a one-dimensional array.

[Figure: a scalar quantity is a single data element; a vector quantity is a series of data elements.]

Five basic types of vector operations:

    1. V <- V       Example: complement all elements
    2. S <- V       Examples: min, max, sum
    3. V <- V x V   Examples: vector addition, multiplication, division
    4. V <- V x S   Examples: multiply or add a scalar to a vector
    5. S <- V x V   Example: calculate an element of a matrix (a dot product)

One instruction says, in effect, "do the same thing on all the elements of the vector(s)."

The generic vector processor:

[Figure: a multiport memory system supplies stream A and stream B to a pipelined processor, which returns stream C = A x B to memory.]

Many large-scale scientific and engineering problems can be solved by operations on large vectors or matrices of floating-point numbers. Vector processors are designed to work efficiently on these problems. Performance of these machines is measured in:

    FLOPS: floating-point operations per second,
    MegaFLOPS: a million FLOPS, or
    GigaFLOPS: a billion FLOPS.

The extremely high performance is achieved only for problems that can be expressed as operations on large vectors. These processors are also called supercomputers, a term popularized by the CRAY series. The cost/performance ratio of vector processors can be impressive, but the initial cost is high (few of them are built). NEC's SX-4 series, which NEC claims was the most successful supercomputer, sold just 134 systems in 3 years. NEC reports that the SX-5, introduced in June 1998, received orders for 22 systems over the following year.

We also see the attached vector processor: an optional vector processing unit attached to a standard scalar computer.

Matrix multiplication:

Suppose we want to calculate the product of two N x N matrices, C := A x B. We must perform this calculation:

    $c_{ij} := \sum_{k=0}^{N-1} a_{ik} b_{kj}$

Inner loop of a scalar processor performing the matrix multiply

The following loop calculates a single element of the matrix C. We must execute this loop N^2 times to get A x B:

            -------             ; Instructions to initialize 1 iteration of kloop
            -------             ; (initialize RC, RN, Rk, Ri, Rj)
    kloop:  ADD   Ri, Stride-i  ; Increment column of A
            ADD   Rj, Stride-j  ; Increment row of B
            LOAD  RA, A(Ri)     ; Get value of Matrix A row
            LOAD  RB, B(Rj)     ; Get value of Matrix B column
            FMPY  RA, RB        ; Floating multiply
            FADD  RC, RA        ; Floating add
            INC   Rk            ; Increment k
            CMP   Rk, RN        ; At end of Row x Column?
            BNE   kloop         ; No -- repeat for R x C
            STORE RC, C(r, c)   ; Yes -- store C element
            -------             ; Continue with all rows/columns of C
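
For reference, here is the same computation in C; a minimal sketch (the function name and the row-major layout are illustrative, not from the original notes). The innermost k loop is the kloop above, and it is executed N^2 times:

    #include <stddef.h>

    /* Scalar N x N matrix multiply, C := A x B, row-major storage. */
    void matmul(size_t n, const double *a, const double *b, double *c)
    {
        for (size_t i = 0; i < n; i++) {            /* each row of C */
            for (size_t j = 0; j < n; j++) {        /* each column of C */
                double sum = 0.0;                   /* RC */
                for (size_t k = 0; k < n; k++)      /* the kloop body */
                    sum += a[i*n + k] * b[k*n + j]; /* FMPY, FADD */
                c[i*n + j] = sum;                   /* STORE RC, C(r, c) */
            }
        }
    }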

Vector Processor Operation

With a vector processor, we have minimal instructions to set up the vector operation, and the entire inner loop (kloop) consists of three vector instructions:

    -------                        ; Instructions to initialize vector operation
    VLOAD   V1, A(r), N, Stride-i  ; Vector load row of A with stride i
    VLOAD   V2, B(c), N, Stride-j  ; Vector load column of B with stride j
    VMPYADD V1, V2, RC             ; Vector multiply + add to C
    STORE   RC, C(r, c)            ; Store C element
    -------                        ; Continue with all rows/columns of C

The special vector instruction allows us to calculate each element of C in a single vector floating-point instruction (VMPYADD) rather than 2N scalar floating-point instructions (FMPY and FADD) and 5N loop-control and addressing instructions. In addition, the special vector instruction can keep the floating-point pipeline full and generate one result per clock.

For example, suppose we have a 4-stage floating-point addition pipe and a 10-stage floating-point multiply pipe. Do we ever get more than one instruction in the pipelines at a time with the kloop sequence of the scalar processor? On the vector processor, by contrast, we will keep both pipelines full with successive multiply/adds. With P independent pipes, we can operate on P elements of C in parallel.
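
In C terms, VMPYADD is a strided multiply-accumulate over one row of A and one column of B. A minimal sketch, assuming row-major storage (the function name is illustrative); fma() makes the multiply-feeds-add structure explicit:

    #include <stddef.h>
    #include <math.h>   /* fma(); link with -lm */

    /* One element of C: rc += a_row[k] * b_col[k], where the row of A is
       read at stride 1 and the column of B at stride n (one element per
       matrix row). */
    double vmpyadd(size_t n, const double *a_row, const double *b_col)
    {
        double rc = 0.0;
        for (size_t k = 0; k < n; k++)
            rc = fma(a_row[k], b_col[k * n], rc);  /* multiply feeds the add */
        return rc;
    }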

FORTRAN is still the preferred language for the majority of the users of vector processors, because the majority of users are scientists and engineers and because there is a large amount of scientific software available in FORTRAN.

Example FORTRAN:

        DO 100 I=1,N
          A(I) = B(I) + C(I)
          B(I) = 2 * A(I+1)
    100 CONTINUE

If we unwind this DO loop:

    A(1) = B(1) + C(1)
    B(1) = 2 * A(2)
    A(2) = B(2) + C(2)
    B(2) = 2 * A(3)
    ...

Note the dependence: each iteration reads A(I+1) before the next iteration overwrites it, so the old values of A must be saved before the whole-vector assignment to A. Vector FORTRAN:

    TEMP(1:N) = A(2:N+1)
    A(1:N)    = B(1:N) + C(1:N)
    B(1:N)    = 2 * TEMP(1:N)

Also, some optimizing FORTRAN compilers automatically generate vector code from the original DO loop. For example, DEC VAX FORTRAN supports the automatic generation of vector operations:

    [NO]VECTOR  Controls whether or not the compiler checks the source code for data dependencies and generates code for the vector hardware when the code is eligible.
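
The transformation can be checked in C; a sketch under the assumption, as in the FORTRAN, that A has N+1 elements (the names are illustrative):

    #include <stddef.h>
    #include <string.h>

    /* Original loop (0-based): a[i] = b[i] + c[i]; b[i] = 2 * a[i+1];
       Each iteration reads a[i+1] before the next iteration overwrites it,
       so the vector form must copy the old tail of a first. a has n+1
       elements; b, c, and temp have n. */
    void vectorized(size_t n, double *a, double *b, const double *c,
                    double *temp)
    {
        memcpy(temp, a + 1, n * sizeof *temp);  /* TEMP(1:N) = A(2:N+1) */
        for (size_t i = 0; i < n; i++)          /* A(1:N) = B(1:N) + C(1:N) */
            a[i] = b[i] + c[i];
        for (size_t i = 0; i < n; i++)          /* B(1:N) = 2 * TEMP(1:N) */
            b[i] = 2.0 * temp[i];
    }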

An example vector processor:

NEC announced the SX-4 supercomputer in November 1994. It is the third in the SX series of supercomputers and is upward compatible with the SX-3R vector processor, with enhancements for scalar processing, short-vector processing, and parallel processing. The SX-4 has an 8.0 ns clock cycle and a peak performance of 2 GFLOPS per processor.

Each SX-4 processor contains a vector unit and a superscalar unit. The vector unit is built from eight vector pipeline processor VLSI chips. Each chip is a self-contained vector unit with registers holding 32 vector elements. The eight chips are connected by a crossbar and comprise 32 vector pipelines arranged as sets of eight add/shift, eight multiply, eight divide, and eight logical pipes. Each set of eight pipes serves a single vector instruction, and all sets of pipes can operate concurrently. With a vector add and a vector multiply operating concurrently, the pipes provide 2 GFLOPS peak performance.

The memory and the processors within each SX-4 node are connected by a nonblocking crossbar. Each processor has a 16 GB/s port into the crossbar. The main memory can have up to 1024 banks of 64-bit-wide synchronous static RAM (SSRAM). The SSRAM is composed of 4 Mbit, 15 ns components. Bank cycle time is only two clocks. (Note: NEC has subsequently changed to synchronous dynamic RAM (SDRAM) instead of static RAM.) A 32-processor node has a 512 GB/s sustainable memory bandwidth. Conflict-free unit-stride as well as stride-2 access is guaranteed from all 32 processors simultaneously. Higher strides and list-vector access benefit from the very short bank cycle time.

Note: The SX-4 achieves the stated 2 GFLOPS by feeding a multiply directly into an add, and doing this concurrently on 8 parallel pipelines: 8 ns per clock = 125 MHz, and 125 MHz x 2 FLOPS/clock x 8 pipes = 2 GFLOPS.

NEC SX-5 Organization

[Figure: SX-5 block diagram. Legend: CPU = central processing unit; MM = main memory unit; IOP = input-output processor; VR = vector register file; SR = scalar register file.]

The SX-5 series employs a 0.25-micron CMOS LSI technology. This enables the SX-5 to achieve a clock cycle of 4.0 ns, half that of the SX-4 series.

                      SX-4BA Server  SX-4A Single Node  SX-4AM Multi Node
    CPUs              1-4            1-16               8-256
    CPU peak          1.8 GF         2 GF               2 GF
    System peak       7.2 GF         32 GF              512 GF
    Clock             8.8 ns         8.0 ns             8.0 ns
    Memory type       SDRAM          SDRAM              SDRAM
    Max. capacity     16 GB          32 GB              512 GB
    Max. banking      4,096          8,192              131,072
    IOP (max)         1.6 GB/s       3.2 GB/s           25.6 GB/s
    XMU               Optional       Optional           Optional
      Max. bandwidth  3.6 GB/s       8 GB/s             128 GB/s
      Max. capacity   8 GB           16 GB              64 GB

    Table 1: SX-4A models overview

SX-4 Vector Unit

Substantial effort has been made to provide significant vector performance for short vector lengths. The crossover between scalar and vector performance is a short 8 elements in most cases.

The vector unit has 8 operational registers from which all operations can be started. In addition, there are 64 vector data registers which support a subset of the instructions and which can receive results from the pipelines concurrently with the 8 operational registers; the vector data registers serve as a high-performance vector cache which significantly reduces memory traffic in most cases. The ganging of the 8 vector pipeline processor VLSI chips results in visible vector registers which each hold 256 vector elements. The vector unit is therefore described as 72 registers of 256 elements of 64 bits each.

Revisit the definition of speedup

Recall that the speedup of a pipeline measures how much more quickly a workload is completed by the pipelined processor than by a non-pipelined processor:

    Speedup = (serial execution time) / (best parallel execution time)

A k-stage pipeline with all stages of equal duration (one clock period) has a theoretical speedup of k, because it takes k clocks to get a single operation through the pipe, yet we are retiring one operation every clock.

We will now look at the actual speedup of a pipeline in a vector processor, considering how full we can keep it. Several tasks (operations on the elements of a vector) may be simultaneously active in a pipeline.

[Figure: space-time diagram of a 4-stage pipeline (stages S1-S4 on the vertical axis) executing tasks T1 through T5 over pipeline cycles 0-8; each task advances one stage per cycle, so the diagonal wavefront fills and then drains the pipe.]

Suppose there are k stages in the pipeline and n tasks to be executed. The pipeline needs k - 1 extra clocks beyond the ideal n (filling at the beginning and emptying out at the end), so the n tasks finish in k + (n - 1) clocks rather than the nk clocks a non-pipelined unit would take. The speedup S(k) that is achieved when we account for the time it takes to fill the pipeline is therefore

    $S(k) = \frac{nk}{k + (n - 1)}$

As n (the number of tasks) approaches infinity, the speedup approaches k (the number of stages). Therefore, short vectors get little speedup and long vectors approach the maximum speedup.
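
A quick numeric check of the formula (a sketch; the vector lengths are arbitrary examples):

    #include <stdio.h>

    /* Speedup of a k-stage pipeline on n tasks: nk / (k + n - 1). */
    static double speedup(double k, double n)
    {
        return (n * k) / (k + n - 1.0);
    }

    int main(void)
    {
        /* 10-stage multiply pipe, short vs. long vectors: */
        printf("n = 8:    S = %.2f\n", speedup(10, 8));    /* about 4.7 */
        printf("n = 1000: S = %.2f\n", speedup(10, 1000)); /* about 9.9 */
        return 0;
    }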

It may be possible to partially overlap finishing one vector operation with starting up another vector operation, so that the drain of one overlaps the fill of the next.

[Figure: space-time diagram showing the startup of a second vector operation overlapped with the completion of the first.]

Vector instructions must be able to specify the stride for a vector, because the elements of a vector may not be stored in consecutive memory locations. For example, in our N x N matrix multiplication, vector A has a stride of 1 (the row) and vector B has a stride of N (the column). A constant stride may be specified so that every other (stride = 2), or every third (stride = 3), etc., vector element is loaded or stored.

Many problems involve sparse matrices, where the stride is random. In such cases, gather/scatter instructions are used to load and store data under the control of a vector register that contains pointers to the locations of the needed data: indirect addressing.

An arithmetic operation need not be performed on every element of a vector. In such a case, a mask register is constructed that controls which elements of the vector are loaded, operated on, and stored.

Assuming that we get all of the pipeline and logical operations worked out, the main problem with vector processors is feeding them. How much memory bandwidth do we need to feed an SX-4 processor with 64-bit operands?
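
A back-of-envelope answer, assuming each of the 8 multiply/add pipes consumes two 64-bit operands and produces one 64-bit result per 8 ns clock (a sketch, not NEC's specification):

    #include <stdio.h>

    int main(void)
    {
        double clock_hz = 125e6;  /* 8 ns clock */
        int    pipes    = 8;      /* parallel multiply/add pipes */
        int    streams  = 3;      /* two operands in, one result out, per clock */
        int    bytes    = 8;      /* 64-bit operands */

        double bw = clock_hz * pipes * streams * bytes;
        printf("%.0f GB/s\n", bw / 1e9);  /* prints 24 GB/s */
        return 0;
    }

Under these assumptions the demand exceeds the 16 GB/s crossbar port of an SX-4 processor, which is one reason on-chip vector registers that reduce memory traffic matter.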

If we had to feed the pipeline directly from interleaved memory, as Stone shows in Figure 5.4:

[Figure: timing diagram of eight memory modules over clock periods 0-12. Each module spends two clocks per access, cycling through reads of the A stream (RA0-RA7), reads of the B stream (RB0-RB7), and writes of the C stream (W0-W6), staggered so that one A element, one B element, and one C element are transferred every clock.]

The pipeline is running at 8 ns per clock and each operand is given two clocks, so the memory modules must each have an access time of 16 ns. This is a reasonable SRAM access time. Problems:

    1. Three of these modules need to transfer their 64-bit data words concurrently to/from the processor pipeline on every clock, requiring three 125 MHz busses into the processor, similar to Figure 5.2 in Stone.
    2. The three vectors must be stored in the modules as in Figure 5.3, so that the accesses to the memory modules are perfectly synchronized.

Back to Interleaved Memory

How can we organize memory to provide sequential access faster than any one module's cycle time? Recall that interleaved memory places consecutive words of memory in different memory modules:

    Memory module 0: words with addresses = 0 (mod 4)
    Memory module 1: words with addresses = 1 (mod 4)
    Memory module 2: words with addresses = 2 (mod 4)
    Memory module 3: words with addresses = 3 (mod 4)

Since a read or write to one module can be started before a read/write to another module finishes, reads/writes can be overlapped. Only the leading bits of the address are used to determine the address within the module. The least-significant bits (in the example above, the two least-significant bits) determine the memory module. Thus, by loading a single address into the memory-address register (MAR) and saying "read" or "write," the processor can read/write M words of memory. We say that memory is M-way interleaved.

Low-order interleaving distributes the addresses so that consecutive addresses are located in consecutive modules. For example, for 8-way interleaving:

    Module:   0    1    2    3    4    5    6    7
              0    1    2    3    4    5    6    7
              8    9   10   11   12   13   14   15
             16   17   18   19   20   21   22   23
             24   25   26   27   28   29   30   31
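
In code, low-order interleaving is just a split of the address bits; a minimal sketch for M = 8 modules (M must be a power of two):

    #include <stdio.h>

    #define M 8  /* number of modules; m = log2(M) = 3 address bits */

    int main(void)
    {
        /* Reproduce the 8-way interleaving table above. */
        for (unsigned addr = 0; addr < 32; addr++) {
            unsigned module = addr & (M - 1); /* low-order m bits pick the module */
            unsigned offset = addr >> 3;      /* high-order bits: word within module */
            printf("address %2u -> module %u, word %u\n", addr, module, offset);
        }
        return 0;
    }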

Interleaved-memory designs:

Interleaved memory divides an address into two portions: one selects the module, and the other selects an address within the module. Each module has a separate MAR and a separate MDR. When an address is presented, a decoder determines which MAR should be loaded with this address. It uses the low-order m = log2 M bits to decide this. The high-order n - m bits are actually loaded into the MAR; they select the proper location within the module.

[Figure: an n-bit address from the CPU is split into n - m high-order bits (the address within a module, driven over the address bus to the MAR of every memory unit) and m low-order bits (fed to a decoder that enables one of the 2^m MARs). Each of the 2^m memory units has its own MAR and MDR, and the MDRs share the data bus.]

An alternative to feeding a vector processor directly from external storage is to provide a hierarchical memory system similar to cache memory. Memory on the processor chip is called register storage rather than L1 cache, and it is managed directly by the programmer rather than automatically by the hardware.

A vector processor with high-speed register storage:

[Figure: main memory connects through vector load/store paths to a set of vector registers and scalar registers, which in turn feed the FP add/subtract, FP multiply, FP divide, integer, and Boolean functional units.]

The vector registers are large: 64 to 256 floating-point numbers each. At 256 floating-point numbers of 64 bits each, 8 registers are equivalent to a 16 KB internal data cache.

Masking

If statements in loops get in the way of vector processors. For example, consider an operation on a vector where you want to do something only if the element is not 0. You might code it as the following loop for a scalar processor:

    for i := 1 to n do
        if A[i] ≠ 0 then
            A[i] := A[i] - B[i];

This does not work well with a vector processor; we would like to specify an operation on the entire vector A. A vector mask register (VM) holds a Boolean vector that can be set to specify whether the operation on the corresponding vector element should be performed. The operation on a vector element takes place only if the corresponding mask bit in the VM is 1. For example, the following sequence could be used with the mask register:

    VLOAD  V1, A, N, Stride-i  ; Vector load row of A with stride i
    VLOAD  V2, B, N, Stride-j  ; Vector load column of B with stride j
    SLOAD  S0, #0              ; Scalar floating-point constant 0
    VMSNE  S0, V1              ; Set VM bit to 0 where V1[i] = S0
    VSUB   V1, V2              ; Vector subtract V2 from V1 (masked)
    VMC                        ; Clear vector mask to all 1s
    STORE  A, V1               ; Store vector A
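
What the mask register accomplishes, written out as scalar C; a minimal sketch of the semantics, not the hardware (the function and array names are illustrative):

    #include <stddef.h>
    #include <stdbool.h>

    /* Emulate the masked sequence above: VMSNE sets vm[i] = 1 where
       a[i] != 0, and VSUB then updates only those elements.
       a and b have n elements each; vm holds the mask. */
    void masked_sub(size_t n, double *a, const double *b, bool *vm)
    {
        for (size_t i = 0; i < n; i++)
            vm[i] = (a[i] != 0.0);  /* VMSNE S0, V1  (with S0 = 0) */
        for (size_t i = 0; i < n; i++)
            if (vm[i])
                a[i] -= b[i];       /* VSUB V1, V2 under the mask */
        /* VMC would then reset the mask to all 1s in hardware. */
    }

1997, 1999 E.F. Gehringer, G.Q. Kenney. CSC 506, Summer 1999.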