
University of Waterloo
Faculty of Engineering
Department of Electrical and Computer Engineering

xstream Processor
Group #057

Consultant: Dr. William Bishop

Andrew Clinton (00084747)
Sherman Braganza (00096130)
Alex Wong (00094660)
Asad Munshi (00168999)

Jan 14, 2005

Abstract

Stream processing is a data processing paradigm in which long sequences of homogeneous data records are passed through one or more computational kernels to produce sequences of processed output data. Applications that fit this model include polygon rendering (computer graphics), matrix multiplication (scientific computation), 2D convolution (media processing), and encryption. Computers that exploit stream computations can process data much faster than conventional microprocessors because they have a memory system and execution model that permit high on-chip bandwidth and high arithmetic intensity (operations performed per memory access). We have designed a general-purpose, parameterizable, SIMD stream processor that operates on 32-bit IEEE floating point data. The system is implemented in VHDL and consists of a configurable FPU, an execution unit array, and a memory interface. The FPU supports fully pipelined operations for multiplication, addition, division, and square root, with a configurable data width. The execution array operates in lock-step with an instruction controller, which issues 32-bit instructions to the execution array. To exploit stream parallelism, we have exposed parameters to choose the number of execution units as well as the number of interleaved threads at the time the system is compiled. The memory system allows all execution units to access one element of data from memory in every clock cycle. All memory accesses also pass through an inter-unit routing network, supporting conditional reads and writes of stream data. We have performed clock cycle simulation of our design using various benchmark programs, as well as synthesis to an Altera FPGA to verify component complexity. The eventual goal for this project is for the design to be synthesizable on a large FPGA or ASIC.

Acknowledgements

We would like to thank our supervisors Prof. William Bishop and Prof. Michael McCool, who have helped us to understand the complex field of computer hardware design; our project would not have been possible without their kind assistance. We also wish to thank the creators of the Symphony EDA VHDL simulation tools, which have made it possible for us to develop our system extremely rapidly at minimal cost.

Table of Contents

Abstract
Acknowledgements
List of Figures
List of Tables
1. Introduction
2. High Level Design
   2.1 Stream Processing on the xstream Processor
3. Detailed Design
   3.1 Floating Point Unit
       3.1.1 Hardware Modules
       3.1.2 Denormalizing
       3.1.3 Addition
       3.1.4 Multiplication
       3.1.5 Division
       3.1.6 Square Root
       3.1.7 Rounding and Normalizing
       3.1.8 Integer and Fraction Extraction
       3.1.9 Conditional Floating Point Logic Comparator
       3.1.10 Integer Arithmetic Unit
       3.1.11 Synthesis
   3.2 Instruction Set Architecture
       3.2.1 Word Size
       3.2.2 Bus Structure
       3.2.3 Storage Structure
       3.2.4 Register Windowing
       3.2.5 Pipeline
       3.2.6 Writeback Queue
       3.2.7 Multithreading
       3.2.8 Conditionals
       3.2.9 Instruction Set
       3.2.10 Instruction Controller
   3.3 Memory System
       3.3.1 Channel Descriptors
       3.3.2 Routing Interface
   3.4 Routing Network
       3.4.1 Barrel Shifter
       3.4.2 Compactor
   3.5 Assembler
   3.6 External Memory
       3.6.1 Overview
       3.6.2 SDRAM Model
       3.6.3 Memory Controller
4. Applications
   4.1 Normalize
   4.2 Fractal Generation
   4.3 Rasterization
5. Discussion and Conclusions
   5.1 Future Possibilities
       5.1.1 FPU Optimizations
       5.1.2 Synthesis
       5.1.3 Off-chip DRAM
References

List of Figures

Figure 1: xstream Processor Block Diagram
Figure 2: Component depiction of FPU
Figure 3: Diagram depicting four stage pipeline used in FPU
Figure 4: Denormalize Component
Figure 5: Addition Component
Figure 6: Multiplication Component
Figure 7: Division Component
Figure 8: Square root Component
Figure 9: Rounding and Normalizing Component
Figure 10: Integer and Fraction Extraction Component
Figure 11: Comparator Component
Figure 12: Integer Arithmetic Unit Component
Figure 13: Execution Unit Component Diagram
Figure 14: Execution Controller Component Diagram
Figure 15: Interleaved Multithreading
Figure 16: Description of channels in SRAM
Figure 17: Barrel Shifter network
Figure 18: Compactor network
Figure 19: An overview of the external memory architecture
Figure 20: SDRAM Component
Figure 21: Memory Controller Component
Figure 22: Memory Controller State Diagram
Figure 23: Fractal image generated by simulation

List of Tables

Table 1: Functional Unit Latencies
Table 2: Instruction Set
Table 3: Functional Unit Identification
Table 4: General instruction format
Table 5: Conditional Register format
Table 6: Condition Codes
Table 7: Command Listings
Table 8: Description of states in memory controller diagram

1. Introduction

Many applications of computers contain a high degree of inherent data parallelism. Such applications include graphics rendering, media processing, encryption, and image processing algorithms, each of which permits operations on different data elements to proceed in parallel. A fundamental problem of computer hardware design is to expose as much of this parallelism as possible without compromising the desired generality of the system. One such approach is stream processing, in which data parallelism is exposed by processing data elements independently and in parallel.

In stream processing, each record in a stream is operated on independently by a kernel, a small program applied to each data element, potentially in parallel. The result of applying a kernel to a stream is another stream. Significantly, the order in which streams and kernels are applied is very flexible, allowing the order in which operations are performed to be optimized. In this way, stream processing can be implemented in a way that minimizes resource use in a hardware design, allowing more computations to be performed using less area in a VLSI design and less memory bandwidth.

Most past research in stream processing has been pursued at Stanford University in the development of the Imagine stream processor ([1], [2], [3], [4], [5]). The Imagine is a VLIW stream processor developed primarily for media processing, with a prototype ASIC implementation. The Imagine supports a 3-level memory hierarchy including off-chip RAM, a high-bandwidth stream register file, and high-speed local registers contained in each processing element. Research at Stanford has investigated the implementation of conditional streams [3], in which conditionals are implemented in a SIMD architecture through multiple passes and on-chip routing. Research has also investigated low-level (kernel) scheduling for VLIW instructions and high-level (stream control) scheduling [5] for placing and moving streams within the stream register file. Research has also been pursued in configurable parallel processors in Texas [6], for the TRIPS processor.

Parallel processing has also undergone rapid development in the area of graphics processors (GPUs). The PixelFlow [7] system was one of the first to take advantage of parallel processing over rendered pixels. Current-generation GPUs now support highly flexible instruction sets for manipulating pixel shading information, while exploiting fragment parallelism. Hardware features have even been proposed to make GPUs more like stream processors ([8], [9], [10]). Many researchers have recognized the potential performance exposed by GPUs and have ported a wide range of parallelizable algorithms to the GPU, including linear algebra operations, particle simulations, and ray tracing ([11], [12]).

The introduction of general purpose GPU programming has led to the development of several high-level tools for harnessing the power of these devices. These attempts include the development of Brook for GPUs [13] (a project from Stanford, aimed at exposing the GPU as a general stream processor) and Sh ([14], [15], [16]) (a project at the University of Waterloo, which targets GPUs from ordinary host-based C++ code, an approach called metaprogramming).

Our project aims to merge results in these two areas by developing a stream processor that is compatible with existing APIs for GPU programming. In particular, our device allows most of the Sh system to be executed in hardware, including single precision floating point support. Our system also supports an inter-unit routing system allowing us to support conditional streams. We have designed our system to be efficient, modular, and simple to program.

2. High Level Design

A high-level block diagram of the processor is shown in Figure 1. There are several key components of the design, including a floating point unit (embedded within each of the processing elements), the execution units themselves, and an on-chip memory system consisting of an inter-unit routing network and a fast stream register file. The DRAM and host are components that would be integrated eventually for a full design.

Figure 1: xstream Processor Block Diagram

2.1 Stream Processing on the xstream Processor

The design implements the stream processing model by enforcing a restricted memory access model. All memory access from the execution unit array must pass through the stream register file, and all memory access must read or write at the head of the stream. The stream register file stores streams as a sequence of channels; each channel stores a component of a stream record. The channels are configured at startup of the device, and are dynamically read and written during execution.

Data parallelism is exploited in two major ways in the design. First, data elements can be processed in parallel by an array of processing elements. This results in a speedup of roughly N times for an array containing N processors, as there is little communication overhead for independent stream accesses. As the number of processors is increased, the memory and routing network automatically scale to match the new array size (with the routing only growing as N*log(N)). Second, interleaved multithreading is used within each processing element with an arbitrary number of threads, trading instruction level parallelism for the available (unlimited) data parallelism.

The execution model allows for conditional execution of read or write instructions to the memory bus. The system automatically routes stream data to the correct memory bank when processors are masked. Execution units are fully pipelined, so (with multithreading) a throughput of 1 instruction per cycle is almost always achieved. A software sketch of this execution model follows.
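The following is a minimal Python sketch of this execution model, not part of the VHDL design: records stream out of the channels in groups of one per processing element, every PE applies the same kernel in lockstep, and the results form new channels. The batch size and kernel signature are illustrative assumptions.

```python
def run_kernel(kernel, in_channels, num_pes=8):
    """Apply `kernel` to every record formed from the input channels."""
    records = list(zip(*in_channels))        # one record per stream element
    out = []
    for base in range(0, len(records), num_pes):
        batch = records[base:base + num_pes]  # one record per PE
        out.extend(kernel(*r) for r in batch) # PEs run in lockstep
    return [list(c) for c in zip(*out)]       # results as output channels

# A two-input kernel producing one output channel.
sums, = run_kernel(lambda a, b: (a + b,), [[1, 2, 3], [4, 5, 6]])
print(sums)   # [5, 7, 9]
```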

3. Detailed Design

This section describes in detail each of the components in the system. It constructs the entire design starting with the low-level components (floating point) and proceeding through to the execution unit and instruction set, and finally to the memory system and routing network.

3.1 Floating Point Unit

The floating point arithmetic unit detailed in this document is based on a core parameterized floating point library developed at the Rapid Prototyping Lab (RPL) [17]. A number of extensions were made to the existing core library to provide additional arithmetic operations such as floating point division and square root. Furthermore, a parameterized floating point arithmetic unit interface was created to provide a single core component for general interfacing with the rest of the processor architecture.

Rather than develop the floating point unit from the ground up, it was decided that it would be more efficient to build upon an existing core and add the functionality needed for the project. The following floating point libraries were evaluated:

- RPL library of variable floating point modules
- Single Precision Floating Point Unit Core Project from Portland State University
- FPLibrary, a library of floating point modules from the Arénaire project
- FPHDL, a floating point library by Professor David Bishop

The FPLibrary implementation was discarded early on due to the high complexity of the design, making it virtually incomprehensible for use in our project. The implementation from Portland State was also discarded early on for a number of reasons. First of all, the synthesized performance of the unit was below that of the RPL and FPHDL

implementations. Secondly, the core was hard-coded to perform only single precision floating point arithmetic, which constrains future expansion to double precision arithmetic. Most importantly, the unit failed to produce correct results when calculating simple test values.

Between the RPL library and the FPHDL implementation, it was decided that the RPL library was the most appropriate for our purposes, for a number of reasons. First of all, the FPHDL implementation is still at a relatively early stage, so stability and reliability issues may arise from using it. Secondly, the VHDL code was structured in a very generic and not very hardware-centric way; the operations were not pipelined or area-conscious, and would require a high-end synthesis tool such as Synopsys to generate hardware that is optimized for performance. The RPL library, on the other hand, was designed to be very hardware-oriented and provided a fully pipelined structure that was efficient and parameterized.

3.1.1 Hardware Modules

The FPU core used in the project has the following features:

- Implements single precision (32-bit) arithmetic
- Implements floating point addition, subtraction (by negation), multiplication, division, and square root
- Implements two of the four IEEE standard rounding modes: round to nearest and round to zero
- Exceptions are partially implemented and reported according to the IEEE standard

The above specification was based on a number of important design decisions. For the purpose of the project, the parameterized floating point core unit was configured to comply with the IEEE 754 standard for single precision floating point representation. Therefore, the number of mantissa bits and exponent bits are set to 23 and 8 respectively, with a single sign bit. This would provide us with the level of precision

needed to perform basic floating point calculations, while still being manageable in terms of size and testing effort. Operations on denormalized floating point numbers are not supported because, based on research done by Professor Miriam Leeser at the Rapid Prototyping Lab, they would require a great deal of extra hardware to implement and are therefore not necessary for our purposes. The above list of floating point arithmetic operations was chosen as it provides the functionality needed for a subset of the Sh API, upon which all other operations can be built. Only two of the four IEEE standard rounding modes were implemented, as the additional rounding modes (+inf and -inf) were deemed unnecessary for our purposes. Finally, only part of the exception handling specified by the IEEE standard was implemented in the FPU core. The inexact exception is ignored, while all other exceptions generate zero values at the output of the FPU along with a general exception flag. This was chosen to provide compatibility with the Sh API.

Figure 2: Component depiction of FPU

The unit takes in two such 32-bit floats as input. In order for the unit to begin execution, the ready line must also be set high. An extra line has been provided for exception input to handle the case where multiple FPUs are chained together. There is a clock line and a mode input that selects the current instruction, as listed below. The output is written to the OUT1 line as a 32-bit standard single precision float. In addition, in the event of a divide by zero or the square root of a negative number being taken, the exception output line is also driven high. Finally, there is a line to indicate when the

current instruction is completed, although this may prove unnecessary in the future. To set the floating point operation to perform on the FPU, the MODE bits are set in the following manner:

1) Add (Mode lines set to 0000)
2) Multiply (Mode lines set to 0010)
3) Divide (Mode lines set to 0011)
4) Square Root (Mode lines set to 0100)

The Mode input is translated into an internal one-hot encoding scheme for selecting the appropriate arithmetic units to enable. The FPU has been designed to operate using a six-stage pipeline. The data flow of the unit is presented below.

[Figure: operands A and B are denormalized into 33-bit floats, pass through the selected FP operation (addition, multiplication, division, or square root, 4 stages each; square root accepts only operand A), and are then rounded and normalized into IEEE floating point representation (2 stages); an exception generator produces the exception output alongside the 32-bit result.]

Figure 3: Diagram depicting four stage pipeline used in FPU

The first four stages involve the denormalizing of the floating point numbers and the arithmetic operation. The last two stages involve the rounding and normalizing of the floating point value. The names and functions of the hardware modules used by the unit are listed below:

- Denorm (denormalize floating point numbers)
- rnd_norm (normalize and round results)
- fp_add (perform addition on two floats)
- fp_mul (perform multiplication on two floats)
- fp_div (perform division on two floats)
- fp_sqrt (generate the square root of a float)

3.1.2 Denormalizing

Figure 4: Denormalize Component

In the normalized format of a floating point number, a single non-zero bit is used to represent the integer part of the mantissa. In the IEEE 754 normalized binary representation, however, this bit is discarded to save space, as it is implied to be there. This implied 1 is required to perform proper arithmetic operations. Therefore, the task of the denormalizing module is to add the implied 1 back into the binary representation. The output of the module is 1 bit wider than the input value. The denormalizing module is asynchronous and purely combinational.
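As an illustration, here is a minimal Python model of what the Denorm module computes (the VHDL is width-generic; the 32-bit field layout below follows IEEE 754 single precision):

```python
import struct

def denormalize(bits: int):
    """Split an IEEE 754 single into (sign, exponent, mantissa), with
    the implied leading 1 restored so the mantissa is 24 bits wide.
    Assumes a normalized, non-zero input (denormals are unsupported)."""
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF
    mantissa = (1 << 23) | (bits & 0x7FFFFF)   # restore the implied 1
    return sign, exponent, mantissa

# 1.5f = 0x3FC00000; its 24-bit mantissa is 0xC00000 (binary 1.1)
bits = struct.unpack('>I', struct.pack('>f', 1.5))[0]
print([hex(v) for v in denormalize(bits)])
```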

3.1.3 Addition

In the addition unit, the floating point addition is performed in the following way:

1) align the mantissas of the input values
2) perform integer addition on the mantissas
3) shift the resultant mantissa to the right by one bit and increment the exponent by one if overflow occurs
4) merge the new mantissa and exponent components back into a proper floating point representation

Subtraction is handled as addition by a negative number. This module is pipelined and requires four cycles to perform an operation.

Figure 5: Addition Component

3.1.4 Multiplication

In the multiplication unit, the floating point multiplication is performed in the following manner:

1) Perform integer addition on the exponents
2) Perform integer multiplication on the mantissas
3) Subtract the bias from the adjusted exponent
4) Merge the new mantissa and exponent components back into a proper floating point representation

This module is pipelined and requires four cycles to perform an operation.

Figure 6: Multiplication Component
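A small Python model of the multiplication steps on denormalized operands may help. It is a sketch only: the sign handling and result width are our assumptions, and rounding is deferred to the later rnd_norm stage.

```python
BIAS, MBITS = 127, 23

def fp_mul(a, b):
    """Model of fp_mul on (sign, exponent, 24-bit mantissa) triples as
    produced by the Denorm stage: add exponents, multiply mantissas,
    then remove the doubled bias."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    sign = sa ^ sb
    exponent = ea + eb - BIAS     # steps 1 and 3
    product = ma * mb             # step 2: 24 x 24 -> 48-bit product
    mantissa = product >> MBITS   # scale back to 23 fraction bits;
    return sign, exponent, mantissa   # rnd_norm normalizes and rounds

# 1.5 x 2.5: yields (0, 128, 0xF00000), i.e. 1.875 * 2**1 = 3.75
print(fp_mul((0, 127, 0xC00000), (0, 128, 0xA00000)))
```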

3.1.5 Division

In the division unit, the floating point division is performed in the following manner:

1) Perform integer subtraction on the exponents
2) Perform integer division on the mantissas
3) Add the bias to the adjusted exponent
4) Merge the new mantissa and exponent components back into a proper floating point representation

The integer division was implemented using the non-restoring division algorithm [18]. This module is pipelined and requires four cycles to perform an operation.

Figure 7: Division Component
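For reference, a Python sketch of non-restoring division as we understand it from [18]; the 24-bit width matches the mantissa, though the exact hardware formulation may differ:

```python
def nonrestoring_div(n: int, d: int, bits: int = 24):
    """Non-restoring integer division of unsigned n by d (d != 0).
    Instead of restoring a negative partial remainder, the next step
    adds the divisor rather than subtracting it. Returns (q, r)."""
    r, q = 0, 0
    for i in range(bits - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)     # bring down the next dividend bit
        r = r - d if r >= 0 else r + d    # subtract or add, never restore
        q = (q << 1) | (1 if r >= 0 else 0)
    if r < 0:                             # final remainder correction
        r += d
    return q, r

print(nonrestoring_div(100, 7))           # (14, 2)
```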

3.1.6 Square Root

In the square root unit, the floating point square root operation is performed in the following manner:

1) perform integer subtraction on the exponents
2) subtract the bias from the exponent
3) shift the unbiased exponent to the right by 1 bit (essentially performing a divide by 2 operation)
4) perform integer square root on the mantissa
5) merge the new mantissa and exponent components back into a proper floating point representation

The integer square root was implemented using the non-restoring square root algorithm. This module is pipelined and requires four cycles to perform an operation.

Figure 8: Square root Component
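As a point of reference, here is a bit-serial integer square root in Python. It is written in the simpler restoring style, whereas the hardware uses the non-restoring variant of the same digit-by-digit recurrence:

```python
def isqrt(n: int) -> int:
    """Digit-by-digit integer square root: returns floor(sqrt(n)) for
    n < 2**48. One result bit is resolved per iteration, which maps
    naturally onto a pipelined hardware implementation."""
    x, bit = 0, 1 << 46          # largest power of four below 2**48
    while bit > n:
        bit >>= 2
    while bit:
        if n >= x + bit:
            n -= x + bit
            x = (x >> 1) + bit
        else:
            x >>= 1
        bit >>= 2
    return x

print(isqrt(10))   # 3
```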

3.1.7 Rounding and Normalizing

Due to the introduction of guard bits when performing arithmetic operations on floating point values, it is necessary to normalize the resulting floating point value. To normalize the value, the implied 1 that was added during the denormalizing process is removed. Secondly, the width of the mantissa is reduced to comply with the IEEE 754 standard for single precision floating point representation. This form of truncation can lead to errors which can compound over a number of arithmetic iterations, so rounding is performed based on either round to nearest or round to zero, depending on the ROUND input signal. The output of the module is a single-precision normalized value. This module requires two cycles to perform the operation.

Figure 9: Rounding and Normalizing Component
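A rough Python model of the two rounding modes follows. The guard-bit count and the round-half-up tie behaviour are simplifying assumptions of ours; IEEE round-to-nearest-even differs on exact ties.

```python
def rnd_norm(sign: int, exp: int, mant: int, gbits: int, nearest: bool) -> int:
    """Renormalize a result mantissa carrying `gbits` guard bits, round,
    drop the implied 1, and repack the IEEE 754 single fields."""
    top = 23 + gbits                      # bit position of the implied 1
    while mant >= (1 << (top + 1)):       # overflow, e.g. 1.x * 1.y >= 2
        mant >>= 1
        exp += 1
    while mant and mant < (1 << top):     # underflow, e.g. after subtraction
        mant <<= 1
        exp -= 1
    if nearest:
        mant += 1 << (gbits - 1)          # add half an ULP (round half up)
        if mant >= (1 << (top + 1)):      # rounding can carry out again
            mant >>= 1
            exp += 1
    mant >>= gbits                        # round to zero: drop guard bits
    return (sign << 31) | ((exp & 0xFF) << 23) | (mant & 0x7FFFFF)

print(hex(rnd_norm(0, 128, 0xF00000 << 2, 2, True)))   # 0x40700000 = 3.75f
```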

3.1.8 Integer and Fraction Extraction

In the floating point component extraction unit, the integer and fraction extraction operations are performed in the following manner:

1) Set up the exponent bias based on the bit length of the exponent component of the floating point value
2) Determine the actual exponent by subtracting the bias from the exponent component of the floating point value
3) Create a bit mask for masking out the fractional component
4) Extract the integer value from the mantissa component using the bit mask
5) Extract the fractional value from the mantissa component using the inverse of the bit mask
6) Reconstruct the denormalized floating point representation of the fractional component by appending the sign and exponent bits of the original floating point value to the extracted fractional value
7) Reconstruct the normalized floating point representation of the integer component by appending the sign and exponent bits of the original floating point value to the extracted integer value
8) Pad the mantissa component of the extracted fractional value for normalization
9) Normalize the fraction using the normalization unit

This module requires one cycle to perform the operation.

Figure 10: Integer and Fraction Extraction Component

3.1.9 Conditional Floating Point Logic Comparator

The conditional floating point logic comparator is used to perform logical comparisons between two floating point values. The output of the comparator is an 8-bit condition code that is stored in the condition code registers. The following comparisons are performed on the two floating point values:

1) equal to
2) not equal to
3) less than
4) less than or equal to
5) greater than or equal to
6) greater than

This module requires one cycle to perform the operation.

Figure 11: Comparator Component
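A Python sketch of the decoded condition code follows. The assignment of one bit per condition value (matching Table 6 in Section 3.2.9, with bit 0 = never and bit 7 = always) is our reading of the encoding, not something the report pins down:

```python
def fp_compare(a: float, b: float) -> int:
    """Model of the comparator: evaluate all six comparisons at once
    and return an 8-bit decoded condition code (bit index = condition
    value; 'never' and 'always' bound the useful six)."""
    results = [False, a == b, a != b, a < b, a <= b, a >= b, a > b, True]
    code = 0
    for i, hit in enumerate(results):
        code |= int(hit) << i
    return code

print(bin(fp_compare(1.0, 2.0)))   # 0b10011100: ne, lt, le, always
```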

3.1.10 Integer Arithmetic Unit

An optional integer arithmetic unit is also available for integration into the processor. This unit can be used in situations where a complex floating point unit is not necessary, such as in certain compression and encryption algorithms where only integer arithmetic is used. Integrating only integer arithmetic units in such situations is beneficial from both a space and a performance perspective.

The unit takes in two (or three) 32-bit signed integer values as input, depending on the operation. There is a clock line and a mode input that selects the current instruction, as listed below. The output is written to the output line as a 32-bit standard signed integer value. To set the integer operation to perform, the MODE bits are set in the following manner:

1) Add (Mode lines set to 0000)
2) Subtract (Mode lines set to 0001)
3) Multiply (Mode lines set to 0010)
4) Divide (Mode lines set to 0011)
5) MAC (Mode lines set to 0101)

The Mode input is translated into an internal one-hot encoding scheme for selecting the appropriate arithmetic units to enable. This module requires one cycle to perform an arithmetic operation.

Figure 12: Integer Arithmetic Unit Component

3.1.11 Synthesis

The floating point unit has been tested and synthesized for the Altera Stratix EP1S10 using Quartus II versions 3.0 and 4.0 SP1. Synthesis reported 3532 logic elements (LEs), a maximum clock speed of 89.5 MHz, a total pin count of 107, and the use of 8 of the 48 available built-in multipliers.

3.2 Instruction Set Architecture

This section describes the hardware design at the instruction set level. Several design goals have guided this phase of the development, including:

- Compatibility with the Sh API
- Instruction set scalability and orthogonality
- Latency hiding for various levels of parallelism
- Interleaved multithreading support

The component diagram of an execution unit is shown in Figure 13. Note that each of the bus connections is multiplexed by the control signals arriving from the execution controller.

Figure 13: Execution Unit Component Diagram

The component diagram below shows the execution controller. The execution controller also includes a high-speed instruction cache for kernels.

Figure 14: Execution Controller Component Diagram

3.2.1 Word Size

Most graphics processors support quad 32-bit floats as the basic data type. The operations that can be performed on these quantities include component-wise multiplication,

addition, and dot products, with arbitrary negation and swizzles (vector component reordering). For graphics processing, single-cycle execution of these instructions permits very fast execution of vertex transformations and lighting calculations, which make effective use of the 4-way parallel operations. Sh also supports quad-floats as one of its basic data types.

The alternative to using quad-floats as a basic data type is to emulate vector operations with scalar operations at the hardware level. For applications making full utilization of the quad-float data pipe, this emulation will result in a drop in performance, as additional hardware is needed per floating point pipeline. However, for operations on streams of one to three floating point values, a single floating point pipeline could increase performance by improving utilization of the execution units. Our design uses a single floating point pipeline within each execution unit. This has the added advantage of reducing the minimum required area for synthesizing the design to an FPGA.

3.2.2 Bus Structure

Allowing a single instruction to complete every cycle requires sufficient register bandwidth for 2 input operands to be sent to the floating point unit and 1 output operand to be received from it every cycle. This requires a 3-bus structure with 2 input buses and one output bus. One input bus and the output bus are also shared with the external routing interface for loading and storing external data to the register file.

3.2.3 Storage Structure

Each execution unit requires a local register file, which must be fast (single-cycle latency) and must also support a significant number of registers. Since we are using a 3-

bus structure, the register file must have 2 input and 1 output ports. A larger number of registers is required due to the limited access to memory in a stream architecture, so all constants, results, and temporaries must be stored in the register file.

In addition to local registers, many algorithms may require extended read-only storage for indexed constants or larger data structures. These can be stored in a block of read-only SRAM within each execution unit, into which indexed lookups may be performed. Data in this local memory must be transferred to registers before it can be operated on. These indexed memories can use a single input and output port.

We have implemented a single register file with a variable number of 32-bit floating point registers. The instruction word allows up to 256 floating point registers in the future. The indexed memory could be larger (perhaps 1024 registers).

3.2.4 Register Windowing

To increase the number of effective registers, it is possible to use register windows, in which parts of a single large register file are exposed depending on the current window location. For example, the register file could hold 1024 32-bit floating point registers but expose only 64-word windows, which reduces the complexity of the logic in the register file as well as the width of register references in instruction words. Register windowing is typically used to prevent register spilling (that is, to prevent registers from being stored into main memory when a procedure is called).

Register windowing is useful to a stream processor because the amount of temporary storage allowed by a fixed register file may be more limited than what is demanded by some applications. However, the addition of register windowing to the hardware design should not require large changes (only the addition of register window manipulation instructions and compiler support). For this reason we have elected not to

support register windowing in the initial design, while it may be included in future revisions.

3.2.5 Pipeline

The arithmetic units that are used within each execution unit include the components shown in Table 1. There are two different latencies for the units in the design, resulting in a complication when trying to pipeline the entire design. To create the pipeline for these units, we were faced with the choice of either fixing the pipeline length at the longer of the two latencies (4 execute stages) and incurring a performance penalty for the shorter-latency instructions, or supporting both latencies in separate pipelines and automatically resolving the structural hazards arising from this configuration. We chose a design that allows for multiple variable pipeline lengths.

Table 1: Functional Unit Latencies

Component                                      Latency
Floating point ALU                             4 cycles
Floating point comparator                      1 cycle
Floating point fractional / integer converter  1 cycle
Indexed load/store                             1 cycle

To understand the design, it is necessary to consider all possible pipeline hazards.

1. Structural hazards: A structural hazard arises when a single hardware unit is needed simultaneously by multiple sources. By allowing multiple instruction latencies in the execute stage, we introduce a structural hazard on the writeback bus.

2. Control hazards: Control hazards arise when a conditional branch instruction causes instructions that have already been partially completed to be invalidated. The stream processor does not implement conditional branch instructions, so there is no chance of a control hazard arising.

3. Data hazards: There are 3 variants of data hazards: read after write (RAW), write after read (WAR), and write after write (WAW). Without further restrictions on instruction completion order, each of these types of data hazards is possible in the stream processor. To avoid WAR and WAW hazards, it is sufficient to enforce in-order writeback of instructions. Avoiding RAW hazards requires an additional hazard detection unit.

To automatically solve all hazard detection problems (structural and data hazards), we have elected to design a hardware writeback queue to track instructions in the pipeline.

3.2.6 Writeback Queue

The simplicity of the execution unit design ensures that the only state-changing operation that can occur is the modification of a register. Registers can only be modified in the common writeback stage for all instructions. The writeback queue makes sure that 1) instructions are issued only when there is no conflict on the writeback bus, and 2) instructions are issued only when there is no pending instruction that needs to write to one of the current read registers.

Within the queue are a number of registers. Each register in the queue stores the 8-bit destination register number as well as an enable bit for each pending writeback operation. The queue slots are numbered from 0 to 7, with slot 0 containing the instruction that is currently present in the writeback stage. Newly issued instructions are inserted in the queue slot determined by the latency of the instruction. In our case, instructions are placed into the queue during the decode stage, so FPU instructions are added to slot 5 while single-cycle execute instructions are added to slot 2. The queue is a shift register with all registers being shifted towards 0. The stall signal is issued under two conditions:

1. When the desired slot in the queue is already enabled, a conflict has been detected on the bus.

2. When any enabled queue slot contains a writeback register equal to one of the registers being read, a data dependency needs to be avoided.

The hardware complexity to detect the first condition is trivial. For the second condition, a comparator is required for every queue slot for every read register. So in the case of 8 queue slots and 2 possible read registers, the number of 8-bit comparators required is 16.
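The queue's behaviour can be summarized with a small Python model (a sketch of the mechanism described above, not the VHDL; the slot indices follow the text):

```python
class WritebackQueue:
    """Model of the writeback queue: one slot per cycle of remaining
    latency; slot 0 holds the instruction currently in writeback."""
    def __init__(self, depth: int = 8):
        self.slots = [None] * depth      # None = disabled, else dest reg

    def must_stall(self, dest_slot: int, read_regs) -> bool:
        if self.slots[dest_slot] is not None:   # writeback bus conflict
            return True
        # pending write to a register this instruction reads (RAW hazard)
        return any(reg in self.slots for reg in read_regs)

    def issue(self, dest_slot: int, dest_reg: int) -> None:
        self.slots[dest_slot] = dest_reg

    def tick(self) -> None:
        self.slots = self.slots[1:] + [None]    # shift towards slot 0

q = WritebackQueue()
# FPU ops enter at slot 5, single-cycle execute ops at slot 2
if not q.must_stall(5, read_regs=(3, 7)):
    q.issue(5, dest_reg=9)
q.tick()
```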

3.2.7 Multithreading

The most important characteristic of the stream programming model is that each element can be operated on in parallel. To increase utilization of the pipeline and reduce the possibility of processor stalls, it is possible to take advantage of this characteristic by implementing multiple threads of execution within a single execution unit. There are several methods of implementing multithreading, ranging from very fine-grained (switching threads on every clock cycle) to coarse-grained (switching threads after a fixed time interval, or only on costly operations like cache miss reads). Since our design does not permit any operation latencies of greater than 7 cycles, an efficient means of decreasing the effective latency is to support fine-grained interleaved multithreading.

Interleaved multithreading is supported by duplicating the state-holding components in the design for each thread, so in our case each register file needs to be duplicated for each thread. In addition, the control hardware needs to broadcast the currently active thread for each of the pipeline stages: decode, execute, and writeback.

Figure 15: Interleaved Multithreading

To support multithreading, we have added an additional queue to the instruction controller that stores the active thread for each active cycle. When control signals are broadcast to each execution unit, the thread identifier is passed along so that the appropriate state-holding component is selected at the execution unit. Note that a different thread can be active for each pipeline stage, so when the signals are broadcast the thread identifier needs to be duplicated for each of the decode, execute, and writeback stages of the pipeline.

3.2.8 Conditionals

Conditionals are handled in various ways in SIMD processors. In general, conditionals cannot be implemented directly in a SIMD processor because the processor array is required to remain in lockstep (i.e., each processor executes the same instruction on a given clock cycle). Traditionally, conditionals have been implemented by masking the execution of processors for which a condition fails. We have chosen to implement conditionals using the method of predication, in which all processors first evaluate all conditional branches and then conditionally write back their results into the register file.
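In effect, predication turns a branch into a select. Here is a minimal Python model of a predicated writeback across the PE array; the condition-code bit layout follows our reading of the encoding in Section 3.2.9, and everything else is illustrative:

```python
def predicated_writeback(cond_codes, condition, new_vals, old_vals):
    """Every lane computes new_vals; only lanes whose Boolean register
    has the selected condition bit set commit the new value, so the
    array stays in lockstep."""
    return [new if (cc >> condition) & 1 else old
            for cc, new, old in zip(cond_codes, new_vals, old_vals)]

# condition 3 = 'less than'; lane 0 satisfied it, lane 1 did not
print(predicated_writeback([0b00001000, 0b01000000], 3,
                           [1.0, 2.0], [9.0, 9.0]))   # [1.0, 9.0]
```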

Conditionals are handled in the Sh intermediate representation through conditional assignment of 0.0 or 1.0 to a floating point value based on a comparison of two other floating point values. Although conditional support using floating point values is easy to implement, this approach leads to additional usage of floating point registers to store Boolean results. A more efficient alternative is to support a new series of single-bit registers for storing Boolean values. We have decided to implement Boolean operations using dedicated Boolean registers. This reduces the number of floating point instructions and relieves floating point registers from storing Boolean values.

3.2.9 Instruction Set

The instruction set includes all the operations shown in Table 2, with the given latency and functional unit.

Table 2: Instruction Set

Instruction  Latency  Functional Unit  Operation
NOP          4        BUS              None
ADD          7        FPU              Rd ← R1 + R2
MUL          7        FPU              Rd ← R1 * R2
DIV          7        FPU              Rd ← R1 / R2
SQRT         7        FPU              Rd ← sqrt(R1)
INT          4        FRAC             Rd ← floor(R1)
FRAC         4        FRAC             Rd ← R1 - floor(R1)
CMP          4        CMP              Cd ← compare(R1, R2)
COND         4        BUS              Rd ← {R1 when C1}
LDI          4        IDX              Rd ← index(R1)
STI          4        IDX              index(R1) ← R2
GET          4        BUS              {Rd ← route(imm) : C1}
PUT          4        BUS              {route(imm) ← Rd : C1}
CON          4        BUS              Rd ← scatter(imm)

Operation Notation:

1. Rd denotes a destination register. R1 and R2 denote the two register inputs.
2. Cd denotes a destination Boolean register. C1 denotes an input Boolean register.
3. imm denotes an input immediate value.
4. compare(): performs the comparison operation described in Section 3.1.9.
5. index(): performs a lookup in the indexed constant table using the value of the given register as the index.
6. route(): invokes the functionality of the inter-execution-unit routing network. In particular, it is used to compact outgoing data when some units do not produce output, and to route incoming data to the execution units needing new elements.

Functional units are the components that are used during the execute (EX) stage of pipeline execution. The functional units that are available are shown in Table 3.

Table 3: Functional Unit Identification

Functional Unit  Description
BUS              The operation uses only the internal execution unit buses
FPU              The operation invokes the floating point FPU
FRAC             The operation uses the integer/fractional part extraction unit
CMP              The operation uses the floating point comparator unit
IDX              The operation uses the index constant table

Instruction Format

The generic instruction bit layout is:

Table 4: General instruction format

Field  Opcode      Rd or Cd    (Unused)  Negate 1  R1         Negate 2  R2
Bits   31..27 (5)  26..19 (8)  18 (1)    17 (1)    16..9 (8)  8 (1)     7..0 (8)

The Negate 1 and Negate 2 bits allow for free asynchronous negation of input values. This allows any instruction to negate its inputs. Instructions that use conditional registers have the following format:

Table 5: Conditional Register format

Field  Opcode      Rd          (Unused)    C1 reg      C1 imm     Condition
Bits   31..27 (5)  26..19 (8)  18..17 (2)  16..12 (5)  11..9 (3)  8..0 (9)

The condition register field determines which Boolean register should be used. Up to 32 independent Boolean registers are supported by this encoding. Each Boolean register stores 8 bits indicating the decoded condition codes. The possible conditions are shown in Table 6.

Table 6: Condition Codes

Value  Condition
0      Never satisfied
1      Equal
2      Not Equal
3      Less than
4      Less or equal
5      Greater than or equal
6      Greater than
7      Always satisfied

This choice of instruction format allows scalability in the number of operations (up to 32) and scalability in the size of the local register file (up to 256 entries).
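As a concrete check of the field widths, a Python encoder for the general format of Table 4 follows. The numeric opcode values are not specified in this report, so the value used in the example is hypothetical:

```python
def encode(opcode: int, rd: int, r1: int, r2: int,
           neg1: bool = False, neg2: bool = False) -> int:
    """Pack the Table 4 fields into a 32-bit word; bit 18 stays unused."""
    assert opcode < 32 and rd < 256 and r1 < 256 and r2 < 256
    return ((opcode & 0x1F) << 27) | ((rd & 0xFF) << 19) \
         | (int(neg1) << 17) | ((r1 & 0xFF) << 9) \
         | (int(neg2) << 8) | (r2 & 0xFF)

# Hypothetical ADD opcode 1: R2 <- R3 + (-R4), using the free negate bit
print(hex(encode(1, 2, 3, 4, neg2=True)))   # 0x8100704
```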

3.2.10 Instruction Controller

The active kernel will be stored in SRAM local to the instruction control hardware. The control hardware will be a simple state machine that reads and issues instructions. Because there are no branching instructions, and all data dependencies are resolved at compile time, the controller simply steps through the program and issues one instruction per cycle.

Instructions will be decoded at the individual execution units, so the common instruction controller will consist mainly of the state machine. The alternative is to decode instructions at the instruction controller, which would introduce a significant number of new signals to be routed from the instruction controller to the execution units. We have elected not to use this approach.

3.3 Memory System

This section describes the on-chip memory system, including the stream register file (SRF), SRF controller, and inter-element routing network. The memory system was designed with a number of design goals in mind. These include:

- Direct support for the stream and channel primitives exposed by Sh
- Bandwidth scalability to support an array of execution units
- Hardware support for conditional stream operations

The stream register file is intended to store input, output, and temporary channels that are allocated during execution. A channel is a sequence of independent 32-bit floating point

values. For operations on records containing many 32-bit values, the records must first be decomposed into channels and then stored into the SRF separately (for example, vectors <x,y,z> must be decomposed into an x-, y-, and z-channel before processing). The purpose of the stream register file is to provide a high-bandwidth on-chip storage component so that intermediate results can be stored and accessed quickly by the execution array.

The SRF is composed of several memory banks that permit parallel access. The number of banks in the SRF is coupled to the number of processors that need to make parallel access to the SRF, so for a processing array containing 8 processors, it is necessary to provide 8 equal-sized banks in the SRF memory.

3.3.1 Channel Descriptors

Only a single channel in the SRF can be accessed at once. When the channel is accessed, any number of processors can retrieve data from the channel. Since channels are always accessed in sequential order, it is necessary to keep track of the current address for access within that channel. This is done by storing an array of channel descriptors. Each channel descriptor consists of the current address and final address within that channel. When the current address reaches the final address, the channel is empty and any further references to that channel are rejected with an error.

Figure 16: Description of channels in SRAM
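A channel descriptor is easy to model in a few lines of Python (a sketch only; the hardware additionally tracks per-bank offsets, which are omitted here):

```python
class ChannelDescriptor:
    """Current and final address of a channel in the SRF; accesses
    always advance the head, and a reference past the final address
    is rejected as an error."""
    def __init__(self, start: int, final: int):
        self.current = start
        self.final = final

    def advance(self, count: int) -> int:
        """Consume `count` elements (one per requesting PE)."""
        if self.current + count > self.final:
            raise IndexError("channel is empty")
        head = self.current
        self.current += count
        return head
```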

3.3.2 Routing Interface

For interfacing with the routing hardware, it is sufficient for the stream register file to store the current channel address for each channel. When a routed memory read is performed, the memory interface decodes the low-order bits of the channel address to form an array of offset bits for each bank of the SRF. These offset bits store whether a given bank of memory will need to read from the current or next line of the bank. When a memory read is performed, the number of execution units requesting new data determines the amount by which the memory address needs to be increased, and the next sequence of banks is used to retrieve the data.

3.4 Routing Network

The routing network is what connects all the execution units to the static RAM banks. The same network is used to fulfill both roles: providing the processing elements with data, and storing the output back into RAM. It has been designed such that it can operate in either direction, from memory to PEs or from PEs to memory.

The complexity involved in constructing such a network comes from three different requirements. Firstly, the network has to be parameterizable, as it has to be able to expand to the number of PEs used during a simulation. Secondly, each PE has been designed with conditional read and conditional write commands. These conditions are based on the condition codes set during the execution of the program, so the network has to handle reading or writing a number of data elements that could be fewer than the total number of processing elements available. Lastly, the network has been constructed so that each PE does not need to track the next memory location from which data will be available; this is especially important given that a PE will not necessarily read data from memory every time a read instruction is invoked.

To accomplish the tasks mentioned above, the network is divided into two components: a barrel shifter and a compactor. The barrel shifter reorders the data being transferred. The compactor removes any gaps that may exist in the data, which is useful when certain PEs do not produce any data; this way, data being stored to memory does not have gaps. The same compactor can be used to introduce gaps between data elements when certain PEs do not need to read any new data. Together, the compactor and the barrel shifter provide data to each PE and write back any output produced, while taking care of the three issues mentioned above.

3.4.1 Barrel Shifter

The barrel shifter is quite a simple design. It reorders data as it passes through the network, which is important when reading in data, since the first physical processing element might not be the first one that is actually accepting data. It is also important during the writeback stage, because it allows the network to reorder data to the next available slot in SRAM. The shifter is built up of many 2x2 multiplexers, arranged in log2(n) stages. This ensures that when the number of processing elements grows, the barrel shifter does not grow out of proportion. Figure 17 shows a barrel shifter with four inputs; a software sketch follows it.

Figure 17: Barrel Shifter network
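Functionally, the network performs a rotation whose amount is decomposed across the stages. A Python sketch of the four-input case, generalized to any power-of-two n:

```python
def barrel_shift(data, amount):
    """log2(n)-stage barrel shifter: stage k is a row of 2x2 multiplexers
    that either passes data straight through or rotates it by 2**k, so
    any rotation amount is the sum of the enabled stages."""
    n = len(data)                  # must be a power of two
    stage = 1
    while stage < n:
        if amount & stage:         # select bit for this mux row
            data = data[stage:] + data[:stage]
        stage <<= 1
    return data

print(barrel_shift(['a', 'b', 'c', 'd'], 3))   # ['d', 'a', 'b', 'c']
```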

3.4.2 Compactor

The compactor is a network that compresses data elements when writing to memory, to remove any gaps. As mentioned before, these gaps can be produced when a processing element executes a put instruction. Since each put instruction is based on a condition code, if the condition is not satisfied, the processing element will not output data. Therefore, there could be gaps between the data elements written back to memory. Similarly, during a get instruction, if the condition code is not satisfied, the processing element will not read in new data. Thus, the same network is used to produce gaps between data elements to account for PEs that do not request new data. Figure 18 shows the compactor; it is also a network with log2(n) stages.

Figure 18: Compactor network

3.5 Assembler

An assembler has been written in C++ for converting programs written in the xstream assembly language into instruction opcodes. The assembler also needs to be used in conjunction with the data assembler. In our project, we have not completed an external memory interface that would allow us to refresh the data values stored in SRAM. Therefore, when we run simulations, we need to assemble the data values as well: both the channel descriptors and the input data have to be specified. The final output of the assembler and the data assembler is the opcodes for the instructions and the actual values that are stored in the SRAM banks.

3.6 External Memory

3.6.1 Overview

The memory subsystem was designed to work as a three-layer system, as depicted in Figure 19. The SDRAM model was obtained from McGill University. Attached to it is the controller that handles row, column and bank decoding, RAS and CAS timings, as well as the initial load and precharge times. The controller supports page reads and writes for high-bandwidth block transfers, which are necessary given the streaming architecture of the processor.

Figure 19: An overview of the external memory architecture (the SDRAM model, memory controller, and processor, connected by the controller-RAM and processor-controller buses)

3.6.2 SDRAM Model

The SDRAM model has three basic modes. First is the uninitialized mode, before values have been loaded from disk. Next is the powerup mode, where power is applied to

memory. Last is the command mode, which indicates that the memory is ready to receive commands.

Figure 20: SDRAM Component

The commands that can be issued to the SDRAM model are presented below. They are issued to the SDRAM model via specific selections of the scsn, scasn, srasn, sa[10] and swen signals [19].

Table 7: Command Listings

Command  Description
DESL     No command
NOP      Do nothing
READ     Read burst of data
WRITE    Write burst of data
ACT      Activate row
PRE      Precharge selected bank
PALL     Precharge all
MRS      Mode register select
REF      Refresh

3.6.3 Memory Controller

The block diagram for the memory controller is given in Figure 21. The controller operates on two clocks, each corresponding to a separate bus. It accepts a clock input from the processor and has an arbitrarily set clock (via generics) for the controller-to-memory bus (not shown). This bus has been set to 100 MHz. The procclk signal is the clock between the controller and the processor.

The memory controller takes in a 23-bit address, of which 13 bits are the row address and 8 are the column address; the remaining two select the bank. Page size was set to 256 words. The data bus itself is 64 bits wide, which means that the RAM has a total capacity of 64 MB. The r and w lines correspond to the read and write functions. Blocklen is the length of the burst transfer, and resetn is the negated (active-low) reset line. The receive_rdy line indicates that the processor is ready to receive data from the controller, while the blockrdy line indicates that a block is ready for burst transfer. The ready line indicates that the memory controller is ready to receive data from the processor.
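A sketch of the address split in Python; the placement of the bank bits at the top of the address is our assumption, since the report only gives the field widths:

```python
def decode_address(addr: int):
    """Split the 23-bit processor address into bank, row and column
    (2 + 13 + 8 bits; the 8 column bits give the 256-word page size)."""
    col = addr & 0xFF                # bits 7..0
    row = (addr >> 8) & 0x1FFF       # bits 20..8
    bank = (addr >> 21) & 0x3        # bits 22..21 (assumed position)
    return bank, row, col

# 2**23 addresses x 64-bit words = 64 MB of total capacity
print(decode_address(0b10_1010101010101_11110000))
```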

Figure 21: Memory Controller Component

The controller states and their functions are presented in Figure 22 and Table 8.

Figure 22: Memory Controller State Diagram

Table 8: Description of states in memory controller diagram

State            Description
mload            Initial state; loading from disk
mprecharge       Precharge delay
mmrs             Mode register select state
mrefresh         Row refreshing
mactive          Row ready for column access
mread            Currently in burst read
mwrite           Currently in burst write
mnop             No operation
mpagereaddelay   Page read delay
mpagewritedelay  Page write delay
mwritewait       Delay while internal controller buffer is filled

4. Applications

There are three applications that we wrote to test the xstream processor. The three programs show that the instruction set of the xstream processor accommodates most tasks that can be broken up into parallel streams. Some of the tests were fairly simple, as they only exercised the basic functionality of the processor; this was sufficient to ensure that we met all the requirements. The first is a fairly simple program that normalizes a set of vectors. The second test was to generate a fractal image. The third test is a much larger project that is still in the works: it carries out the process of rasterization.

4.1 Normalize

The task was to normalize a set of vectors. We set up three input channels and three output channels. The output provided at the end of the simulation gave us the normalized vectors, which we verified with a calculator. The kernel is sketched below.
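For illustration, here is the normalize test expressed as a Python kernel over three input and three output channels (the names are ours; on the hardware the per-record body maps naturally onto MUL, ADD, SQRT, and DIV instructions):

```python
import math

def normalize_kernel(xs, ys, zs):
    """One record <x, y, z> in, one normalized record out; every record
    is independent, so the PEs can process them in parallel."""
    out_x, out_y, out_z = [], [], []
    for x, y, z in zip(xs, ys, zs):
        inv = 1.0 / math.sqrt(x * x + y * y + z * z)
        out_x.append(x * inv)
        out_y.append(y * inv)
        out_z.append(z * inv)
    return out_x, out_y, out_z

print(normalize_kernel([3.0], [0.0], [4.0]))   # ([0.6], [0.0], [0.8])
```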

4.2 Fractal Generation

The second algorithm was used to produce a Mandelbrot fractal image. This is an algorithm where the data set produced is larger than the data set used as input. The output data was used to generate the final image using a Python program. The image generated by the simulation is shown in Figure 23.

Figure 23: Fractal image generated by simulation

4.3 Rasterization

Lastly, a rasterization program was attempted on the processor as well. Rasterization is the process by which a scene is parsed to determine what colour each pixel should have on the screen. There are many ways of carrying out the process; the algorithm used in our case was one that is well suited to stream processing architectures. The entire process is broken up into different stages, each of which is simulated on the processor. This is one of the better demonstrations that the xstream processor is fit for real-world applications: rasterization is carried out every day by the graphics cards in our computers.