

University of Waterloo
Faculty of Engineering
Department of Electrical and Computer Engineering

xstream Processor

Group #057
Consultant: Dr. William Bishop

Andrew Clinton ( )
Sherman Braganza ( )
Alex Wong ( )
Asad Munshi ( )

Jan 14, 2005

Abstract

Stream processing is a data processing paradigm in which long sequences of homogeneous data records are passed through one or more computational kernels to produce sequences of processed output data. Applications that fit this model include polygon rendering (computer graphics), matrix multiplication (scientific computation), 2D convolution (media processing), and encryption. Computers that exploit stream computations can process data much faster than conventional microprocessors because they have a memory system and execution model that permit high on-chip bandwidth and high arithmetic intensity (operations performed per memory access). We have designed a general-purpose, parameterizable, SIMD stream processor that operates on 32-bit IEEE floating point data. The system is implemented in VHDL and consists of a configurable FPU, an execution unit array, and a memory interface. The FPU supports fully pipelined operations for multiplication, addition, division, and square root, with a configurable data width. The execution array operates in lock-step with an instruction controller, which issues 32-bit instructions to the execution array. To exploit stream parallelism, we have exposed parameters to choose the number of execution units as well as the number of interleaved threads at the time the system is compiled. The memory system allows all execution units to access one element of data from memory in every clock cycle. All memory accesses also pass through an inter-unit routing network, supporting conditional reads and writes of stream data. We have performed clock cycle simulation of our design using various benchmark programs, as well as synthesis to an Altera FPGA to verify component complexity. The eventual goal for this project is for the design to be synthesizable on a large FPGA or ASIC.

Acknowledgements

We would like to thank our supervisors Prof. William Bishop and Prof. Michael McCool, who have helped us to understand the complex field of computer hardware design; our project would not have been possible without their kind assistance. We also wish to thank the creators of the Symphony EDA VHDL simulation tools, which have made it possible for us to develop our system extremely rapidly at minimal cost.

Table of Contents

Abstract
Acknowledgements
List of Figures
List of Tables
1. Introduction
2. High Level Design
   2.1 Stream Processing on the xstream Processor
3. Detailed Design
   3.1 Floating Point Unit
       3.1.1 Hardware Modules
       3.1.2 Denormalizing
       3.1.3 Addition
       3.1.4 Multiplication
       3.1.5 Division
       3.1.6 Square Root
       3.1.7 Rounding and Normalizing
       3.1.8 Integer and Fraction Extraction
       3.1.9 Conditional Floating Point Logic Comparator
       3.1.10 Integer Arithmetic Unit
       3.1.11 Synthesis
   3.2 Instruction Set Architecture
       3.2.1 Word Size
       3.2.2 Bus Structure
       3.2.3 Storage Structure
       3.2.4 Register Windowing
       3.2.5 Pipeline
       3.2.6 Writeback Queue
       3.2.7 Multithreading
       3.2.8 Conditionals
   Instruction Set
   Instruction Controller
   Memory System
       Channel Descriptors
       Routing Interface
       Routing Network
       Barrel Shifter
       Compactor
   Assembler
   External Memory Overview
       SDRAM Model
       Memory Controller
Applications
   Normalize
   Fractal Generation
   Rasterization
Discussion and Conclusions
   Future Possibilities
       FPU Optimizations
       Synthesis
       Off-chip DRAM
References

List of Figures

Figure 1: xstream Processor Block Diagram
Figure 2: Component depiction of FPU
Figure 3: Diagram depicting four stage pipeline used in FPU
Figure 4: Denormalize Component
Figure 5: Addition Component
Figure 6: Multiplication Component
Figure 7: Division Component
Figure 8: Square root Component
Figure 9: Rounding and Normalizing Component
Figure 10: Integer and Fraction Extraction Component
Figure 11: Comparator Component
Figure 12: Integer Arithmetic Unit Component
Figure 13: Execution Unit Component Diagram
Figure 14: Execution Controller Component Diagram
Figure 15: Interleaved Multithreading
Figure 16: Description of channels in SRAM
Figure 17: Barrel Shifter network
Figure 18: Compactor network
Figure 19: An overview of the external memory architecture
Figure 20: SDRAM Component
Figure 21: Memory Controller Component
Figure 22: Memory Controller State Diagram
Figure 23: Fractal image generated by simulation

List of Tables

Table 1: Functional Unit Latencies
Table 2: Instruction Set
Table 3: Functional Unit Identification
Table 4: General instruction format
Table 5: Conditional Register format
Table 6: Condition Codes
Table 7: Command Listings
Table 8: Description of states in memory controller diagram

1. Introduction

Many applications of computers contain a high degree of inherent data parallelism. Such applications include graphics rendering, media processing, encryption, and image processing algorithms, each of which permits operations on different data elements to proceed in parallel. A fundamental problem of computer hardware design is to expose as much of this parallelism as possible without compromising the desired generality of the system. One such approach is stream processing, in which data parallelism is exposed by processing data elements independently and in parallel. In stream processing, each record in a stream is operated on independently by a kernel, a small program applied to each data element, potentially in parallel. Applying a kernel to a stream produces another stream. Significantly, the order in which streams and kernels are applied is very flexible, allowing the order in which operations are performed to be optimized. In this way, stream processing can be implemented in a way that minimizes resource use in a hardware design, allowing more computations to be performed using less area in a VLSI design and less memory bandwidth. Most past research in stream processing has been pursued at Stanford University in the development of the Imagine stream processor ([1], [2], [3], [4], [5]). The Imagine is a VLIW stream processor developed primarily for media processing, with a prototype ASIC implementation. The Imagine supports a 3-level memory hierarchy including off-chip RAM, a high-bandwidth stream register file, and high-speed local registers contained in each processing element. Research at Stanford has investigated the implementation of conditional streams [3], in which conditionals are implemented in a SIMD architecture through multiple passes and on-chip routing.
Research has also investigated low-level (kernel) scheduling for VLIW instructions and high-level (stream control) scheduling [5] for placing and moving streams within the stream register file. Research has also been pursued in configurable parallel processors at the University of Texas at Austin for the TRIPS processor [6].

Parallel processing has also undergone rapid development in the area of graphics processors (GPUs). The PixelFlow [7] system was one of the first to take advantage of parallel processing over rendered pixels. Current-generation GPUs now support highly flexible instruction sets for manipulating pixel shading information, while exploiting fragment parallelism. Hardware features have even been proposed to make GPUs more like stream processors ([8], [9], [10]). Many researchers have recognized the potential performance exposed by GPUs and have ported a wide range of parallelizable algorithms to the GPU, including linear algebra operations, particle simulations, and ray tracing ([11], [12]). The introduction of general purpose GPU programming has led to the development of several high-level tools for harnessing the power of these devices. These attempts include the development of Brook for GPUs [13] (a project from Stanford, aimed at exposing the GPU as a general stream processor) and Sh ([14], [15], [16]) (a project at the University of Waterloo, which targets GPUs from ordinary host-based C++ code, an approach called metaprogramming). Our project aims to merge results in these two areas by developing a stream processor that is compatible with existing APIs for GPU programming. In particular, our device allows most of the Sh system to be executed in hardware, including single precision floating point support. Our system also supports an inter-unit routing system allowing us to support conditional streams. We have designed our system to be efficient, modular, and simple to program.

2. High Level Design

A high-level block diagram of the processor is shown in Figure 1. There are several key components of the design, including a floating point unit (embedded within each of the processing elements), the execution units themselves, and an on-chip memory system consisting of an inter-unit routing network and fast stream register file. The DRAM and host are components that would be integrated eventually for a full design.

Figure 1: xstream Processor Block Diagram

2.1 Stream Processing on the xstream Processor

The design implements the stream processing model by enforcing a restricted memory access model. All memory access from the execution unit array must pass through the stream register file, and all memory access must read or write at the head of the stream. The stream register file stores streams as a sequence of channels; each channel stores a component of a stream record. The channels are configured at startup of the device, and are dynamically read and written during execution. Data parallelism is exploited in two major ways in the design. First, data elements can be processed in parallel by an array of processing elements. This results in a speedup of roughly N times for an array containing N processors, as there is little communication overhead for independent stream accesses. As the number of processors is increased, the memory and routing network automatically scale to match the new array size (with the routing cost rising only as N·log N). Second, interleaved multithreading is used within each processing element with an arbitrary number of threads, trading instruction level parallelism for the available (unlimited) data parallelism. The execution model allows for conditional execution of read or write instructions to the memory bus. The system automatically routes stream data to the correct memory bank when processors are masked. Execution units are fully pipelined, so (with multithreading) a throughput of 1 instruction per cycle is almost always achieved.
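As a behavioral illustration of this model (a Python sketch with made-up kernel names, not part of the hardware design), a kernel maps each stream record to an output record independently, so applying a kernel to a stream yields a new stream:

```python
# Behavioral sketch of the stream programming model. Kernel and stream
# names here are illustrative, not taken from the report.

def apply_kernel(kernel, stream):
    """Apply a kernel to every record of a stream, producing a new stream."""
    return [kernel(record) for record in stream]

# Kernels compose: the output stream of one kernel feeds the next.
def scale(record):
    return record * 2.0

def offset(record):
    return record + 1.0

data = [1.0, 2.0, 3.0]
print(apply_kernel(offset, apply_kernel(scale, data)))  # [3.0, 5.0, 7.0]
```

Because each record is handled independently, the per-record work could be distributed across N processing elements with no inter-element communication, which is exactly the property the execution unit array exploits.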

3. Detailed Design

This section describes in detail each of the components in the system. It constructs the entire design starting with the low-level components (floating point) and proceeding through the execution unit and instruction set, and finally to the memory system and routing network.

3.1 Floating Point Unit

The floating point arithmetic unit detailed in this document is based on a core parameterized floating point library developed at the Rapid Prototyping Lab (RPL) [17]. A number of extensions were made to the existing core library to provide additional arithmetic operations such as floating point division and square root. Furthermore, a parameterized floating point arithmetic unit interface was created to provide a single core component for general interfacing with the rest of the processor architecture. Rather than developing the floating point unit from the ground up, we decided it was more efficient to build upon an existing core and add the functionality needed for the project. The following floating point libraries were evaluated:

- the RPL library of variable floating point modules
- the Single Precision Floating Point Unit Core Project from Portland State University
- FPLibrary, a library of floating point modules from the Arénaire project
- FPHDL, a floating point library by Professor David Bishop

The FPLibrary implementation was discarded early on due to the high complexity of the design, making it virtually incomprehensible for use in our project. The implementation from Portland State was also discarded early on for a number of reasons. First of all, the synthesized performance of the unit was below that of the RPL and FPHDL

implementations. Secondly, the core was hard-coded to perform only single precision floating point arithmetic, constraining future expansion to double precision calculations. Most importantly, the unit failed to provide correct results when calculating simple test values. Between the RPL library and the FPHDL implementation, it was decided that the RPL library was most appropriate for our purposes for a number of reasons. First of all, the FPHDL implementation is still at a relatively early stage, so stability and reliability issues may arise from using it. Secondly, its VHDL code is structured very generically rather than in a hardware-centric way: the operations are not pipelined or area-conscious, and would require a high-end synthesis tool such as Synopsys to generate hardware optimized for performance. The RPL library, on the other hand, was designed to be very hardware-oriented and provides a fully pipelined structure that is efficient and parameterized.

3.1.1 Hardware Modules

The FPU core used in the project has the following features:

- implements single precision (32-bit) arithmetic
- implements floating point addition, subtraction (by negation), multiplication, division, and square root
- implements two of the four IEEE standard rounding modes: round to nearest and round to zero
- exceptions are partially implemented and reported according to the IEEE standard

The above specification was decided based on a number of important design decisions. For the purpose of the project, the parameterized floating point core unit was configured to comply with the IEEE 754 standard for single precision floating point representation. Therefore, the number of mantissa bits and exponent bits are set to 23 bits and 8 bits respectively, with a single sign bit. This provides us with the level of precision

needed to perform basic floating point calculations, while still being manageable in terms of size and testing effort. Operations on denormalized floating point numbers are not supported because, based on research done by Professor Miriam Leeser at the Rapid Prototyping Lab, they would require a great deal of extra hardware to implement and are therefore not necessary for our purposes. The above list of floating point arithmetic operations was chosen because it provides the functionality needed for a subset of the Sh API, and because all other operations can be built upon it. Only two of the four IEEE standard rounding modes were implemented, as the additional rounding modes (+inf and -inf) were deemed unnecessary for our purposes. Finally, only part of the exception handling specified by the IEEE standard was implemented in the FPU core. The inexact exception is ignored, while all other exceptions generate zero values at the output of the FPU along with a general exception flag. This was chosen to provide compatibility with the Sh API.

Figure 2: Component depiction of FPU

The unit takes in two such 32-bit floats as input. In order for the unit to begin execution, the ready line must also be set high. An extra line has been provided for exception input to handle the case where multiple FPUs are chained together. There is a clock line and a mode input that selects the current operation, as listed below. The output is written to the OUT1 line as a 32-bit standard single precision float. In addition, in the event of a divide by zero or the square root of a negative number being taken, the exception output line is also driven high. Finally, there is a line to indicate when the

current instruction is completed, although this may prove unnecessary in the future. To set the floating point operation to perform on the FPU, the MODE bits are set in the following manner:

1) Add (Mode lines set to 0000)
2) Multiply (Mode lines set to 0010)
3) Divide (Mode lines set to 0011)
4) Square Root (Mode lines set to 0100)

The Mode input is translated into an internal one-hot encoding scheme for selecting the appropriate arithmetic units to enable. The FPU has been designed to operate using a six-stage pipeline. As shown in Figure 3, operands A and B are first denormalized into 33-bit floats, then pass through the selected FP operation (addition, multiplication, division, or square root, each taking 4 stages, with square root accepting only operand A), and are finally rounded and normalized into IEEE floating point representation (2 stages), with an exception generator producing the exception output alongside the 32-bit result.

Figure 3: Diagram depicting four stage pipeline used in FPU

The first four stages comprise the denormalizing of the floating point numbers and the arithmetic operation. The last two stages involve the rounding and normalizing of the floating point value. The names and functions of the hardware modules used by the unit are listed below:

- denorm (denormalize floating point numbers)
- rnd_norm (normalize and round results)
- fp_add (perform addition on two floats)
- fp_mul (perform multiplication on two floats)
- fp_div (perform division on floats)
- fp_sqrt (generate the square root of a float)

3.1.2 Denormalizing

Figure 4: Denormalize Component

In the normalized format of a floating point number, a single non-zero bit is used to represent the integer part of the mantissa. In the IEEE 754 normalized binary representation, however, this bit is discarded to save space, as it is implied to be there. This implied 1 is nevertheless required to perform proper arithmetic operations. The task of the denormalizing module is therefore to add the implied 1 back into the binary representation. The output of the module is 1 bit wider than the input value. The denormalizing module is asynchronous and purely combinational.
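A minimal Python sketch of this step (illustrative only; the real module is combinational VHDL) splits the 32-bit pattern and re-inserts the implied 1:

```python
# Behavioral sketch of the denormalize step for a single-precision float:
# re-insert the implied leading 1 so the mantissa can be used in integer
# arithmetic. Field widths follow IEEE 754 single precision (1/8/23).

def denormalize(bits: int) -> tuple:
    """Split a 32-bit float pattern into (sign, biased exponent, 24-bit mantissa)."""
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    # Normalized numbers get the implied 1 back; the output mantissa is
    # one bit wider than the stored fraction, as noted above.
    mantissa = (1 << 23) | fraction if exponent != 0 else fraction
    return sign, exponent, mantissa

# 1.5 = 0x3FC00000: sign 0, biased exponent 127, mantissa 1.1 in binary
print(denormalize(0x3FC00000))  # (0, 127, 12582912), i.e. 0xC00000
```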

3.1.3 Addition

In the addition unit, floating point addition is performed in the following way:

1) align the mantissas of the input values
2) perform integer addition on the mantissas
3) if overflow occurs, shift the resultant mantissa to the right by one bit and increment the exponent by one
4) merge the new mantissa and exponent components back into a proper floating point representation

Subtraction is handled as addition by a negative number. This module is pipelined and requires four cycles to perform an operation.

Figure 5: Addition Component

3.1.4 Multiplication

In the multiplication unit, floating point multiplication is performed in the following manner:

1) perform integer addition on the exponents
2) perform integer multiplication on the mantissas
3) subtract the bias from the adjusted exponent
4) merge the new mantissa and exponent components back into a proper floating point representation

This module is pipelined and requires four cycles to perform an operation.
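The multiplication steps above can be sketched behaviorally in Python on operands in the denormalized (sign, biased exponent, 24-bit mantissa) form. This is an illustration of the algorithm only, not the pipelined VHDL module, and it truncates guard bits instead of rounding:

```python
# Behavioral sketch of floating point multiplication on denormalized
# operands. Illustrative only; rounding is left to the later pipeline
# stage, so guard bits are simply truncated here.

BIAS = 127

def fp_mul_core(a, b):
    sa, ea, ma = a
    sb, eb, mb = b
    # 1) add the exponents, 3) subtract the bias (it is counted twice)
    exponent = ea + eb - BIAS
    # 2) multiply the mantissas: 24 bits x 24 bits -> 48-bit product
    product = ma * mb
    # Renormalize so the leading 1 sits at bit 23 again
    if product >> 47:          # product in [2, 4): shift right, bump exponent
        product >>= 1
        exponent += 1
    mantissa = product >> 23   # drop guard bits (no rounding in this sketch)
    return (sa ^ sb, exponent, mantissa)

# 1.5 x 1.5 = 2.25
print(fp_mul_core((0, 127, 0xC00000), (0, 127, 0xC00000)))
```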

Figure 6: Multiplication Component

3.1.5 Division

In the division unit, floating point division is performed in the following manner:

1) perform integer subtraction on the exponents
2) perform integer division on the mantissas
3) add the bias to the adjusted exponent
4) merge the new mantissa and exponent components back into a proper floating point representation

The integer division was implemented using the non-restoring division algorithm [18]. This module is pipelined and requires four cycles to perform an operation.

Figure 7: Division Component
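The non-restoring approach used for the mantissa divide (and, in its square root variant, in the next section) produces one result bit per iteration and never restores a negative partial remainder mid-loop. The following Python sketch is behavioral only, not the VHDL implementation:

```python
# Behavioral sketch of non-restoring integer division: one quotient bit
# per iteration; a negative partial remainder is corrected by adding the
# divisor on the next step rather than restoring immediately.

def nonrestoring_divide(dividend: int, divisor: int, nbits: int):
    """Return (quotient, remainder); requires dividend < divisor << nbits."""
    remainder, quotient = 0, 0
    for i in range(nbits - 1, -1, -1):
        bit = (dividend >> i) & 1
        if remainder >= 0:
            remainder = (remainder << 1) + bit - divisor
        else:
            remainder = (remainder << 1) + bit + divisor
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:            # single final correction step
        remainder += divisor
    return quotient, remainder

print(nonrestoring_divide(13, 3, 4))  # (4, 1)
```

One quotient bit per cycle maps naturally onto a fully pipelined hardware unit, which is why this family of algorithms suits the FPU's divide and square root stages.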

3.1.6 Square Root

In the square root unit, the floating point square root operation is performed in the following manner:

1) perform integer subtraction on the exponents
2) subtract the bias from the exponent
3) shift the unbiased exponent to the right by 1 bit (essentially performing a divide by 2 operation)
4) perform integer square root on the mantissa
5) merge the new mantissa and exponent components back into a proper floating point representation

The integer square root was implemented using the non-restoring square root algorithm. This module is pipelined and requires four cycles to perform an operation.

Figure 8: Square root Component

3.1.7 Rounding and Normalizing

Due to the introduction of guard bits when performing arithmetic operations on floating point values, it is necessary to normalize the resulting floating point value. To normalize the value, the implied 1 that was added during the denormalizing process is removed. Secondly, the width of the mantissa is reduced to comply with the IEEE 754 standard for single precision floating point representation. This form of truncation can lead to errors which can compound over a number of arithmetic iterations, so rounding is performed

based on either round to nearest or round to zero, depending on the ROUND input signal. The output of the module is a single-precision normalized value. This module requires two cycles to perform the operation.

Figure 9: Rounding and Normalizing Component

3.1.8 Integer and Fraction Extraction

In the floating point component extraction unit, the integer and fraction extraction operations are performed in the following manner:

1) Set up the exponent bias based on the bit length of the exponent component of the floating point value
2) Determine the actual exponent by subtracting the bias from the exponent component of the floating point value
3) Create a bit mask for masking out the fractional component
4) Extract the integer value from the mantissa component using the bit mask
5) Extract the fractional value from the mantissa component using the inverse of the bit mask
6) Reconstruct the denormalized floating point representation of the fractional component by appending the sign and exponent bits of the original floating point value to the extracted fractional value

7) Reconstruct the normalized floating point representation of the integer component by appending the sign and exponent bits of the original floating point value to the extracted integer value
8) Pad the mantissa component of the extracted fractional value for normalization
9) Normalize the fraction using the normalization unit

This module requires one cycle to perform the operation.

Figure 10: Integer and Fraction Extraction Component

3.1.9 Conditional Floating Point Logic Comparator

The conditional floating point logic comparator is used to perform logical comparisons between two floating point values. The output of the comparator is an 8-bit condition code that is stored in the condition code registers. The following comparisons are performed on the two floating point values:

1) equal to
2) not equal to
3) less than
4) less than or equal to
5) greater than or equal to
6) greater than

This module requires one cycle to perform the operation.
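A Python sketch of the comparator's condition-code generation follows. The bit ordering used here is an assumption made for illustration; this section of the report does not specify the hardware's actual bit assignment:

```python
# Illustrative comparator sketch: pack the six comparison results into a
# condition-code byte. The bit positions below are assumed, not taken
# from the hardware specification.

def compare(a: float, b: float) -> int:
    flags = [a == b, a != b, a < b, a <= b, a >= b, a > b]
    code = 0
    for i, flag in enumerate(flags):
        code |= int(flag) << i
    return code

print(bin(compare(1.0, 2.0)))  # ne, lt, le set -> 0b1110
```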

Figure 11: Comparator Component

3.1.10 Integer Arithmetic Unit

An optional integer arithmetic unit is also available for integration into the processor. This unit can be used in situations where a complex floating point unit would not be necessary, such as in certain compression and encryption algorithms where only integer arithmetic is used. Integrating only integer arithmetic units in such situations would be beneficial from both a space and a performance perspective. The unit takes in two (or three) 32-bit signed integer values as input, depending on the operation. There is a clock line and a mode input that selects the current operation, as listed below. The output is written to the output line as a 32-bit standard signed integer value. To set the integer operation to perform, the MODE bits are set in the following manner:

1) Add (Mode lines set to 0000)
2) Subtract (Mode lines set to 0001)
3) Multiply (Mode lines set to 0010)
4) Divide (Mode lines set to 0011)
5) MAC (Mode lines set to 0101)

The Mode input is translated into an internal one-hot encoding scheme for selecting the appropriate arithmetic units to enable. This module requires one cycle to perform an arithmetic operation.
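The mode dispatch above can be sketched behaviorally as follows. The divide's rounding toward zero is an assumption for illustration, since the report does not state the hardware's rounding behavior:

```python
# Sketch of the integer unit's mode dispatch, including the three-operand
# MAC (multiply-accumulate). Mode encodings follow the list above; the
# divide's rounding direction is an assumption.

def int_alu(mode: int, a: int, b: int, c: int = 0) -> int:
    if mode == 0b0000:   # add
        return a + b
    if mode == 0b0001:   # subtract
        return a - b
    if mode == 0b0010:   # multiply
        return a * b
    if mode == 0b0011:   # divide (assumed to truncate toward zero)
        return int(a / b)
    if mode == 0b0101:   # MAC: a * b + c
        return a * b + c
    raise ValueError("unsupported mode")

print(int_alu(0b0101, 3, 4, 5))  # 17
```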

Figure 12: Integer Arithmetic Unit Component

3.1.11 Synthesis

The floating point unit has been tested and synthesized for the Altera Stratix EP1S10 using Quartus II versions 3.0 and 4.0 SP. Synthesis reported the design's size in Logic Elements (LEs) and a clock speed of 89.5 MHz, with a total pin count of 107 and 8 of the 48 available built-in multipliers used.

3.2 Instruction Set Architecture

This section describes the hardware design at the instruction set level. Several design goals have guided this phase of the development, including:

- compatibility with the Sh API
- instruction set scalability and orthogonality
- latency hiding for various levels of parallelism
- interleaved multithreading support

The component diagram of an execution unit is shown in Figure 13. Note that each of the bus connections is multiplexed by the control signals arriving from the execution controller.

Figure 13: Execution Unit Component Diagram

The component diagram below shows the execution controller. The execution controller also includes a high-speed instruction cache for kernels.

Figure 14: Execution Controller Component Diagram

3.2.1 Word Size

Most graphics processors support quad 32-bit floats as the basic data type. The operations that can be performed on the 32-bit quantities include component-wise multiplication,

addition, and dot products, with arbitrary negation and swizzles (vector component reordering). For graphics processing, single-cycle execution of these instructions permits very fast execution of vertex transformations and lighting calculations, which make effective use of the 4-way parallel operations. Sh also supports quad-floats as one of its basic data types. The alternative to using quad-floats as a basic data type is to emulate vector operations with scalar operations at the hardware level. For applications making full utilization of the quad-float data pipe, this emulation will result in a drop in performance, as additional hardware is needed per floating point pipeline. However, for operations on streams of one to three floating point values, a single floating point pipeline could increase performance by improving utilization of the execution units. Our design uses a single floating point pipeline within each execution unit. This has the added advantage of reducing the minimum required area for synthesizing the design to an FPGA.

3.2.2 Bus Structure

Allowing a single instruction to complete every cycle requires sufficient register bandwidth for 2 input operands to be sent to the floating point unit and 1 output operand to be received from it every cycle. This requires a 3-bus structure with 2 input buses and one output bus. One input bus and the output bus are also shared with the external routing interface for loading and storing external data to the register file.

3.2.3 Storage Structure

Each execution unit requires a local register file, which must be fast (single-cycle latency) and must support a significant number of registers. Since we are using a 3-bus structure, the register file must have 2 input and 1 output ports. A larger number of registers is required due to the limited access to memory in a stream architecture: all constants, results, and temporaries must be stored in the register file. In addition to local registers, many algorithms may require extended read-only storage for indexed constants or larger data structures. These can be stored in a block of read-only SRAM within each execution unit, into which indexed lookups may be performed. Data in this local memory must be transferred to registers before it can be operated on. These indexed memories can use a single input and output port. We have implemented a single register file with a variable number of 32-bit floating point registers. The instruction word allows up to 256 floating point registers in the future. The indexed memory could be larger (perhaps 1024 registers).

3.2.4 Register Windowing

To increase the number of effective registers, it is possible to use register windows, in which parts of a single large register file are exposed depending on the current window location. For example, the register file could hold a larger number of 32-bit floating point registers but expose only 64-word windows, which reduces the complexity of the logic in the register file as well as the width of register references in instruction words. Register windowing is typically used to prevent register spilling (that is, preventing registers from being stored into main memory when a procedure is called). Register windowing is useful to a stream processor because the amount of temporary storage allowed by a fixed register file may be more limited than what is demanded by some applications. However, the addition of register windowing to the hardware design should not require large changes (the addition of register window manipulation instructions and compiler support). For this reason we have elected not to

support register windowing in the initial design, while it may be included in future revisions.

3.2.5 Pipeline

The arithmetic units used within each execution unit include the components shown in Table 1. There are two different latencies for the units in the design, resulting in a complication when trying to pipeline the entire design. To create the pipeline for these units, we were faced with the choice of either fixing the pipeline length at the longer of the two latencies (4 execute stages) and incurring a performance penalty for the shorter-latency instructions, or supporting both latencies in separate pipelines and automatically resolving the structural hazards arising from this configuration. We chose a design that allows for multiple variable pipeline lengths.

Table 1: Functional Unit Latencies

  Component                                      Latency
  Floating point ALU                             4 cycles
  Floating point comparator                      1 cycle
  Floating point fractional / integer converter  1 cycle
  Indexed load/store                             1 cycle

To understand the design, it is necessary to consider all possible pipeline hazards.

1. Structural hazards: A structural hazard arises when a single hardware unit is needed simultaneously by multiple sources. By allowing multiple instruction latencies in the execution stage, we introduce a structural hazard on the writeback bus.

2. Control hazards: Control hazards arise when a conditional branch instruction causes instructions that have already been partially completed to be invalidated. The stream processor does not implement conditional branch instructions, so there is no chance of a control hazard arising.

3. Data hazards: There are 3 variants of data hazards: read after write (RAW), write after read (WAR), and write after write (WAW). Without further restrictions on instruction completion order, each of these types of data hazards is possible in the stream processor. To avoid WAR and WAW hazards, it is sufficient to enforce in-order writeback of instructions. Avoiding RAW hazards requires an additional hazard detection unit.

To automatically solve all hazard detection problems (structural and data hazards), we have elected to design a hardware writeback queue to track instructions in the pipeline.

3.2.6 Writeback Queue

The simplicity of the execution unit design ensures that the only state-changing operation that can occur is the modification of a register. Registers can only be modified in the common writeback stage for all instructions. The writeback queue makes sure that 1) instructions are issued only when there is no conflict on the writeback bus, and 2) instructions are issued only when there is no pending instruction that needs to write to one of the current read registers. Within the queue are a number of registers. Each register in the queue stores the 8-bit destination register as well as an enable bit for each pending writeback operation. The queue slots are numbered from 0 to 7, with slot 0 containing the instruction that is currently present in the writeback stage. Newly issued instructions are inserted in the queue slot determined by the latency of the instruction. In our case, instructions are placed into the queue during the decode stage, so FPU instructions are added to slot 5 while single-cycle execute instructions are added to slot 2. The queue is a shift register with all registers being shifted towards 0. The stall signal is issued under two conditions:

1. When the desired slot in the queue is already enabled, a conflict has been detected on the bus.

2. When any enabled queue register contains a writeback register address equal to one of the registers being read, a data dependency needs to be avoided.

The hardware complexity needed to detect the first condition is trivial. For the second condition, a comparator is required for every queue slot and every read register, so in the case of 8 queue slots and 2 possible read registers, the number of 8-bit comparators required is 16.

Multithreading

The most important characteristic of the stream programming model is that each element can be operated on in parallel. To increase utilization of the pipeline and reduce the possibility of processor stalls, we can exploit this characteristic by implementing multiple threads of execution within a single execution unit. There are several methods of implementing multithreading, ranging from very fine-grained (switching threads on every clock cycle) to coarse-grained (switching threads after a fixed time interval, or only on costly operations such as cache-miss reads). Since no operation in our design has a latency greater than 7 cycles, an efficient means of decreasing the effective latency is to support fine-grained interleaved multithreading.

Interleaved multithreading is supported by duplicating the state-holding components in the design for each thread; in our case, each register file needs to be duplicated for each thread. In addition, the control hardware needs to broadcast the currently active thread for each of the pipeline stages: decode, execute, and writeback.
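As an illustrative sketch (Python rather than VHDL; the class, function names, and parameter values here are ours for exposition, not part of the design), the per-thread register file duplication and cycle-by-cycle thread selection can be modelled as:

```python
# Functional model of fine-grained interleaved multithreading: only the
# register file is duplicated per thread; the pipeline is shared, with a
# thread identifier carried down each stage.

NUM_THREADS = 4      # assumed compile-time parameter (exposed via generics)
NUM_REGS = 8

class ThreadedRegisterState:
    def __init__(self):
        # One register file per thread -- the only duplicated state.
        self.regfiles = [[0.0] * NUM_REGS for _ in range(NUM_THREADS)]

    def read(self, thread_id, reg):
        return self.regfiles[thread_id][reg]

    def write(self, thread_id, reg, value):
        self.regfiles[thread_id][reg] = value

def active_thread(cycle):
    # Fine-grained interleaving: a different thread is selected every cycle.
    return cycle % NUM_THREADS

state = ThreadedRegisterState()
state.write(active_thread(0), 3, 1.5)   # cycle 0 -> thread 0
state.write(active_thread(1), 3, 2.5)   # cycle 1 -> thread 1
assert state.read(0, 3) == 1.5 and state.read(1, 3) == 2.5
```

Because consecutive instructions belong to different threads, an instruction's result is not needed until its thread comes around again, which hides part of the functional-unit latency.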

Figure 15: Interleaved Multithreading

To support multithreading, we have added an additional queue to the instruction controller that stores the active thread for each active cycle. When control signals are broadcast to each execution unit, the thread identifier is passed along so that the appropriate state-holding component is selected at the execution unit. Note that a different thread can be active in each pipeline stage, so when the signals are broadcast, the thread identifier needs to be duplicated for each of the decode, execute, and writeback stages of the pipeline.

Conditionals

Conditionals are handled in various ways in SIMD processors. In general, conditionals cannot be implemented directly in a SIMD processor because the processor array is required to remain in lockstep (i.e., each processor executes the same instruction on a given clock cycle). Traditionally, conditionals have been implemented by masking the execution of processors for which a condition fails. We have chosen to implement conditionals using the method of predication, in which all processors first evaluate all conditional branches and then conditionally write back their results into the register file.
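A minimal functional sketch of predicated writeback (illustrative Python, not the hardware description; the function name is ours):

```python
def predicated_writeback(regfile, rd, value, cond):
    # All processors compute `value` in lockstep; only those whose
    # Boolean condition is set commit the result to the register file.
    if cond:
        regfile[rd] = value
    return regfile

regs = [0.0] * 4
predicated_writeback(regs, 2, 7.0, True)
predicated_writeback(regs, 3, 9.0, False)   # masked: no state change
# regs == [0.0, 0.0, 7.0, 0.0]
```

The work of evaluating both sides of a conditional is always performed; predication only gates the state change, which is what keeps the array in lockstep.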

Conditionals are handled in the Sh intermediate representation through conditional assignment of 0.0 or 1.0 to a floating point value based on a comparison of two other floating point values. Although conditional support using floating point values is easy to implement, this approach consumes additional floating point registers to store Boolean results. A more efficient alternative is to support a separate set of single-bit registers for storing Boolean values. We have decided to implement Boolean operations using dedicated Boolean registers. This reduces the number of floating point instructions and relieves the floating point registers from storing Boolean values.

Instruction Set

The instruction set includes all the operations shown in Table 2, with the given latency and functional unit.

Table 2: Instruction Set

  Instruction  Latency  Functional Unit  Operation
  NOP          4        BUS              None
  ADD          7        FPU              Rd ← R1 + R2
  MUL          7        FPU              Rd ← R1 * R2
  DIV          7        FPU              Rd ← R1 / R2
  SQRT         7        FPU              Rd ← sqrt(R1)
  INT          4        FRAC             Rd ← floor(R1)
  FRAC         4        FRAC             Rd ← R1 - floor(R1)
  CMP          4        CMP              Cd ← compare(R1, R2)
  COND         4        BUS              Rd ← {R1 when C1}
  LDI          4        IDX              Rd ← index(R1)
  STI          4        IDX              index(R1) ← R2
  GET          4        BUS              {Rd ← route(imm) : C1}
  PUT          4        BUS              {route(imm) ← Rd : C1}
  CON          4        BUS              Rd ← scatter(imm)

Operation notation:

1. Rd denotes a destination register. R1 and R2 denote the two register inputs.
2. Cd denotes a destination Boolean register. C1 denotes an input Boolean register.
3. imm denotes an input immediate value.
4. compare(): performs the comparison operation described below.
5. index(): performs a lookup in the index constant table using the value of the given register as the index.
6. route(): invokes the inter-execution-unit routing network. In particular, it is used to compact outgoing data when some units do not produce output, and to route incoming data to the execution units needing new elements.

Functional units are the components used during the execute (EX) stage of pipeline execution. The available functional units are shown in Table 3.

Table 3: Functional Unit Identification

  Functional Unit  Description
  BUS              The operation uses only the internal execution unit buses
  FPU              The operation invokes the floating point unit
  FRAC             The operation uses the integer/fractional part extraction unit
  CMP              The operation uses the floating point comparator unit
  IDX              The operation uses the index constant table

Instruction Format

The generic instruction bit layout is:

Table 4: General instruction format

  Field     Bits    Width
  Opcode    31..27  (5)
  Rd or Cd  26..19  (8)
  (Unused)  18      (1)
  Negate 1  17      (1)
  R1        16..9   (8)
  Negate 2  8       (1)
  R2        7..0    (8)

The Negate 1 and Negate 2 bits allow free asynchronous negation of input values, so any instruction can negate its inputs. Instructions that use conditional registers have the following format:

Table 5: Conditional register format

  Field      Bits    Width
  Opcode     31..27  (5)
  Rd         26..19  (8)
  (Unused)   18..17  (2)
  C1 reg     16..12  (5)
  C1 imm     11..9   (3)
  condition  8..0    (9)

The condition register field determines which Boolean register should be used. Up to 32 independent Boolean registers are supported by this encoding. Each Boolean register stores 8 bits indicating the decoded condition codes. The possible conditions are shown in Table 6.

Table 6: Condition Codes

  Value  Condition
  0      Never satisfied
  1      Equal
  2      Not equal
  3      Less than
  4      Less than or equal
  5      Greater than or equal
  6      Greater than
  7      Always satisfied
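As a sketch of the general format, a hypothetical encoder might pack the fields as follows (Python for illustration only; the opcode value used below is invented, and the bit positions follow our reading of Table 4):

```python
def encode_general(opcode, rd, neg1, r1, neg2, r2):
    # Pack the fields of the general 32-bit format (Table 4):
    # opcode 31..27, Rd 26..19, unused 18, Negate1 17, R1 16..9,
    # Negate2 8, R2 7..0.
    assert opcode < 32 and rd < 256 and r1 < 256 and r2 < 256
    return ((opcode << 27) | (rd << 19) | (neg1 << 17) |
            (r1 << 9) | (neg2 << 8) | r2)

# Hypothetical ADD with opcode 1: Rd=5 <- R1=2 + (-R2=3), negating R2.
word = encode_general(1, 5, 0, 2, 1, 3)
assert (word >> 27) == 1 and ((word >> 19) & 0xFF) == 5
```

The 5-bit opcode and 8-bit register fields directly give the scalability limits quoted in the report: 32 operations and 256 local registers.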

This choice of instruction format allows scalability in the number of operations (up to 32) and in the size of the local register file (up to 256 entries).

Instruction Controller

The active kernel is stored in SRAM local to the instruction control hardware. The control hardware is a simple state machine that reads and issues instructions. Because there are no branching instructions, and all data dependencies are resolved at compile time, the controller simply steps through the program and issues one instruction per cycle. Instructions are decoded at the individual execution units, so the common instruction controller consists mainly of the state machine. The alternative, decoding instructions centrally at the instruction controller, would introduce a significant number of additional signals to be routed from the instruction controller to the execution units, so we elected not to use this approach.

3.3 Memory System

This section describes the on-chip memory system, including the stream register file (SRF), SRF controller, and inter-element routing network. The memory system was designed with a number of goals in mind:

- Direct support for the stream and channel primitives exposed by Sh
- Bandwidth scalability to support an array of execution units
- Hardware support for conditional stream operations

The stream register file stores the input, output, and temporary channels that are allocated during execution. A channel is a sequence of independent 32-bit floating point

values. For operations on records containing several 32-bit values, the records must first be decomposed into channels and stored into the SRF separately (for example, vectors <x,y,z> must be decomposed into an x-, y-, and z-channel before processing). The purpose of the stream register file is to provide a high-bandwidth on-chip storage component so that intermediate results can be stored and accessed quickly by the execution array. The SRF is composed of several memory banks that permit parallel access. The number of banks in the SRF is coupled to the number of processors that need to make parallel accesses to the SRF, so for a processing array containing 8 processors, it is necessary to provide 8 equal-sized banks in the SRF memory.

Channel Descriptors

Only a single channel in the SRF can be accessed at a time. When a channel is accessed, any number of processors can retrieve data from it. Since channels are always accessed in sequential order, it is necessary to keep track of the current address for access within each channel. This is done by storing an array of channel descriptors. Each channel descriptor consists of the current address and the final address within that channel. When the current address reaches the final address, the channel is empty, and any further references to that channel are rejected with an error.
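The channel descriptor bookkeeping can be sketched as follows (an illustrative Python model; the class and method names are ours):

```python
class ChannelDescriptor:
    # Each channel tracks only its current and final SRF address;
    # channels are always consumed in sequential order.
    def __init__(self, start, final):
        self.current = start
        self.final = final

    def next_address(self, count):
        # Advance by the number of execution units that requested data.
        if self.current >= self.final:
            raise RuntimeError("channel empty")  # further references rejected
        addr = self.current
        self.current += count
        return addr

ch = ChannelDescriptor(start=0, final=16)
assert ch.next_address(8) == 0   # 8 PEs each consume one element
assert ch.next_address(8) == 8   # next group of 8 elements
```

A third call would find the current address equal to the final address and reject the access, mirroring the error behaviour described above.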

Figure 16: Description of channels in SRAM

Routing interface

For interfacing with the routing hardware, it is sufficient for the stream register file to store the current address of each channel. When a routed memory read is performed, the memory interface decodes the low-order bits of the channel address to form an array of offset bits, one for each bank of the SRF. These offset bits indicate whether a given bank of memory will need to read from the current or the next line of the bank. When a memory read is performed, the number of execution units requesting new data determines the amount by which the memory address needs to be increased, and the next sequence of banks is used to retrieve the data.
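One way to picture the offset-bit computation (an illustrative Python model of our reading of the scheme, assuming channel elements are interleaved across banks in address order):

```python
def bank_offsets(addr, num_banks):
    # The low-order bits of the channel address determine, for each bank,
    # whether it serves the current line or the next line on this read.
    start = addr % num_banks      # bank holding the first requested element
    line = addr // num_banks      # current line within each bank
    return [line + 1 if bank < start else line for bank in range(num_banks)]

# 8 banks, channel address 11: banks 0..2 have already been consumed on
# line 1, so they fetch line 2, while banks 3..7 still serve line 1.
assert bank_offsets(11, 8) == [2, 2, 2, 1, 1, 1, 1, 1]
```

When all execution units request data, the channel address advances by the number of banks and the pattern rotates; when only some units request data, the address advances by fewer elements.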

3.4 Routing Network

The routing network connects all the execution units to the static RAM banks. The same network fulfills both roles of providing the processing elements with data and storing their output back into RAM: it has been designed to operate in either direction, from memory to PEs or from PEs to memory.

The complexity involved in constructing such a network stems from three requirements. First, the network has to be parameterizable, since it must scale to the number of PEs used in a simulation. Second, each PE has been designed with conditional read and conditional write commands, based on the condition codes set during execution of the program; the network therefore has to handle reading or writing a number of data elements that may be fewer than the number of processing elements available. Third, the network is constructed so that each PE does not need to track the next memory location from which data will be available. This is especially important because a PE does not necessarily read data from memory every time a read instruction is invoked, so when a PE executes a read instruction, it should not need to concern itself with addressing.

To accomplish these tasks, the network is divided into two components: a barrel shifter and a compactor. The barrel shifter reorders the data being transferred. The compactor removes any gaps that may exist in the data, which is useful when certain PEs do not produce any data; this way, data being stored to memory has no gaps. The same compactor can also be used to introduce gaps between data elements when certain PEs do not need to read new data. Together, the compactor and barrel shifter provide data to each PE and write back any output produced, while taking care of the three issues mentioned above.

Barrel Shifter

The barrel shifter has quite a simple design. It is important when reading in data, since the first physical processing element might not be the first one actually accepting data, and during the writeback stage, because it allows the network to reorder data to the next available slot in SRAM. It is built from many 2x2 multiplexers, arranged in log2(n) stages. This ensures that as the number of processing elements grows, the barrel shifter does not grow out of proportion. A barrel shifter with four inputs is shown in Figure 17.

Figure 17: Barrel Shifter network

Compactor

The compactor is a network that compresses data elements when writing to memory, removing any gaps. As mentioned before, such gaps can be produced when a processing element executes a put instruction: since each put instruction is based on a condition code, a processing element whose condition is not satisfied will not output data, so there can be gaps between the data elements written back to memory. Similarly, during a get instruction, if the condition code is not satisfied, the processing element will not read in new data. Thus, the same network is used to produce gaps

between data elements to account for PEs that do not request new data. The compactor is shown in Figure 18; it is also a network with log2(n) stages.

Figure 18: Compactor network

3.5 Assembler

An assembler has been written in C++ to convert programs written in the xstream assembly language into machine opcodes. The assembler is used in conjunction with a data assembler. In our project, we have not completed an external memory interface that would allow us to refresh the data values stored in SRAM, so when we run simulations we must assemble the data values as well: both the channel descriptions and the input data have to be specified. The final output of the assembler and the data assembler is the opcodes for the instructions and the actual values stored in the SRAM banks.
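Functionally, the two routing-network components of Section 3.4 amount to a rotation followed by a compaction. A Python sketch of this behaviour (illustrative only — the hardware is built from log2(n) multiplexer stages, not list operations):

```python
def barrel_shift(data, amount):
    # log2(n) stages of 2x2 multiplexers implement an n-way rotation;
    # functionally the result is a simple rotate.
    n = len(data)
    return [data[(i + amount) % n] for i in range(n)]

def compact(data, valid):
    # Remove the gaps left by PEs whose conditional PUT produced no
    # output, so data lands in consecutive memory slots.
    packed = [d for d, v in zip(data, valid) if v]
    return packed + [None] * (len(data) - len(packed))

outputs = ['a', None, 'c', None]         # PEs 1 and 3 were masked
assert compact(outputs, [True, False, True, False]) == ['a', 'c', None, None]
assert barrel_shift([0, 1, 2, 3], 1) == [1, 2, 3, 0]
```

Run in the opposite direction, the same pair of structures expands a dense sequence from memory into the lanes of only those PEs whose conditional GET requested data.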

3.6 External Memory

3.6.1 Overview

The memory subsystem was designed as a three-layer system, as depicted in Figure 19. The SDRAM model was obtained from McGill University. Attached to it is the controller, which handles row, column, and bank decoding, RAS and CAS timings, and the initial load and precharge times. The controller supports page reads and writes for high-bandwidth block transfers, which are necessary given the streaming architecture of the processor.

Figure 19: An overview of the external memory architecture (SDRAM model - mC-RAM bus - controller - P-mC bus - processor)

3.6.2 SDRAM Model

The SDRAM model has three basic modes. First is the uninitialized mode, before values have been loaded from disk. Next is the powerup mode, where power is applied to

memory. Last is the command mode, which indicates that the memory is ready to receive commands.

Figure 20: SDRAM Component

The commands that can be issued to the SDRAM model are listed below. They are issued via specific selections of the scsn, scasn, srasn, sa[10], and swen signals [19].

Table 7: Command Listings

  Command  Description
  DESL     No command
  NOP      Do nothing
  READ     Read burst of data
  WRITE    Write burst of data
  ACT      Activate row
  PRE      Precharge selected bank
  PALL     Precharge all banks
  MRS      Mode register select
  REF      Refresh

3.6.3 Memory Controller

The block diagram for the memory controller is given in Figure 21. The controller operates on two clocks, each corresponding to a separate bus. It accepts a clock input from the processor (the procclk signal, clocking the controller-processor bus) and has an arbitrarily set clock (via generics) for the controller-memory bus (not shown), which has been set to 100 MHz.

The memory controller takes in a 23-bit address, of which 13 bits form the row address and 8 the column address; the remaining two select the bank. The page size was set to 256 words. The data bus itself is 64 bits wide, which means the RAM has a total capacity of 64 MB. The r and w lines correspond to the read and write functions. Blocklen is the length of the burst transfer, and resetn is the active-low reset line. The receive_rdy line indicates that the processor is ready to receive data from the controller, the blockrdy line indicates that a block is ready for burst transfer, and the ready line indicates that the memory controller is ready to receive data from the processor.

Figure 21: Memory Controller Component

The controller states and their functions are presented below:

Figure 22: Memory Controller State Diagram

Table 8: Description of states in memory controller diagram

  State            Description
  mload            Initial state; loading from disk
  mprecharge       Precharge delay
  mmrs             Mode register select state
  mrefresh         Row refreshing
  mactive          Row ready for column access
  mread            Currently in burst read
  mwrite           Currently in burst write
  mnop             No operation
  mpagereaddelay   Page read delay
  mpagewritedelay  Page write delay
  mwritewait       Delay while internal controller buffer is filled
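The 23-bit address split described in Section 3.6.3 can be sketched as follows (Python for illustration; the report specifies only the field widths, so the field ordering below is an assumption):

```python
def decode_address(addr):
    # Split the controller's 23-bit address: 13 row bits, 2 bank bits,
    # and 8 column bits (a 256-word page). The ordering row|bank|column
    # is assumed -- the report gives only the widths.
    col  = addr & 0xFF             # 8-bit column address
    bank = (addr >> 8) & 0x3       # 2-bit bank select
    row  = (addr >> 10) & 0x1FFF   # 13-bit row address
    return row, bank, col

assert decode_address((5 << 10) | (2 << 8) | 7) == (5, 2, 7)

# Capacity check: 2**23 addresses x 64-bit (8-byte) words = 64 MB,
# matching the figure quoted in the report.
assert (2**23 * 8) // 2**20 == 64
```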

4. Applications

We wrote three applications to test the xstream processor. Together, these programs show that the xstream instruction set can accommodate most tasks that can be broken up into parallel streams. Some of the tests were fairly simple, exercising only the basic functionality of the processor, but this was sufficient to ensure that we met all the requirements. The first is a simple program that normalizes a set of vectors. The second generates a fractal image. The third is a much larger project, still in progress, that carries out the rasterization of an image.

4.1 Normalize

The task was to normalize a set of vectors. We set up three input channels and three output channels. The output produced at the end of the simulation gave us the normalized vectors, which we verified with a calculator.

4.2 Fractal Generation

The second program produces a Mandelbrot fractal image. This is an algorithm in which the data set produced is larger than the data set used as input. The output data was used to generate the final image using a Python program. The image generated by the simulation is shown in Figure 23.

Figure 23: Fractal image generated by simulation

4.3 Rasterization

Lastly, a rasterization program was attempted on the processor. Rasterization is the process of determining what colour each pixel on the screen should have, given a description of the scene. There are many ways of carrying out this process; the algorithm used in our case is one better suited to stream processing architectures. The entire process is broken up into stages, each of which is simulated on the processor. This is one of the better demonstrations that the xstream processor is suited to real-world applications: rasterization is carried out every day by the graphics cards in our computers.
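As an illustration of the Section 4.1 kernel, a host-side Python equivalent of the normalization test (the function name and sample data are ours; on the processor, each coordinate arrives as a separate channel, as required by the SRF):

```python
import math

def normalize_streams(xs, ys, zs):
    # Each element is independent, so every PE can normalize one
    # <x, y, z> record in parallel; the three coordinates are stored
    # as three separate channels.
    out_x, out_y, out_z = [], [], []
    for x, y, z in zip(xs, ys, zs):
        inv = 1.0 / math.sqrt(x * x + y * y + z * z)
        out_x.append(x * inv)
        out_y.append(y * inv)
        out_z.append(z * inv)
    return out_x, out_y, out_z

xs, ys, zs = normalize_streams([3.0, 0.0], [4.0, 2.0], [0.0, 0.0])
assert abs(xs[0] - 0.6) < 1e-12 and abs(ys[0] - 0.8) < 1e-12
```

The per-element loop body maps directly onto the MUL, ADD, SQRT, and DIV instructions of Table 2, with no cross-element dependencies.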


Digital System Design Using Verilog. - Processing Unit Design Digital System Design Using Verilog - Processing Unit Design 1.1 CPU BASICS A typical CPU has three major components: (1) Register set, (2) Arithmetic logic unit (ALU), and (3) Control unit (CU) The register

More information

SAE5C Computer Organization and Architecture. Unit : I - V

SAE5C Computer Organization and Architecture. Unit : I - V SAE5C Computer Organization and Architecture Unit : I - V UNIT-I Evolution of Pentium and Power PC Evolution of Computer Components functions Interconnection Bus Basics of PCI Memory:Characteristics,Hierarchy

More information

Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication

Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Introduction All processors offer some form of instructions to add, subtract, and manipulate data.

More information

Multiple Instruction Issue. Superscalars

Multiple Instruction Issue. Superscalars Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

More information

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor

More information

Like scalar processor Processes individual data items Item may be single integer or floating point number. - 1 of 15 - Superscalar Architectures

Like scalar processor Processes individual data items Item may be single integer or floating point number. - 1 of 15 - Superscalar Architectures Superscalar Architectures Have looked at examined basic architecture concepts Starting with simple machines Introduced concepts underlying RISC machines From characteristics of RISC instructions Found

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

Design and Implementation of a Super Scalar DLX based Microprocessor

Design and Implementation of a Super Scalar DLX based Microprocessor Design and Implementation of a Super Scalar DLX based Microprocessor 2 DLX Architecture As mentioned above, the Kishon is based on the original DLX as studies in (Hennessy & Patterson, 1996). By: Amnon

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

PowerPC 740 and 750

PowerPC 740 and 750 368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order

More information

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Computer Architecture Review. Jo, Heeseung

Computer Architecture Review. Jo, Heeseung Computer Architecture Review Jo, Heeseung Computer Abstractions and Technology Jo, Heeseung Below Your Program Application software Written in high-level language System software Compiler: translates HLL

More information

The Bifrost GPU architecture and the ARM Mali-G71 GPU

The Bifrost GPU architecture and the ARM Mali-G71 GPU The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions Lecture 7: Pipelining Contd. Kunle Olukotun Gates 302 kunle@ogun.stanford.edu http://www-leland.stanford.edu/class/ee282h/ 1 More pipelining complications: Interrupts and Exceptions Hard to handle in pipelined

More information

CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines

CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines Assigned April 7 Problem Set #5 Due April 21 http://inst.eecs.berkeley.edu/~cs152/sp09 The problem sets are intended

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S Lecture 6 MIPS R4000 and Instruction Level Parallelism Computer Architectures 521480S Case Study: MIPS R4000 (200 MHz, 64-bit instructions, MIPS-3 instruction set) 8 Stage Pipeline: first half of fetching

More information

ForwardCom: An open-standard instruction set for high-performance microprocessors. Agner Fog

ForwardCom: An open-standard instruction set for high-performance microprocessors. Agner Fog ForwardCom: An open-standard instruction set for high-performance microprocessors Agner Fog June 25, 2016 Contents 1 Introduction 4 1.1 Highlights............................... 4 1.2 Background..............................

More information

COSC4201. Prof. Mokhtar Aboelaze York University

COSC4201. Prof. Mokhtar Aboelaze York University COSC4201 Chapter 3 Multi Cycle Operations Prof. Mokhtar Aboelaze York University Based on Slides by Prof. L. Bhuyan (UCR) Prof. M. Shaaban (RTI) 1 Multicycle Operations More than one function unit, each

More information

Chapter 13 Reduced Instruction Set Computers

Chapter 13 Reduced Instruction Set Computers Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining

More information

Foundations of Computer Systems

Foundations of Computer Systems 18-600 Foundations of Computer Systems Lecture 4: Floating Point Required Reading Assignment: Chapter 2 of CS:APP (3 rd edition) by Randy Bryant & Dave O Hallaron Assignments for This Week: Lab 1 18-600

More information

Superscalar SMIPS Processor

Superscalar SMIPS Processor Superscalar SMIPS Processor Group 2 Qian (Vicky) Liu Cliff Frey 1 Introduction Our project is the implementation of a superscalar processor that implements the SMIPS specification. Our primary goal is

More information

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1> Chapter 5 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 5 Chapter 5 :: Topics Introduction Arithmetic Circuits umber Systems Sequential Building

More information

Cache Organizations for Multi-cores

Cache Organizations for Multi-cores Lecture 26: Recap Announcements: Assgn 9 (and earlier assignments) will be ready for pick-up from the CS front office later this week Office hours: all day next Tuesday Final exam: Wednesday 13 th, 7:50-10am,

More information

A Library of Parameterized Floating-point Modules and Their Use

A Library of Parameterized Floating-point Modules and Their Use A Library of Parameterized Floating-point Modules and Their Use Pavle Belanović and Miriam Leeser Department of Electrical and Computer Engineering Northeastern University Boston, MA, 02115, USA {pbelanov,mel}@ece.neu.edu

More information

Floating-Point Data Representation and Manipulation 198:231 Introduction to Computer Organization Lecture 3

Floating-Point Data Representation and Manipulation 198:231 Introduction to Computer Organization Lecture 3 Floating-Point Data Representation and Manipulation 198:231 Introduction to Computer Organization Instructor: Nicole Hynes nicole.hynes@rutgers.edu 1 Fixed Point Numbers Fixed point number: integer part

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

CS 1013 Advance Computer Architecture UNIT I

CS 1013 Advance Computer Architecture UNIT I CS 1013 Advance Computer Architecture UNIT I 1. What are embedded computers? List their characteristics. Embedded computers are computers that are lodged into other devices where the presence of the computer

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Arithmetic Unit 10032011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Recap Chapter 3 Number Systems Fixed Point

More information

An introduction to SDRAM and memory controllers. 5kk73

An introduction to SDRAM and memory controllers. 5kk73 An introduction to SDRAM and memory controllers 5kk73 Presentation Outline (part 1) Introduction to SDRAM Basic SDRAM operation Memory efficiency SDRAM controller architecture Conclusions Followed by part

More information

Execution/Effective address

Execution/Effective address Pipelined RC 69 Pipelined RC Instruction Fetch IR mem[pc] NPC PC+4 Instruction Decode/Operands fetch A Regs[rs]; B regs[rt]; Imm sign extended immediate field Execution/Effective address Memory Ref ALUOutput

More information

Review: MIPS Organization

Review: MIPS Organization 1 MIPS Arithmetic Review: MIPS Organization Processor Memory src1 addr 5 src2 addr 5 dst addr 5 write data Register File registers ($zero - $ra) bits src1 data src2 data read/write addr 1 1100 2 30 words

More information

ForwardCom: An open-standard instruction set for high-performance microprocessors. Agner Fog

ForwardCom: An open-standard instruction set for high-performance microprocessors. Agner Fog ForwardCom: An open-standard instruction set for high-performance microprocessors Agner Fog November 3, 2017 Contents 1 Introduction 2 1.1 Highlights.......................................... 2 1.2 Background.........................................

More information

1 Tomasulo s Algorithm

1 Tomasulo s Algorithm Design of Digital Circuits (252-0028-00L), Spring 2018 Optional HW 4: Out-of-Order Execution, Dataflow, Branch Prediction, VLIW, and Fine-Grained Multithreading uctor: Prof. Onur Mutlu TAs: Juan Gomez

More information

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,

More information

CPU Pipelining Issues

CPU Pipelining Issues CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! L17 Pipeline Issues & Memory 1 Pipelining Improve performance by increasing instruction throughput

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Floating Point Arithmetic

Floating Point Arithmetic Floating Point Arithmetic Clark N. Taylor Department of Electrical and Computer Engineering Brigham Young University clark.taylor@byu.edu 1 Introduction Numerical operations are something at which digital

More information

SH4 RISC Microprocessor for Multimedia

SH4 RISC Microprocessor for Multimedia SH4 RISC Microprocessor for Multimedia Fumio Arakawa, Osamu Nishii, Kunio Uchiyama, Norio Nakagawa Hitachi, Ltd. 1 Outline 1. SH4 Overview 2. New Floating-point Architecture 3. Length-4 Vector Instructions

More information

ECE 154A Introduction to. Fall 2012

ECE 154A Introduction to. Fall 2012 ECE 154A Introduction to Computer Architecture Fall 2012 Dmitri Strukov Lecture 10 Floating point review Pipelined design IEEE Floating Point Format single: 8 bits double: 11 bits single: 23 bits double:

More information

Appendix C: Pipelining: Basic and Intermediate Concepts

Appendix C: Pipelining: Basic and Intermediate Concepts Appendix C: Pipelining: Basic and Intermediate Concepts Key ideas and simple pipeline (Section C.1) Hazards (Sections C.2 and C.3) Structural hazards Data hazards Control hazards Exceptions (Section C.4)

More information

General Purpose Signal Processors

General Purpose Signal Processors General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Universität Dortmund. ARM Architecture

Universität Dortmund. ARM Architecture ARM Architecture The RISC Philosophy Original RISC design (e.g. MIPS) aims for high performance through o reduced number of instruction classes o large general-purpose register set o load-store architecture

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer

More information

Mainstream Computer System Components

Mainstream Computer System Components Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved

More information

Computer Systems Architecture I. CSE 560M Lecture 5 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 5 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 5 Prof. Patrick Crowley Plan for Today Note HW1 was assigned Monday Commentary was due today Questions Pipelining discussion II 2 Course Tip Question 1:

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information