Superscalar Architectures

We have examined basic architecture concepts, starting with simple machines, and introduced the concepts underlying RISC machines. From the characteristics of RISC instructions we found that, with proper control of the instruction cycle, we could be performing multiple phases of the instruction cycle simultaneously. We denote such control as pipelining and the structure as a pipeline.

If we view the pipeline as an abstraction, we can ask: is it possible to instantiate multiple instances of a pipeline and have such instances operating simultaneously? The answer is yes, contingent upon being able to decompose the program into multiple independent threads of execution and to handle any interthread data dependencies, similar to the scheme used in managing a single pipe. Such an architecture is denoted superscalar. We will examine it shortly; let's first trace the development path, beginning with the basic architectures.

Some Architectures - Scalar to Superscalar

Scalar Processor
The simplest class of processors is denoted scalar processors. Simply put, a scalar processor processes individual data items; an item may be a single integer or floating point number. Operations may involve multiple single data items - add x to y is still a scalar operation.

Pipelined Processor
Like the scalar processor, a pipelined processor processes individual data items; an item may be a single integer or floating point number.

The difference: multiple instructions are processed simultaneously. This is accomplished when different instructions are in different stages of the instruction cycle at the same time.

Superpipelined Processor
Similar in architecture to the basic pipelined scalar architecture. The difference: it takes advantage of the fact that many pipeline stages require less than half a clock cycle to complete. By doubling the internal clock frequency, it executes two tasks in one external clock cycle.

Vector Processor
Also known as SIMD - Single Instruction Multiple Data. It processes aggregates of data items of the same type; the aggregates may comprise integer or floating point numbers. A single instruction simultaneously operates on multiple data items. Examples include vectors (a vector is a collection of numbers or objects), arrays or matrices, and DSP operations such as the FFT.

Superscalar Processor
Which now brings us to the superscalar processor. A superscalar processor is a loose mixture of the scalar and vector processors, with pipelining added. From the scalar architecture: each instruction processes a single data item. From the vector architecture: redundant functional units within the CPU. With pipelining (a single pipeline) it can execute instructions concurrently - we can have several instructions in the pipeline at the same time. One may be doing arithmetic,

a second may be being decoded, and a third fetched. The key is that instructions enter the pipe in strict program order. In the absence of hazards, one instruction enters and one leaves each clock cycle; the implication is that the maximum throughput is one instruction per clock cycle.

If One is Good
We can take a more aggressive approach: extend the processor to support multiple processing units and handle several instructions in parallel at each processing stage. Such a design supports multiple independent pipelines, and several instructions can start execution on the same clock cycle - this is called multiple-issue. Each pipeline comprises multiple stages, so such a scheme permits each pipeline to simultaneously handle multiple instructions in various stages of completion (fetch, decode, execute, write). The processing aggregate can process multiple instruction streams simultaneously and achieve a throughput greater than one instruction per clock cycle. Such a scheme exploits what is called instruction-level parallelism; many modern machines utilize it, and such architectures are called superscalar.

A superscalar processor can simultaneously fetch multiple instructions. From such a set it attempts to find instructions that are independent and can therefore be executed in parallel via the constituent pipelines - integer vs. floating point, for example.

In the earlier discussion of pipelining we introduced the idea of an instruction queue.
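The dispatch unit's search for independent instructions can be sketched with a simple dependence test. This is an illustrative model only - the instruction encoding here (a destination register plus a set of source registers) is an assumption of the sketch, not any real machine's format:

```python
def independent(i1, i2):
    # Each instruction is modeled as (dest, {sources}).
    d1, s1 = i1
    d2, s2 = i2
    raw = d1 in s2   # i2 reads what i1 writes (true dependency)
    waw = d1 == d2   # both write the same register
    war = d2 in s1   # i2 overwrites a register i1 still reads
    return not (raw or waw or war)

# r3 = r1 + r2 and r5 = r1 * r4 are independent: dual-issue is safe.
print(independent(("r3", {"r1", "r2"}), ("r5", {"r1", "r4"})))  # True
# r3 = r1 + r2 and r6 = r3 - r4 have a true dependency on r3: serialize.
print(independent(("r3", {"r1", "r2"}), ("r6", {"r3", "r4"})))  # False
```

A real issue stage would apply checks of roughly this kind across the whole fetched group, in hardware, within a single cycle.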

To fully utilize the instruction queue, the processor must be able to fetch multiple instructions at the same time from the cache. For superscalar processors such a scheme is essential. Multiple-issue operation requires a wider cache bus and multiple execution units; the separate execution units typically support integer instructions and floating point instructions.

The relative performance of four of the architectures is illustrated in the following diagram.

[Timing diagram: instructions I1..I4 flowing through F, D, E, W stages over successive clock cycles for the scalar, pipelined scalar, superpipelined scalar, and superscalar pipelined designs.]

Examining the superpipelined and superscalar flows above:

Superpipelined
We see two pipeline stages completing per clock cycle. Stated alternately, the functions performed at each stage are split into two nonoverlapping parts,

each of which can execute in one half clock cycle. Such a pipeline is said to be of degree 2.

Superscalar
The design is capable of executing two instances of each stage in parallel. Higher-degree implementations of each approach are possible.

Both designs illustrated above have the same number of instructions executing at the same time in steady state; the superpipelined design lags at program start and at branches.

The high-level organization of a superscalar machine with pipelined functional units is given in the accompanying diagram.

[Block diagram: two functional units - integer and floating point - each fed by its own register file (integer register file, floating point register file), both connected to memory.]

A finer grained view gives the following simple architecture.

[Block diagram: cache feeding a fetch unit and instruction queue; a dispatch unit issuing to an integer unit and a floating point unit; buffers; write back.]

There's Danger Ahead
Earlier studies taught us about the impact of hazards on performance. In a superscalar processor the effects are more pronounced.
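The relative throughput of these designs can be illustrated with back-of-the-envelope cycle counts. This is a deliberately simplified model, assuming a 4-stage (F, D, E, W) pipeline with no hazards or branches; the formulas are illustrative, not exact for any real machine:

```python
import math

def scalar_cycles(n, k=4):
    # Pure scalar: each instruction completes all k stages before the next starts.
    return n * k

def pipelined_cycles(n, k=4):
    # Single pipeline: fill the pipe once, then one completion per cycle.
    return k + (n - 1)

def superpipelined_cycles(n, k=4, degree=2):
    # Doubled internal clock: 'degree' stage-halves complete per external cycle.
    return k + (n - 1) / degree

def superscalar_cycles(n, k=4, width=2):
    # 'width' instructions occupy each stage in parallel.
    return k + math.ceil(n / width) - 1

print(scalar_cycles(8))          # 32
print(pipelined_cycles(8))       # 11
print(superpipelined_cycles(8))  # 7.5
print(superscalar_cycles(8))     # 7
```

Even in this toy model the superpipelined and superscalar counts converge in steady state, with the superpipelined design slightly behind, consistent with the comparison above.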

Some tools to address the problems:
- Identify and handle dependent instructions as discussed earlier.
- Reorder instructions - the instruction ordering may differ from the original code.
- Eliminate unnecessary dependencies by using additional registers and renaming register references.
- Utilize traditional branch prediction methods to improve efficiency.

In the design above, the compiler could seek to interleave integer and floating point instructions, which could facilitate keeping both execution units busy most of the time. Assume three clock cycles for a floating point operation; with no hazards we could achieve an instruction flow as follows.

[Timing diagram: I1 (Fadd), I2 (Iadd), I3 (Fsub), I4 (Isub); the floating point instructions occupy the execute stage for three cycles while the integer instructions interleave around them.]

Let's now look at some of the fundamental limitations facing systems implementing instruction-level parallelism. We will limit the discussion to two integer execution units; the base case has no data dependencies.

Data Dependencies

[Timing diagram: with no dependencies, I1 (Iadd) and I2 (Isub) flow through F, D, E, W in lockstep.]

Problem 1 - Inter-Instruction Data Dependencies
Observe in the diagram above that instructions are dispatched in program order but executed out of order.
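The integer/floating-point interleaving mentioned above can be sketched as a trivial scheduling pass. This is a hypothetical illustration of the idea, not a real compiler algorithm, and it ignores any dependencies between the two instruction streams:

```python
def interleave(int_instrs, fp_instrs):
    # Alternate floating point and integer instructions so that, with no
    # hazards, both execution units can be kept busy on successive cycles.
    out = []
    i = f = 0
    while i < len(int_instrs) or f < len(fp_instrs):
        if f < len(fp_instrs):
            out.append(fp_instrs[f]); f += 1
        if i < len(int_instrs):
            out.append(int_instrs[i]); i += 1
    return out

# Reproduces the ordering in the flow above: Fadd, Iadd, Fsub, Isub.
print(interleave(["Iadd", "Isub"], ["Fadd", "Fsub"]))
```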

This can lead to problems if dependencies exist amongst the instructions: if I2 depends upon results from I1, I2 must be delayed until I1 completes. If handled as discussed earlier, there is no reason for problems or potential delays. With an inter-instruction data dependency the flow diagram becomes:

[Timing diagram: I1 (Iadd) flows F, D, E, W; I2 (Isub) needed a data word that I1 was modifying and must be delayed by one cycle.]

Problem 2 - Procedural Dependencies
Consider a branch instruction. The instructions following the branch statement, whether it is taken or not, depend upon the branch and cannot be executed until the branch has executed. Such a situation leads to the following instruction flow:

[Timing diagram: I1 (Iadd) and I2 (Ibr) proceed normally; I3 through I6 are delayed until the branch resolves.]

Problem 3 - Resource Conflict
A resource conflict arises when two or more instructions require the same resource at the same time. Examples: memories,

caches, busses, and execution units. From the pipeline's perspective a resource conflict and a data dependency look similar. The major difference: a resource can be duplicated, but a true data dependency cannot. A conflict can also be addressed by pipelining the execution unit. A resource conflict affects instruction flow as shown in the next diagram.

[Timing diagram: I1 (Iadd) proceeds; I2 (Isub) is delayed while waiting for the contended resource.]

Problem 4 - Exceptions to Normal Flow
Exceptions present a bit more of a challenge. Examples include a bus error, an illegal opcode, divide by zero, and interrupts.

4a. Exceptions
Consider the instruction flow above and let I2 depend upon the results of I1; I2 completes at time t4. If I1 causes an exception, the program is in an inconsistent state: the PC points to the instruction that caused the exception, yet succeeding instruction(s) were executed to completion. If this is permitted, the processor has imprecise exceptions. To ensure a consistent state under an exception, instruction results must be written in program order. Here we must delay the I2 write until time t6,

which implies that the execution unit must retain the result until t6, which in turn delays acceptance of I4 for execution until t6. This is illustrated in the next diagram.

[Timing diagram: I1 (Fadd), I2 (Iadd), I3 (Fsub), I4 (Isub); I2's write is held back to t6 so that results are written in program order, delaying I4.]

If an exception occurs during execution, all subsequent partially executed instructions must be discarded. This is called a precise exception.

4b. Interrupts
When an interrupt occurs, the dispatch unit must stop reading new instructions from the instruction queue, and the instructions remaining in the queue must be discarded. All pending instructions continue to completion. The number of pending instructions is not deterministic; thus we get variation in the response time to an interrupt.

Problem 5 - Execution Completion
If out-of-order execution can be utilized, the execution unit can be permitted to execute instructions as soon as possible. However, the constraint of program-order completion to support precise exceptions creates a conflict: a precise exception requires a consistent state before and after the exception. Examining the root cause of the problem: the delayed storage of results.
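The precise-exception rule above - discard the faulting instruction and every younger, partially executed instruction - can be sketched as follows. The in-flight window is modeled as a simple list in program order, which is an assumption of the sketch, not a hardware structure:

```python
def precise_flush(in_flight, faulting):
    # Instructions older than the faulting one are allowed to complete;
    # the faulting instruction and everything younger is discarded.
    idx = in_flight.index(faulting)
    return in_flight[:idx]

window = ["I1", "I2", "I3", "I4"]      # in flight, program order
print(precise_flush(window, "I2"))     # ['I1']
```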

Such delays place demands on the resources they block. From previous work we can incorporate temporary storage to briefly hold a result so it can be written later, enabling the resource to be freed up. The contents of the temp registers are appropriately transferred to the permanent registers later, ensuring the desired program order. We can modify the instruction flow from above to reflect temporary storage:

[Timing diagram: I1 (Fadd), I2 (Iadd), I3 (Fsub), I4 (Isub); I2 and I4 write first to temp registers (TW2, TW4) and the results are copied to the permanent registers (W2, W4) later, in program order.]

The temp register now serves as a surrogate for the permanent register and is treated as the permanent register until the transfer occurs. Assume the target for W2 is register R1: temp register TW2 assumes that identity during times t6 and t7, and its content would be forwarded to any target that needed the value from R1 during that time. Such a technique is called register renaming. The surrogate identity only applies to instructions that follow I2 in program order; all instructions that need R1 and precede I2 will use the real R1, whose value will not have changed yet.

When out-of-order execution is permitted, control must be implemented to ensure in-order commitment. Such a scheme utilizes a queue called the reorder buffer.
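Register renaming can be sketched with a simple rename map. An illustrative model: the temp-register names (T0, T1, ...) and the map itself are assumptions of the sketch, not the hardware's actual structures:

```python
class Renamer:
    def __init__(self):
        self.map = {}       # architectural register -> current surrogate
        self.next_tmp = 0

    def read(self, reg):
        # Readers that follow the renamed write see the surrogate;
        # earlier instructions would still use the real register.
        return self.map.get(reg, reg)

    def write(self, reg):
        # Every new write gets a fresh temp register, removing the
        # WAW/WAR name dependencies on the architectural register.
        tmp = "T{}".format(self.next_tmp)
        self.next_tmp += 1
        self.map[reg] = tmp
        return tmp

r = Renamer()
print(r.write("R1"))   # T0: I2's result for R1 is held in temp T0
print(r.read("R1"))    # T0: later readers are forwarded the surrogate
print(r.write("R1"))   # T1: a later write to R1 gets its own temp
```

When a renamed instruction retires, the surrogate's content would be copied to the real register and the temp freed - which is exactly where the reorder buffer, described next, comes in.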

The reorder buffer determines which instruction(s) should be committed next. Instructions are entered into the queue in program order. When an instruction reaches the head of the queue and its execution has completed, its results are copied from the temp registers to the permanent registers, the instruction is removed from the queue, and all resources assigned to the instruction - including the temp registers - are released. The instruction is designated retired.

An important inference: because an instruction can only be retired from the head of the queue, all preceding dispatched instructions must already have been retired. Thus instructions may complete execution out of order, but they are retired in program order.

Problem 6 - Dispatching
As we find in various operating system scheduling algorithms, when an instruction is dispatched the dispatch unit must ensure that all necessary resources are available - for example, any temp registers and the proper location in the reorder buffer. The possibility of deadlock exists. Consider the following sequence of events:
- I2 is delayed because of a cache miss.
- The delay for I2 results in a delay for I4, since the integer execution unit is allocated to I2.
- I4 shares no resources with I5, so I5 is dispatched and executed; its result is temporarily held.
- I2 finishes, then I4 finishes. All is good.

Now consider a slight change in events; assume a single temp register:
- I2 is delayed because of a cache miss.

- The delay for I2 results in a delay for I4, since the integer execution unit is allocated to I2.
- I4 needs the temp register.
- I5 is dispatched and executed; its result is temporarily held in the temp register.
- I2 finishes.
- I4 is blocked by its need for the temp register. The temp register will be freed when I5 is retired, but I5 cannot be retired and free its resources until I4 has retired.
All is not good - we have a deadlock.

A Quick Look at a SPARC
The SPARC architecture is the basis for the processors used in Sun workstations; one implementation is called the UltraSPARC II. SPARC - Scalable Processor ARChitecture - was first announced in 1987, based upon ideas developed at Berkeley in the early 1980s. The specification is controlled by an international consortium, which has introduced new versions every few years; the latest version is SPARC-V9. Now, with Sun being sold, the future of SPARC is uncertain. It is a RISC-style architecture. The main building blocks of the UltraSPARC II are given as follows.

[Block diagram: UltraSPARC II. A system bus and external cache connect through the memory management unit (with itlb and dtlb); a prefetch and dispatch unit with I-cache and instruction buffer feeds the pipelines; load and store queues connect to the D-cache; floating point registers feed the floating point execution unit and integer registers feed the integer execution unit.]

The two execution units comprise two parallel pipelines of six stages each. Stages 0..3 perform the operation specified by the instruction; stages 4..5 check for exceptions and store the result of the instruction. The pipeline organization is given as:

Fetch - Decode - Instruction Buffer - Group, feeding:
  Integer pipes (two):        E   C   N1  N2  Ck  W
  Floating point pipes (two): R   E1  E2  E3  Ck  W
(E - Execute, N - Delay, Ck - Check, W - Write, R - Register)

Observe that four instruction pipes operate in parallel.

The Prefetch and Dispatch unit fetches up to 4 instructions from the instruction cache and partially decodes them, determining whether an instruction is a branch. It uses 4-state branch prediction, as discussed earlier. For each 4 instructions in the instruction cache there is a tag field called next address; the predicted value is recorded in the next address field. The fetched instructions are stored in the instruction buffer, which will hold up to 12 instructions.

Grouping Block
Selects a group of up to 4 instructions to be executed in parallel and dispatches them to the integer and floating point units.

E Stage
This is the execute stage; ALU operations etc. are performed.

C Stage
Parts of the buffer are transferred to a register file called the Annex, which contains the temp registers used in renaming. Generation of the condition codes takes place and flags are set.

N1 and N2 Stages
These are simply delays; the intent is to equalize the temporal lengths of the integer and floating point pipelines.

Ck Stage
Checks for exception conditions and interrupts.

R Stage
Register operands are fetched in the floating point unit.

E1..E3 Stages
The floating point operation is executed.

W Stage
Results are written to either the registers or the cache.

Summary
We reviewed several common architectures based upon scalar processing and extended the concept of pipelining to multiple pipes. We introduced and explored the superscalar architecture based upon such an approach, identified the key elements of a superscalar processor, identified the strengths and limitations of the superscalar architecture, discussed at a high level methods for dealing with those limitations, and examined at a high level a real-world superscalar implementation, the UltraSPARC II.