Superscalar Processor Design: Superscalar Organization
Virendra Singh, Indian Institute of Science, Bangalore
virendra@computer.org
Lecture 26, SE-273: Processor Design
Superscalar Organization
[Figure: pipeline stages and buffers: Fetch -> Instruction buffer -> Decode -> Dispatch buffer -> Dispatch -> Reservation station -> Issue -> Execute -> Finish -> Re-order/Completion buffer -> Complete -> Store buffer -> Retire]
Apr 14, 2008, SE-273@SERC
Dynamic Execution Core
[Figure: the dispatch stage allocates ROB entries and rename registers (ARF/RRF) from the dispatch buffer; instructions issue from reservation stations to the execute units; results are forwarded to the reservation stations and the RRF, and register write-back occurs at completion. The ROB is managed as a queue: dispatch is the "takeoff" into the ROB, completion the "landing".]
Dynamic Execution Core
For an instruction to dispatch, three resources must be available:
- a rename register
- a reservation station entry
- a re-order buffer entry
If any of the three is unavailable, instruction dispatch stalls. Dispatch is done via a complex routing network (less expensive than a full crossbar).
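The three-way resource check above can be sketched in a few lines of Python (the function and parameter names are hypothetical, chosen only for illustration):

```python
def can_dispatch(free_rename_regs, free_rs_entries, free_rob_entries):
    """Dispatch proceeds only if all three resources are available:
    a rename register, a reservation-station entry, and a ROB entry.
    If any count is zero, dispatch stalls this cycle."""
    return free_rename_regs > 0 and free_rs_entries > 0 and free_rob_entries > 0

# All three resources available: the instruction may dispatch.
assert can_dispatch(4, 2, 8) is True
# ROB full: dispatch stalls even though rename regs and RS entries remain.
assert can_dispatch(4, 2, 0) is False
```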
Reservation Station
[Figure: one RS entry. Each of the two operand fields has a valid bit and tag-match logic that compares pending operand tags against the tag buses; operand values arrive from the dispatch slots or the forwarding buses. A busy bit marks an occupied entry, and the entry becomes ready when both operands are valid.]
Reservation Station
The reservation station can be quite complex to implement:
- It must support many possible sources, including all the dispatch slots and the forwarding buses, so the data routing network on its input side can be quite complex.
- While waiting, every RS operand field with a pending operand must continuously compare its tag against potentially multiple tag buses. This portion is known as the wakeup logic.
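The wakeup step can be illustrated with a minimal Python sketch of a single RS entry (class and tag names are hypothetical; a real implementation compares against several tag buses in parallel every cycle):

```python
class RSEntry:
    """One reservation-station entry: two operands, each either a
    ready value or a tag naming the instruction that will produce it."""
    def __init__(self, tag1, val1, tag2, val2):
        # A tag of None means the operand value is already valid.
        self.tags = [tag1, tag2]
        self.vals = [val1, val2]

    def wakeup(self, bus_tag, bus_val):
        """Compare pending operand tags against a broadcast tag bus and
        capture the value on a match (bus_tag is assumed non-None)."""
        for i in range(2):
            if self.tags[i] == bus_tag:
                self.vals[i] = bus_val
                self.tags[i] = None

    def ready(self):
        # The entry may issue once both operands are valid.
        return all(t is None for t in self.tags)

# Operand 1 waits on tag "T5"; operand 2 is already valid with value 7.
e = RSEntry(tag1="T5", val1=None, tag2=None, val2=7)
e.wakeup("T5", 42)   # result for T5 is broadcast on a forwarding bus
assert e.ready() and e.vals == [42, 7]
```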
Reservation Station
[Figure: an allocate unit (driven by dispatch) selects a free entry to be allocated, and an issuing unit selects a ready entry to be issued; each entry carries busy, issuing, and ready bits.]
Re-Order Buffer (ROB)
[Figure: each ROB entry holds Busy (B), Issued (I), Finished (F), Instruction Address (IA), Rename Register (RR), Speculative (S), and Valid (V) fields. A tail pointer marks the next entry to be allocated; a head pointer marks the next instruction to complete.]
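The ROB's queue discipline, allocation at the tail, in-order completion from the head, can be sketched as follows (a simplified model with hypothetical names, tracking only a finished bit per entry):

```python
from collections import deque

class ROB:
    """Reorder buffer managed as a FIFO queue: dispatch allocates at
    the tail, completion retires from the head, in program order."""
    def __init__(self, size):
        self.size = size
        self.entries = deque()   # each entry: [pc, finished?]

    def allocate(self, pc):
        if len(self.entries) == self.size:
            return None          # ROB full: dispatch stalls
        entry = [pc, False]
        self.entries.append(entry)   # tail pointer advances
        return entry

    def finish(self, entry):
        entry[1] = True          # execution finished, not yet completed

    def complete(self):
        """Retire the oldest instruction only once it has finished."""
        if self.entries and self.entries[0][1]:
            return self.entries.popleft()[0]   # head pointer advances
        return None

rob = ROB(4)
a = rob.allocate(0x100)
b = rob.allocate(0x104)
rob.finish(b)                    # the younger instruction finishes first
assert rob.complete() is None    # but cannot complete out of order
rob.finish(a)
assert rob.complete() == 0x100   # in-order completion from the head
```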
Dynamic Instruction Scheduler
[Figure: data-captured scheduling window (reservation station). Operand values are copied from the register file into the RS at dispatch; results from the functional units are forwarded both to the RS (wakeup) and to the register file (register update).]
Dynamic Instruction Scheduler
[Figure: non-data-captured scheduling window (reservation station). Only wakeup tags go to the RS; operand values are read from the register file on the way to the functional units, and results are forwarded to the register file.]
Memory Data Flow Techniques
Memory data flow techniques move data between memory and the register file. The long latency of memory operations makes them a bottleneck. The operations involved are address generation, address translation, and the data read/write itself.
Memory Data Flow Techniques
Ordering of memory accesses: out-of-order execution of load instructions is a primary source of performance gain. Two techniques exploit this, load bypassing and load forwarding. In both cases, earlier execution of the load instruction is achieved.
Load/Store Processing
[Figure: loads and stores issue from the reservation station to a store unit and a load unit, each performing address generation and address translation; the load unit then accesses memory. Stores wait in a finished store buffer until completion and drain from a completed store buffer to the data cache.]
Load Bypassing
Out-of-order execution of loads is a primary source of performance gain. In the dynamic instruction sequence
Store X ... Store Y ... Load Z
load bypassing executes Load Z ahead of the two stores.
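The safety condition for bypassing can be sketched as a simple address check against the buffered stores (hypothetical names; real hardware does an associative tag match against the store buffers):

```python
def can_bypass(load_addr, pending_store_addrs):
    """Load bypassing: a load may execute ahead of earlier stores only
    if its address matches none of the stores still in the buffers."""
    return load_addr not in pending_store_addrs

# Load Z may bypass Store X and Store Y: no address match.
assert can_bypass("Z", ["X", "Y"]) is True
# A load from X must not bypass a still-pending store to X.
assert can_bypass("X", ["X", "Y"]) is False
```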
Load Bypassing
[Figure: the load's address is tag-matched against the addresses in the (finished and completed) store buffers. If there is no match, the load proceeds and the data from the cache updates the destination register.]
Load Forwarding
In the dynamic instruction sequence
Store X ... Store Y ... Load X
load forwarding sends the store data for X directly to the load.
Load Forwarding
[Figure: the load's address is tag-matched against the store buffer entries. On a match, the buffered store data is forwarded directly to the load's destination register instead of being read from the data cache.]
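Bypassing and forwarding combine into one lookup: check the store buffer first, forward on a match, otherwise read the cache. A minimal Python sketch (hypothetical names; the store buffer is modeled oldest-first, so the youngest matching store must supply the value):

```python
def load_value(load_addr, store_buffer, data_cache):
    """Load forwarding: on an address match against the store buffer,
    the buffered store data is forwarded directly to the load;
    otherwise the load bypasses the stores and reads the data cache."""
    for addr, data in reversed(store_buffer):   # youngest store first
        if addr == load_addr:
            return data           # forward store data to the load
    return data_cache[load_addr]  # no match: access the cache

cache = {"X": 1, "Y": 2, "Z": 3}
buf = [("X", 10), ("Y", 20)]      # Store X, then Store Y, still buffered
assert load_value("X", buf, cache) == 10   # forwarded from the store buffer
assert load_value("Z", buf, cache) == 3    # bypasses both stores, reads cache
```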
Load/Store Handling
[Figure: loads that were allowed to execute early are tag-matched against store addresses at store completion. On a match, the aliased load and all trailing instructions are flushed. Architected registers are updated at completion.]
Advanced Instruction Flow Techniques
[Figure: the fetch address register (FAR) drives the I-cache through the fetch-address mux (FA Mux, with a +4 sequential path). The branch history table (BHT) and branch target address cache (BTAC) are looked up in parallel, each supplying a prediction; the branch (BR) unit updates both when branches resolve.]
Advanced Instruction Flow Techniques
The PowerPC 604 used both a BHT and a BTAC. The BTAC is a 64-entry fully associative cache; the BHT is a 512-entry direct-mapped table. Both are accessed during the fetch stage using the current instruction fetch address in the PC. The BTAC responds in one cycle; the BHT responds in two cycles.
Advanced Instruction Flow Techniques
If a hit occurs in the BTAC, indicating the presence of a branch instruction in the current fetch group, the branch is predicted taken and the target address retrieved from the BTAC is used in the next fetch cycle. The PowerPC 604 fetches four instructions per cycle, so a fetch group can contain multiple branches; the BTAC entry indexed by the fetch address contains the target address of the first branch in the fetch group that is predicted taken. During the second cycle, in the decode stage, the history bits retrieved from the BHT are used to generate a history-based prediction for the same branch.
Advanced Instruction Flow Techniques
If the BHT prediction agrees with the taken prediction made by the BTAC, the earlier prediction is allowed to stand. If the BHT disagrees with the BTAC prediction, the BTAC prediction is annulled and fetching from the fall-through (not-taken) path is initiated: the BHT prediction overrules the BTAC prediction. When the branch is resolved, the BHT is updated. Based on its updated contents, the BHT updates the BTAC, leaving the entry in place if the branch is to be predicted taken next time, or deleting it if the branch is predicted not taken next time.
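The override rule above can be sketched as a tiny decision function (hypothetical names; this abstracts away the one-cycle vs. two-cycle timing and considers only the final outcome):

```python
def final_prediction(btac_hit, bht_taken):
    """The BTAC gives a fast taken prediction in cycle 1; the BHT's
    history-based prediction arrives in cycle 2 and overrules it.
    Returns (predict_taken, btac_prediction_annulled)."""
    if btac_hit and not bht_taken:
        # BHT disagrees: annul the BTAC prediction, refetch fall-through.
        return False, True
    return bht_taken, False

# BHT agrees with the BTAC's taken prediction: it stands.
assert final_prediction(btac_hit=True, bht_taken=True) == (True, False)
# BHT disagrees: the taken prediction is annulled.
assert final_prediction(btac_hit=True, bht_taken=False) == (False, True)
```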
Advanced Instruction Flow Techniques
Two-level adaptive branch prediction can potentially achieve better than 95% accuracy and can adapt to a changing dynamic context. Two-level prediction uses a set of history tables, the pattern history tables; the context is determined by a stored pattern of recently executed branch outcomes.
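A minimal sketch of the two-level scheme, assuming a single global history register indexing one pattern history table of 2-bit saturating counters (one of several possible two-level configurations; class and parameter names are hypothetical):

```python
class TwoLevelPredictor:
    """Two-level adaptive prediction: a branch-history register records
    recent outcomes, and its pattern indexes a pattern history table
    (PHT) of 2-bit saturating counters."""
    def __init__(self, history_bits=4):
        self.history = 0
        self.mask = (1 << history_bits) - 1
        self.pht = [1] * (1 << history_bits)   # init: weakly not-taken

    def predict(self):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.pht[self.history] >= 2

    def update(self, taken):
        c = self.pht[self.history]
        self.pht[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the history register (the "context").
        self.history = ((self.history << 1) | int(taken)) & self.mask

p = TwoLevelPredictor()
for _ in range(8):          # train on an always-taken branch
    p.update(True)
assert p.predict() is True  # the all-taken pattern now predicts taken
```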
Advanced Register Data Flow Techniques
An instruction's scheduled execution time is determined by its position in the data flow graph (DFG), and the height of the DFG is a lower bound on program execution time. True data dependences are therefore the bottleneck. Can we go beyond this limit?
Value Reuse
Value locality captures the empirical observation that a limited set of unique values constitutes the majority of the values produced and consumed by real programs, much as caches exploit locality of reference. Two techniques exploit value locality: instruction reuse (non-speculative) and value prediction (speculative).
Instruction Reuse
Memoization short-circuits a complex computation by dynamically recording its outcome; subsequent instances can obtain the result by table lookup. Instruction reuse (IR) is a hardware implementation of memoization. Value prediction, in contrast, forecasts the full 32/64-bit value and requires a much wider history.
Instruction Reuse
Causes of value locality lie in the general nature of program implementation: data redundancy, computed branches, base registers in loads/stores, register spill code, and convergent algorithms. The degree of value locality is also affected by the compiler.
Instruction Reuse
The outcome of executing an individual instruction, or a set of instructions, is stored in a history structure for future use. The set of instructions can be defined by either control flow or data flow. The history structure must have a mechanism that guarantees its contents remain coherent with subsequent program execution, and a lookup mechanism that allows subsequent instances to be checked against the stored instances.
Instruction Reuse
A hit or match during this lookup triggers the reuse mechanism, which allows the processor to skip execution of the reuse candidates. The processor eliminates the structural and data dependences caused by the reuse candidates and is able to fast-forward to subsequent program instructions.
Instruction Reuse
[Figure: reuse flow. Fetch an instruction; if it is not a reuse candidate (reuse buffer miss), execute it and record the outcome. If it is a candidate, apply the reuse test: on success (preconditions match a prior instance), reuse the prior result; on failure, execute the instruction and record the outcome.]
Instruction Reuse
[Figure: reuse buffer structure. Entries are indexed by the PC of the reuse candidate and hold a PC tag, source operands (Op1, Op2), a result, an address (for memory instructions), and a valid bit. The candidate's source operands from the register file are compared against the stored operands; on a match the stored result is reused. All stores check the buffer for a matching address and invalidate matching entries.]
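The lookup, reuse test, and store-driven invalidation can be sketched together in Python (class and field names are hypothetical; a real reuse buffer is a tagged hardware table, not a dictionary):

```python
class ReuseBuffer:
    """Hardware-memoization sketch: results indexed by PC, reusable
    only if the current source operands match the recorded instance."""
    def __init__(self):
        self.table = {}   # pc -> (op1, op2, result, address)

    def record(self, pc, op1, op2, result, address=None):
        # address is set only for memory instructions (loads).
        self.table[pc] = (op1, op2, result, address)

    def lookup(self, pc, op1, op2):
        """Reuse test: hit only if the preconditions (operands) match."""
        entry = self.table.get(pc)
        if entry and entry[0] == op1 and entry[1] == op2:
            return entry[2]   # reuse the prior result, skip execution
        return None           # miss or failed test: execute normally

    def invalidate_stores(self, store_addr):
        """Coherence: a store invalidates any recorded load result
        for the same address."""
        self.table = {pc: e for pc, e in self.table.items()
                      if e[3] != store_addr}

rb = ReuseBuffer()
rb.record(pc=0x40, op1=3, op2=5, result=15)
assert rb.lookup(0x40, 3, 5) == 15     # same operands: reuse
assert rb.lookup(0x40, 3, 6) is None   # operands differ: re-execute
```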
Reuse Mechanism
Reuse candidates (whether individual instructions or groups of instructions) must inject their results into the processor's architectural state. This requires adding a write port to the already heavily ported physical register file. The instruction wakeup and scheduling logic must also be modified to accommodate reused instructions with effectively zero cycles of result latency.
Reuse Mechanism
A reuse candidate must still enter the processor's ROB to maintain support for precise exceptions, but must simultaneously bypass the issue queue or reservation station; this nonstandard behaviour adds control path complexity. Reused memory instructions must still be tracked in the processor's load/store queue (LSQ) to maintain correct memory reference ordering. Since LSQ entries are typically updated after instruction issue, based on addresses generated during execution, this may entail additional datapath and LSQ ports to allow updates from an earlier pipeline stage.
Value Prediction
[Figure: the PC of the predicted instruction indexes a classification table (CT), whose valid bit and prediction history decide whether to predict, and a value prediction table (VPT), whose value history supplies the predicted value. The prediction outcome updates the CT; the computed value updates the VPT.]
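The CT/VPT interaction can be sketched as follows, assuming a simple last-value predictor gated by a confidence counter (hypothetical names and a deliberately simplified confidence policy):

```python
class ValuePredictor:
    """Sketch of a value predictor: a classification table (CT) of
    confidence counters gates predictions from a value prediction
    table (VPT) holding the last value produced per PC."""
    def __init__(self, threshold=2):
        self.ct = {}       # pc -> confidence counter
        self.vpt = {}      # pc -> last observed value
        self.threshold = threshold

    def predict(self, pc):
        if self.ct.get(pc, 0) >= self.threshold:
            return self.vpt[pc]   # confident: speculate on this value
        return None               # not confident: make no prediction

    def update(self, pc, actual):
        if self.vpt.get(pc) == actual:
            self.ct[pc] = self.ct.get(pc, 0) + 1   # value repeated
        else:
            self.ct[pc] = 0                        # changed: reset confidence
        self.vpt[pc] = actual

vp = ValuePredictor()
vp.update(0x80, 7); vp.update(0x80, 7); vp.update(0x80, 7)
assert vp.predict(0x80) == 7      # value locality: the value repeats
assert vp.predict(0x84) is None   # unseen PC: no prediction made
```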
Thank You