Superscalar Architectures

We have examined basic architecture concepts, starting with simple machines, and introduced the concepts underlying RISC machines. From the characteristics of RISC instructions we found that, with proper control of the instruction cycle, we could be performing multiple phases of the instruction cycle simultaneously. We denote such control as pipelining and the structure as a pipeline.

If we view the pipeline as an abstraction, we can ask: is it possible to instantiate multiple instances of a pipeline and have such instances operating simultaneously? The answer is yes, contingent upon being able to decompose the program into multiple independent threads of execution and to handle any interthread data dependencies, similar to the scheme used in managing a single pipe. Such an architecture is denoted superscalar. We will examine it shortly; let's first trace the development path, beginning with the basic architectures.

Some Architectures - Scalar to Superscalar

Scalar Processor
The simplest class of processors is denoted scalar processors. Simply put, a scalar processor processes individual data items; an item may be a single integer or floating point number. Operations may involve multiple single data items - add x to y is still a scalar operation.

Pipelined Processor
Like the scalar processor, a pipelined processor processes individual data items; an item may be a single integer or floating point number.

The difference: multiple instructions are processed simultaneously. This is accomplished when different instructions are in different stages of the instruction cycle at the same time.

Superpipelined Processor
Similar in architecture to the basic pipelined scalar architecture. The difference: it takes advantage of the fact that many pipeline stages require less than half a clock cycle to complete. By doubling the internal clock frequency, it executes two tasks in one external clock cycle.

Vector Processor
Also known as SIMD - Single Instruction Multiple Data. It processes aggregates of data items of the same type; the aggregates may comprise integer or floating point numbers. A single instruction simultaneously operates on multiple data items. Examples include vectors (a vector is a collection of numbers or objects), arrays or matrices, and DSP operations such as the FFT.

Superscalar Processor
Which now brings us to the superscalar processor. A superscalar processor is a loose mixture of the scalar and vector processors, with pipelining added. From the scalar architecture: each instruction processes a single data item. From the vector architecture: redundant functional units within the CPU. With pipelining (a single pipeline) it can execute instructions concurrently - we can have several instructions in the pipeline at the same time. One may be doing arithmetic,

a second may be being decoded, and a third fetched. The key is that instructions enter the pipe in strict program order. In the absence of hazards, one instruction enters and one leaves each clock cycle; the implication is that the maximum throughput is one instruction per clock cycle.

If One is Good
We can take a more aggressive approach: extend the processor to support multiple processing units and handle several instructions in parallel at each processing stage. Such a design supports multiple independent pipelines, and several instructions can start execution on the same clock cycle - this is called multiple-issue. Each pipeline comprises multiple stages, so such a scheme permits each pipeline to simultaneously handle multiple instructions in various stages of completion (fetch, decode, execute, write). The processing aggregate can process multiple instruction streams simultaneously and achieve a throughput greater than one instruction per clock cycle. Such a scheme exploits what is called instruction-level parallelism; many modern machines utilize it, and such architectures are called superscalar.

A superscalar processor can simultaneously fetch multiple instructions. From such a set it attempts to find instructions that are independent and can therefore be executed in parallel via the constituent pipelines - integer vs. floating point, for example.

In the earlier discussion of pipelining we introduced the idea of an instruction queue.
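The dispatch unit's search for independent instructions can be sketched with a simple dependence test. This is an illustrative model only - the instruction encoding here (a destination register plus a set of source registers) is an assumption of the sketch, not any real machine's format:

```python
def independent(i1, i2):
    # Each instruction is modeled as (dest, {sources}).
    d1, s1 = i1
    d2, s2 = i2
    raw = d1 in s2   # i2 reads what i1 writes (true dependency)
    waw = d1 == d2   # both write the same register
    war = d2 in s1   # i2 overwrites a register i1 still reads
    return not (raw or waw or war)

# r3 = r1 + r2 and r5 = r1 * r4 are independent: dual-issue is safe.
print(independent(("r3", {"r1", "r2"}), ("r5", {"r1", "r4"})))  # True
# r3 = r1 + r2 and r6 = r3 - r4 have a true dependency on r3: serialize.
print(independent(("r3", {"r1", "r2"}), ("r6", {"r3", "r4"})))  # False
```

A real issue stage would apply checks of roughly this kind across the whole fetched group, in hardware, within a single cycle.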

To fully utilize the instruction queue, the processor must be able to fetch multiple instructions at the same time from the cache. For superscalar processors such a scheme is essential. Multiple-issue operation requires a wider cache bus and multiple execution units; the separate execution units typically support integer instructions and floating point instructions.

The relative performance of four of the architectures is illustrated in the following diagram.

[Timing diagram: instructions I1..I4 flowing through F, D, E, W stages over successive clock cycles for the scalar, pipelined scalar, superpipelined scalar, and superscalar pipelined designs.]

Examining the superpipelined and superscalar flows above:

Superpipelined
We see two pipeline stages completing per clock cycle. Stated alternately, the functions performed at each stage are split into two nonoverlapping parts,

each of which can execute in one half clock cycle. Such a pipeline is said to be of degree 2.

Superscalar
The design is capable of executing two instances of each stage in parallel. Higher-degree implementations of each approach are possible.

Both designs illustrated above have the same number of instructions executing at the same time in steady state; the superpipelined design lags at program start and at branches.

The high-level organization of a superscalar machine with pipelined functional units is given in the accompanying diagram.

[Block diagram: two functional units - integer and floating point - each fed by its own register file (integer register file, floating point register file), both connected to memory.]

A finer grained view gives the following simple architecture.

[Block diagram: cache feeding a fetch unit and instruction queue; a dispatch unit issuing to an integer unit and a floating point unit; buffers; write back.]

There's Danger Ahead
Earlier studies taught us about the impact of hazards on performance. In a superscalar processor the effects are more pronounced.
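The relative throughput of these designs can be illustrated with back-of-the-envelope cycle counts. This is a deliberately simplified model, assuming a 4-stage (F, D, E, W) pipeline with no hazards or branches; the formulas are illustrative, not exact for any real machine:

```python
import math

def scalar_cycles(n, k=4):
    # Pure scalar: each instruction completes all k stages before the next starts.
    return n * k

def pipelined_cycles(n, k=4):
    # Single pipeline: fill the pipe once, then one completion per cycle.
    return k + (n - 1)

def superpipelined_cycles(n, k=4, degree=2):
    # Doubled internal clock: 'degree' stage-halves complete per external cycle.
    return k + (n - 1) / degree

def superscalar_cycles(n, k=4, width=2):
    # 'width' instructions occupy each stage in parallel.
    return k + math.ceil(n / width) - 1

print(scalar_cycles(8))          # 32
print(pipelined_cycles(8))       # 11
print(superpipelined_cycles(8))  # 7.5
print(superscalar_cycles(8))     # 7
```

Even in this toy model the superpipelined and superscalar counts converge in steady state, with the superpipelined design slightly behind, consistent with the comparison above.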

Some tools to address the problems:
- Identify and handle dependent instructions as discussed earlier.
- Reorder instructions - the instruction ordering may differ from the original code.
- Eliminate unnecessary dependencies by using additional registers and renaming register references.
- Utilize traditional branch prediction methods to improve efficiency.

In the design above, the compiler could seek to interleave integer and floating point instructions, which could facilitate keeping both execution units busy most of the time. Assume three clock cycles for a floating point operation; with no hazards we could achieve an instruction flow as follows.

[Timing diagram: I1 (Fadd), I2 (Iadd), I3 (Fsub), I4 (Isub); the floating point instructions occupy the execute stage for three cycles while the integer instructions interleave around them.]

Let's now look at some of the fundamental limitations facing systems implementing instruction-level parallelism. We will limit the discussion to two integer execution units; the base case has no data dependencies.

Data Dependencies

[Timing diagram: with no dependencies, I1 (Iadd) and I2 (Isub) flow through F, D, E, W in lockstep.]

Problem 1 - Inter-Instruction Data Dependencies
Observe in the diagram above that instructions are dispatched in program order but executed out of order.
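The integer/floating-point interleaving mentioned above can be sketched as a trivial scheduling pass. This is a hypothetical illustration of the idea, not a real compiler algorithm, and it ignores any dependencies between the two instruction streams:

```python
def interleave(int_instrs, fp_instrs):
    # Alternate floating point and integer instructions so that, with no
    # hazards, both execution units can be kept busy on successive cycles.
    out = []
    i = f = 0
    while i < len(int_instrs) or f < len(fp_instrs):
        if f < len(fp_instrs):
            out.append(fp_instrs[f]); f += 1
        if i < len(int_instrs):
            out.append(int_instrs[i]); i += 1
    return out

# Reproduces the ordering in the flow above: Fadd, Iadd, Fsub, Isub.
print(interleave(["Iadd", "Isub"], ["Fadd", "Fsub"]))
```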

This can lead to problems if dependencies exist amongst the instructions: if I2 depends upon results from I1, I2 must be delayed until I1 completes. If handled as discussed earlier, there is no reason for problems or potential delays. With an inter-instruction data dependency the flow diagram becomes:

[Timing diagram: I1 (Iadd) flows F, D, E, W; I2 (Isub) needed a data word that I1 was modifying and must be delayed by one cycle.]

Problem 2 - Procedural Dependencies
Consider a branch instruction. The instructions following the branch statement, whether it is taken or not, depend upon the branch and cannot be executed until the branch has executed. Such a situation leads to the following instruction flow:

[Timing diagram: I1 (Iadd) and I2 (Ibr) proceed normally; I3 through I6 are delayed until the branch resolves.]

Problem 3 - Resource Conflict
A resource conflict arises when two or more instructions require the same resource at the same time. Examples: memories,

caches, busses, and execution units. From the pipeline's perspective a resource conflict and a data dependency look similar. The major difference: a resource can be duplicated, but a true data dependency cannot. A conflict can also be addressed by pipelining the execution unit. A resource conflict affects instruction flow as shown in the next diagram.

[Timing diagram: I1 (Iadd) proceeds; I2 (Isub) is delayed while waiting for the contended resource.]

Problem 4 - Exceptions to Normal Flow
Exceptions present a bit more of a challenge. Examples include a bus error, an illegal opcode, divide by zero, and interrupts.

4a. Exceptions
Consider the instruction flow above and let I2 depend upon the results of I1; I2 completes at time t4. If I1 causes an exception, the program is in an inconsistent state: the PC points to the instruction that caused the exception, yet succeeding instruction(s) were executed to completion. If this is permitted, the processor has imprecise exceptions. To ensure a consistent state under an exception, instruction results must be written in program order. Here we must delay the I2 write until time t6,

which implies that the execution unit must retain the result until t6, which in turn delays acceptance of I4 for execution until t6. This is illustrated in the next diagram.

[Timing diagram: I1 (Fadd), I2 (Iadd), I3 (Fsub), I4 (Isub); I2's write is held back to t6 so that results are written in program order, delaying I4.]

If an exception occurs during execution, all subsequent partially executed instructions must be discarded. This is called a precise exception.

4b. Interrupts
When an interrupt occurs, the dispatch unit must stop reading new instructions from the instruction queue, and the instructions remaining in the queue must be discarded. All pending instructions continue to completion. The number of pending instructions is not deterministic; thus we get variation in the response time to an interrupt.

Problem 5 - Execution Completion
If out-of-order execution can be utilized, the execution unit can be permitted to execute instructions as soon as possible. However, the constraint of program-order completion to support precise exceptions creates a conflict: a precise exception requires a consistent state before and after the exception. Examining the root cause of the problem: the delayed storage of results.
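The precise-exception rule above - discard the faulting instruction and every younger, partially executed instruction - can be sketched as follows. The in-flight window is modeled as a simple list in program order, which is an assumption of the sketch, not a hardware structure:

```python
def precise_flush(in_flight, faulting):
    # Instructions older than the faulting one are allowed to complete;
    # the faulting instruction and everything younger is discarded.
    idx = in_flight.index(faulting)
    return in_flight[:idx]

window = ["I1", "I2", "I3", "I4"]      # in flight, program order
print(precise_flush(window, "I2"))     # ['I1']
```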

Such delays place demands on the resources they block. From previous work we can incorporate temporary storage to briefly hold a result so it can be written later, enabling the resource to be freed up. The contents of the temp registers are appropriately transferred to the permanent registers later, ensuring the desired program order. We can modify the instruction flow from above to reflect temporary storage:

[Timing diagram: I1 (Fadd), I2 (Iadd), I3 (Fsub), I4 (Isub); I2 and I4 write first to temp registers (TW2, TW4) and the results are copied to the permanent registers (W2, W4) later, in program order.]

The temp register now serves as a surrogate for the permanent register and is treated as the permanent register until the transfer occurs. Assume the target for W2 is register R1: temp register TW2 assumes that identity during times t6 and t7, and its content would be forwarded to any target that needed the value from R1 during that time. Such a technique is called register renaming. The surrogate identity only applies to instructions that follow I2 in program order; all instructions that need R1 and precede I2 will use the real R1, whose value will not have changed yet.

When out-of-order execution is permitted, control must be implemented to ensure in-order commitment. Such a scheme utilizes a queue called the reorder buffer.
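Register renaming can be sketched with a simple rename map. An illustrative model: the temp-register names (T0, T1, ...) and the map itself are assumptions of the sketch, not the hardware's actual structures:

```python
class Renamer:
    def __init__(self):
        self.map = {}       # architectural register -> current surrogate
        self.next_tmp = 0

    def read(self, reg):
        # Readers that follow the renamed write see the surrogate;
        # earlier instructions would still use the real register.
        return self.map.get(reg, reg)

    def write(self, reg):
        # Every new write gets a fresh temp register, removing the
        # WAW/WAR name dependencies on the architectural register.
        tmp = "T{}".format(self.next_tmp)
        self.next_tmp += 1
        self.map[reg] = tmp
        return tmp

r = Renamer()
print(r.write("R1"))   # T0: I2's result for R1 is held in temp T0
print(r.read("R1"))    # T0: later readers are forwarded the surrogate
print(r.write("R1"))   # T1: a later write to R1 gets its own temp
```

When a renamed instruction retires, the surrogate's content would be copied to the real register and the temp freed - which is exactly where the reorder buffer, described next, comes in.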

The reorder buffer determines which instruction(s) should be committed next. Instructions are entered into the queue in program order. When an instruction reaches the head of the queue and its execution has completed, its results are copied from the temp registers to the permanent registers, the instruction is removed from the queue, and all resources assigned to the instruction - including the temp registers - are released. The instruction is designated retired.

An important inference: because an instruction can only be retired from the head of the queue, all preceding dispatched instructions must already have been retired. Thus instructions may complete execution out of order, but they are retired in program order.

Problem 6 - Dispatching
As we find in various operating system scheduling algorithms, when an instruction is dispatched the dispatch unit must ensure that all necessary resources are available - for example, any temp registers and the proper location in the reorder buffer. The possibility of deadlock exists. Consider the following sequence of events:
- I2 is delayed because of a cache miss.
- The delay for I2 results in a delay for I4, since the integer execution unit is allocated to I2.
- I4 shares no resources with I5, so I5 is dispatched and executed; its result is temporarily held.
- I2 finishes, then I4 finishes. All is good.

Now consider a slight change in events; assume a single temp register:
- I2 is delayed because of a cache miss.

- The delay for I2 results in a delay for I4, since the integer execution unit is allocated to I2.
- I4 needs the temp register.
- I5 is dispatched and executed; its result is temporarily held in the temp register.
- I2 finishes.
- I4 is blocked by its need for the temp register. The temp register will be freed when I5 is retired, but I5 cannot be retired and free its resources until I4 has retired.
All is not good - we have a deadlock.

A Quick Look at a SPARC
The SPARC architecture is the basis for the processors used in Sun workstations; one implementation is called the UltraSPARC II. SPARC - Scalable Processor ARChitecture - was first announced in 1987, based upon ideas developed at Berkeley in the early 1980s. The specification is controlled by an international consortium, which has introduced new versions every few years; the latest version is SPARC-V9. Now, with Sun being sold, the future of SPARC is uncertain. It is a RISC-style architecture. The main building blocks of the UltraSPARC II are given as follows.

[Block diagram: UltraSPARC II. A system bus and external cache connect through the memory management unit (with itlb and dtlb); a prefetch and dispatch unit with I-cache and instruction buffer feeds the pipelines; load and store queues connect to the D-cache; floating point registers feed the floating point execution unit and integer registers feed the integer execution unit.]

The two execution units comprise two parallel pipelines of six stages each. Stages 0..3 perform the operation specified by the instruction; stages 4..5 check for exceptions and store the result of the instruction. The pipeline organization is given as:

Fetch - Decode - Instruction Buffer - Group, feeding:
  Integer pipes (two):        E   C   N1  N2  Ck  W
  Floating point pipes (two): R   E1  E2  E3  Ck  W
(E - Execute, N - Delay, Ck - Check, W - Write, R - Register)

Observe that four instruction pipes operate in parallel.

The Prefetch and Dispatch unit fetches up to 4 instructions from the instruction cache and partially decodes them, determining whether an instruction is a branch. It uses 4-state branch prediction, as discussed earlier. For each 4 instructions in the instruction cache there is a tag field called next address; the predicted value is recorded in the next address field. The fetched instructions are stored in the instruction buffer, which will hold up to 12 instructions.

Grouping Block
Selects a group of up to 4 instructions to be executed in parallel and dispatches them to the integer and floating point units.

E Stage
This is the execute stage; ALU operations etc. are performed.

C Stage
Parts of the buffer are transferred to a register file called the Annex, which contains the temp registers used in renaming. Generation of the condition codes takes place and flags are set.

N1 and N2 Stages
These are simply delays; the intent is to equalize the temporal lengths of the integer and floating point pipelines.

Ck Stage
Checks for exception conditions and interrupts.

R Stage
Register operands are fetched in the floating point unit.

E1..E3 Stages
The floating point operation is executed.

W Stage
Results are written to either the registers or the cache.

Summary
We reviewed several common architectures based upon scalar processing and extended the concept of pipelining to multiple pipes. We introduced and explored the superscalar architecture based upon such an approach, identified the key elements of a superscalar processor, identified the strengths and limitations of the superscalar architecture, discussed at a high level methods for dealing with those limitations, and examined at a high level a real-world superscalar implementation, the UltraSPARC II.