Superscalar Processors

Similar documents
Superscalar Machines. Characteristics of superscalar processors

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Superscalar Processors Ch 14

CS425 Computer Systems Architecture

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

What is Superscalar? CSCI 4717 Computer Architecture. Why the drive toward Superscalar? What is Superscalar? (continued) In class exercise

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle.

Pentium IV-XEON. Computer architectures M

Hardware-based Speculation

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

Handout 2 ILP: Part B

Multiple Instruction Issue. Superscalars

Hyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

The Processor: Instruction-Level Parallelism

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Case Study IBM PowerPC 620

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design


Processor (IV) - advanced ILP. Hwansoo Han

Superscalar Processors

Next Generation Technology from Intel Intel Pentium 4 Processor

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Multithreaded Processors. Department of Electrical Engineering Stanford University

Advanced Processor Architecture

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CS 2410 Mid term (fall 2018)

Portland State University ECE 588/688. Cray-1 and Cray T3E

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Lec 25: Parallel Processors. Announcements

SUPERSCALAR AND VLIW PROCESSORS

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Advanced Instruction-Level Parallelism

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CS146 Computer Architecture. Fall Midterm Exam

EC 513 Computer Architecture

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Pipelined Processor Design. EE/ECE 4305: Computer Architecture University of Minnesota Duluth By Dr. Taek M. Kwon

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

CS 152, Spring 2011 Section 8

Processing Unit CS206T

Exploring the Effects of Hyperthreading on Scientific Applications

Keywords and Review Questions

The Pentium II/III Processor Compiler on a Chip

Parallelism. Execution Cycle. Dual Bus Simple CPU. Pipelining COMP375 1

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.

TECH. CH14 Instruction Level Parallelism and Superscalar Processors. What is Superscalar? Why Superscalar? General Superscalar Organization

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Advanced Computer Architecture

Portland State University ECE 588/688. IBM Power4 System Microarchitecture

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

BOBCAT: AMD S LOW-POWER X86 PROCESSOR

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Microarchitecture Overview. Performance

Intel released new technology call P6P

IF1 --> IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB. add $10, $2, $3 IF1 IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB sub $4, $10, $6 IF1 IF2 ID1 ID2 --> EX1 EX2 ME1 ME2 WB

Pipelining and Vector Processing

Microarchitecture Overview. Performance

Instruction Pipelining

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen

Instruction Pipelining

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Computer Science 146. Computer Architecture

Like scalar processor Processes individual data items Item may be single integer or floating point number. - 1 of 15 - Superscalar Architectures

Lecture 26: Parallel Processing. Spring 2018 Jason Tang

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

COMPUTER ORGANIZATION AND DESI

Lecture 9: Superscalar processing

Instruction Level Parallelism

CS 426 Parallel Computing. Parallel Computing Platforms

How to write powerful parallel Applications

IF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Precise Exceptions and Out-of-Order Execution. Samira Khan

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

Course on Advanced Computer Architectures

Chapter 13 Reduced Instruction Set Computers

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS 152, Spring 2012 Section 8

CS433 Homework 2 (Chapter 3)

CS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes

Announcement. Computer Architecture (CSC-3501) Lecture 25 (24 April 2008) Chapter 9 Objectives. 9.2 RISC Machines

Chapter 4 The Processor (Part 4)

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others

Hardware Speculation Support

Transcription:

Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance advantage yet there is still space for more circuitry on-chip Feature size still falling Improved purity in crystals means larger die size possible An alternative is the multiple-issue issue processor i.e. issue more than one instruction each clock cycle that is, start more than one instruction Such processors are called Superscalar processors Slide 82 Characteristics of superscalar processors Superscalar processors issue more than one instruction each cycle the number of instructions issued will depend on the instructions s in the instruction stream instructions are often re-ordered to fit the processor architecture better typically, between 1 and 8 instructions can be issued each cycle The simplest possibility is a machine which issues one integer instruction one floating-point instruction per clock cycle but this only gives appreciable improvements if there are many FP F instructions to be executed. Slide 83 Page 1 1

Ordering instructions 1 Instructions are produced in a particular order by the compiler, but need not (necessarily) be executed in that order. They may be issued in order and completed in order issued in order, but completed out of order issued out-of of-order, order, but completed in order issued out of order and completed out of order. Slide 84 Ordering instructions 2 The best order for instructions in a particular superscalar architecture depends on the architecture itself the precise dependencies between instructions the actual order they are executed in may be set up by the compiler in which case it must know the architecture complex code-generation phase in compiler Can be very effective or the instructions may be reordered by the CPU itself many instructions are fetched, but the precise order of execution n is decided at execution time this is more common (currently) Slide 85 Page 2 2

Register renaming One of the main techniques for dealing with data hazards is register renaming registers are dynamically allocated by the hardware out of a register set which is larger than the programmer s s model of the architecture subsequently executing instructions use whichever register is correct c for them. Usually the actual register which a programmer s s register refers to is re-allocated from time to time Limits to multiple issue machines there are inherent limits to the extent to which sequential instructions can be executed concurrently these show up as data and control hazards there are limits to the amount of underlying hardware that can be produced these show up as structural hazards can better compilers help? Slide 86 Branch Prediction Since branches can cause major delays in pipelines, some machines attempt to predict what the outcome of a branch will be Like the 68040, the Pentium attempts to limit the problems which branches cause by fetching instructions from both the target address and the branch address However, it predicts which will occur from what has happened previously and (speculatively) executes from that address. Other techniques can use the compiler to predict whether the branch is more or less likely to be taken Slide 87 Page 3 3

Superscalar architecture DLX: superscalar version Slide 88 Superscalar DLX Features - Pipelined, superscalar Two instructions per clock Branch-Target buffer Reorder-Buffer to commit instructions in program order Precise exception processing Four execution units with Reservation-Station Branch-Resolve unit Arithmetic-Logic unit Multiply-Divide unit Load-Store unit Write-Buffer Slide 89 Page 4 4

Example 2: DEC ALPHA AXP 21064 DEC ALPHA AXP 21064 superscalar architecture with multiple functional units 7 stage 5ns/stage integer pipeline the CPU can issue simultaneously any load/store in parallel with any operate an integer operate in parallel with a FP operate a floating operate and a floating branch an integer operate and an integer branch the most recent version of the ALPHA can issue 4 instructions simultaneously. 4 integer execution units 2 FP execution units Slide 90 Pentium architecture Superscalar architecture 2 independent integer pipelines one Floating point pipeline but control unit can issue either 2 integer instructions or 1 (occasionally 2) floating point instructions Branch prediction by loading code to cache from target address separate on-chip instruction and data caches both 8Kbytes long Integer pipelines are not identical u pipeline includes barrel shifter, but v is simpler (and can t t execute all integer instructions) Some elements of microcoded control using control ROM. Slide 91 Page 5 5

Pentium 3 Slide 92 Pentium 3 Superscalar processor 3 integer pipelines 2 FP pipelines 16KB instruction cache 16KB data cache L2 cache on chip Microcode system to deal with CISC instructions some x86 instructions -> > 1 microop others into 2 or 3 microops others into many more Slide 93 Page 6 6

Pentium 4 Slide 94 Pentium P4 Similar to P3, but designed for ultra-high clock rates Quad-pumped ( pipelined( pipelined ) ) system bus 20 stage pipeline (!) P3 has 10 Sophisticated branch prediction using BTB crucial for performance with long piplines. Ultrafast integer ALU (2xclock speed) for simple integer operations Execution trace Cache of uops instead of Instruction cache Smaller Data cache to allow 1-cycle 1 access New FP instructions but is anyone using them? Slide 95 Page 7 7

Hyperthreading Hyperthreading means sharing the execution resources between a number (normally 2) of sets of execution resources. Slide 96 Dual Core Dual core processors essentially have two complete processors on one silicon chip This allows an essentially dual processor machine to be built cheaply Sometimes (as in the Intel case) some caching (specifically L2 cache) is shared between the processors Note that this can be mixed with hyper threading Slide 97 Page 8 8