Superscalar Machines. Characteristics of superscalar processors

Similar documents
Superscalar Processors

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

What is Superscalar? CSCI 4717 Computer Architecture. Why the drive toward Superscalar? What is Superscalar? (continued) In class exercise

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Superscalar Processors Ch 14

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle.

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

Case Study IBM PowerPC 620

Hardware-based Speculation

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

The Processor: Instruction-Level Parallelism

Multiple Instruction Issue. Superscalars

SUPERSCALAR AND VLIW PROCESSORS

Pentium IV-XEON. Computer architectures M

Superscalar Processors

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Instruction Pipelining

CS425 Computer Systems Architecture

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Handout 2 ILP: Part B

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Processor (IV) - advanced ILP. Hwansoo Han

TECH. CH14 Instruction Level Parallelism and Superscalar Processors. What is Superscalar? Why Superscalar? General Superscalar Organization

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Advanced Instruction-Level Parallelism

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Pipelined Processor Design. EE/ECE 4305: Computer Architecture University of Minnesota Duluth By Dr. Taek M. Kwon

EC 513 Computer Architecture

The Pentium II/III Processor Compiler on a Chip


CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CS146 Computer Architecture. Fall Midterm Exam

Instruction Level Parallelism

CS 2410 Mid term (fall 2018)

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

Advanced Computer Architecture

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Portland State University ECE 588/688. Cray-1 and Cray T3E

Computer Science 146. Computer Architecture

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

Lec 25: Parallel Processors. Announcements

Like scalar processor Processes individual data items Item may be single integer or floating point number. - 1 of 15 - Superscalar Architectures

Vector and Parallel Processors. Amdahl's Law

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Precise Exceptions and Out-of-Order Execution. Samira Khan

Next Generation Technology from Intel Intel Pentium 4 Processor

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Chapter 06: Instruction Pipelining and Parallel Processing. Lesson 14: Example of the Pipelined CISC and RISC Processors

The Microarchitecture Level

Keywords and Review Questions

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen

Chapter 13 Reduced Instruction Set Computers

Portland State University ECE 588/688. IBM Power4 System Microarchitecture

Multithreaded Processors. Department of Electrical Engineering Stanford University

Course on Advanced Computer Architectures

CS 152, Spring 2011 Section 8

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

COMPUTER ORGANIZATION AND DESI

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Organisasi Sistem Komputer

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

IF1 --> IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB. add $10, $2, $3 IF1 IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB sub $4, $10, $6 IF1 IF2 ID1 ID2 --> EX1 EX2 ME1 ME2 WB

Basic Computer Architecture

Wide Instruction Fetch

Chapter 4 The Processor (Part 4)

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

Hardware-Based Speculation

Hyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.

Chapter 9 Pipelining. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Static vs. Dynamic Scheduling

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

5008: Computer Architecture

CS433 Homework 2 (Chapter 3)

BOBCAT: AMD S LOW-POWER X86 PROCESSOR

Instruction Pipelining

Lecture 19: Instruction Level Parallelism

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units

Dynamic Scheduling. CSE471 Susan Eggers 1

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Copyright 2012, Elsevier Inc. All rights reserved.

Super Scalar. Kalyan Basu March 21,

Design and Implementation of a Super Scalar DLX based Microprocessor

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

Microarchitecture Overview. Performance

Lecture 9: Superscalar processing

CS433 Homework 2 (Chapter 3)

November 7, 2014 Prediction

Transcription:

Superscalar Machines Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any performance advantage yet there is still space for more circuitry on-chip An alternative is the multiple-issue processor I.e. issue more than one instruction each clock cycle that is, start more than one instruction Such processors are called Superscalar processors Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 74 Characteristics of superscalar processors Superscalar processors issue more than one instruction each cycle the number of instructions issued will depend on the instructions in the instruction stream instructions are often re-ordered to fit the processor architecture better typically, between 1 and 8 instructions can be issued each cycle The simplest possibility is a machine which issues one integer instruction one floating-point instruction per clock cycle but this only gives appreciable improvements if there are many FP instructions to be executed. Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 75

Ordering instructions 1 Instructions are produced in a particular order by the compiler, but need not (necessarily) be executed in that order. They may be issued in order and completed in order issued in order, but completed out of order issued out-of-order, but completed in order issued out of order and completed out of order. Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 76 Ordering instructions 2 The best order for instructions in a particular superscalar architecture depends on the architecture itself the precise dependencies between instructions the actual order they are executed in may be set up by the compiler in which case it must know the architecture complex code-generating compiler or the instructions may be reordered by the CPU itself many instructions are fetched, but the precise order of execution is decided at execution time this is more common Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 77

Register renaming one of the main techniques for dealing with data hazards is register renaming registers are dynamically allocated by the hardware out of a register set which is larger than the programmer s model of the architecture subsequently executing instructions use whichever register is correct for them. Usually the actual register which a programmer s register refers to is re-allocated from time to time Limits to multiple issue machines there are inherent limits to the extent to which sequential instructions can be executed concurrently these show up as data and control hazards there are limits to the amount of underlying hardware that can be produced these show up as structural hazards Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 78 Branch Prediction Since branches can cause major delays in pipelines, some machines attempt to predict what the outcome of a branch will be Like the 68040, the Pentium attempts to limit the problems which branches cause by fetching instructions from both the target address and the branch address However, it predicts which will occur from what has happened previously and (speculatively) executes from that address. Other techniques can use the compiler to predict whether the branch is more or less likely to be taken Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 79

Superscalar architecture DLX: superscalar version Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 80 Superscalar DLX Features - Pipelined, superscalar Two instructions per clock Branch-Target buffer Reorder-Buffer to commit instructions in program order Precise exception processing Four execution units with Reservation-Station Branch-Resolve unit Arithmetic-Logic unit Multiply-Divide unit Load-Store unit Write-Buffer Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 81

Example 2: DEC ALPHA AXP 21064 DEC ALPHA AXP 21064 superscalar architecture with multiple functional units 7 stage 5ns/stage integer pipeline the CPU can issue simultaneously any load/store in parallel with any operate an integer operate in parallel with a FP operate a floating operate and a floating branch an integer operate and an integer branch the most recent version of the ALPHA can issue 4 instructions simultaneously. 4 integer execution units 2 FP execution units Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 82 Pentium architecture Superscalar architecture 2 independent integer pipelines one Floating point pipeline but control unit can issue either 2 integer instructions or 1 (occasionally 2) floating point instructions Branch prediction by loading code to cache from target address separate on-chip instruction and data caches both 8Kbytes long Integer pipelines are not identical u pipeline includes barrel shifter, but v is simpler (and can t execute all integer instructions) Some elements of microcoded control using control ROM. Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 83

Pentium 3 Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 84 Pentium 3 Superscalar processor 3 integer pipelines 2 FP pipelines 16KB instruction cache 16KB data cache L2 cache on chip Microcode system to deal with CISC instructions some x86 instructions -> 1 microop others into 2 or 3 microops others into many more Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 85

Pentium 4 Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 86 Pentium P4 Similar to P3, but designed for ultra-high clock rates Quad-pumped ( pipelined ) system bus 20 stage pipeline (!) P3 has 10 Sophisticated branch prediction using BTB crucial for performance with long piplines. Ultrafast integer ALU (2xclock speed) for simple integer operations Execution trace Cache of uops instead of Instruction cache Smaller Data cache to allow 1-cycle access New FP instructions but is anyone using them? Copyright 1998, 2002 Leslie S. Smith 31R6 - Computer Design Slide 87