William Stallings, Computer Organization and Architecture, 8th Edition. Chapter 14: Instruction Level Parallelism and Superscalar Processors

What is Superscalar?
- Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently
- Equally applicable to RISC & CISC
- In practice usually applied to RISC

Why Superscalar?
- Most operations are on scalar quantities (see RISC notes)
- Speeding up these common operations yields an overall improvement

General Superscalar Organization

Superpipelined
- Many pipeline stages need less than half a clock cycle
- Doubling the internal clock speed completes two tasks per external clock cycle
- Superscalar, by contrast, allows parallel fetch and execute

Superscalar v Superpipeline

Limitations
- Instruction-level parallelism is exploited via compiler-based optimisation and hardware techniques
- Limited by:
  - True data dependency
  - Procedural dependency
  - Resource conflicts
  - Output dependency
  - Antidependency

True Data Dependency
ADD r1, r2   (r1 := r1 + r2)
MOVE r3, r1  (r3 := r1)
- Can fetch and decode the second instruction in parallel with the first
- Cannot execute the second instruction until the first has finished
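The read-after-write check above can be made mechanical. A minimal sketch, assuming a hypothetical (destination, source-set) encoding for each instruction (not part of the slides):

```python
# Minimal sketch: detect a true (read-after-write) data dependency.
# Each instruction is modeled as (destination, set-of-source-registers);
# this encoding is illustrative, not from the original slides.

def raw_dependent(first, second):
    """True if `second` reads a register that `first` writes."""
    dest_first, _ = first
    _, srcs_second = second
    return dest_first in srcs_second

# ADD r1, r2   (r1 := r1 + r2)  -> writes r1, reads r1 and r2
add = ("r1", {"r1", "r2"})
# MOVE r3, r1  (r3 := r1)       -> writes r3, reads r1
move = ("r3", {"r1"})

print(raw_dependent(add, move))   # True: the MOVE must wait for the ADD
```

A real issue stage performs essentially this comparison, in hardware, between every pair of instructions in the issue window.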

Procedural Dependency
- Cannot execute instructions after a branch in parallel with instructions before the branch
- Also, if instruction length is not fixed, instructions must be decoded to find out how many fetches are needed, which prevents simultaneous fetches

Resource Conflict
- Two or more instructions require the same resource at the same time, e.g. two arithmetic instructions
- Can duplicate resources, e.g. provide two arithmetic units

Effect of Dependencies

Design Issues
- Instruction-level parallelism
  - Instructions in a sequence are independent, so execution can be overlapped
  - Governed by data and procedural dependency
- Machine parallelism
  - Ability to take advantage of instruction-level parallelism
  - Governed by the number of parallel pipelines

Instruction Issue Policy
Three orderings matter:
- Order in which instructions are fetched
- Order in which instructions are executed
- Order in which instructions change registers and memory

In-Order Issue, In-Order Completion
- Issue instructions in the order they occur
- Not very efficient
- May fetch more than one instruction at a time
- Instructions must stall if necessary
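The stall behaviour of in-order issue can be shown with a toy scheduler. A sketch assuming a hypothetical issue width of two and named functional units (all invented for illustration):

```python
# Sketch of in-order issue: up to two instructions issue per cycle, but an
# instruction stalls (and stalls everything younger than it) when its
# functional unit is already taken this cycle.
# Unit names and the instruction list are illustrative.

def in_order_issue(program, issue_width=2):
    schedule = []
    cycle, used = 0, set()
    for name, unit in program:
        # stall until the required unit is free and an issue slot is open
        while unit in used or len(used) >= issue_width:
            cycle, used = cycle + 1, set()
        used.add(unit)
        schedule.append((name, cycle))
    return schedule

# I1 and I2 contend for the single adder, so I2 waits a cycle; I3 could
# have issued in cycle 0 but must not overtake I2 (in-order issue)
prog = [("I1", "adder"), ("I2", "adder"), ("I3", "multiplier")]
print(in_order_issue(prog))   # [('I1', 0), ('I2', 1), ('I3', 1)]
```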

In-Order Issue In-Order Completion (Diagram)

In-Order Issue, Out-of-Order Completion
Output dependency example:
R3 := R3 + R5; (I1)
R4 := R3 + 1;  (I2)
R3 := R5 + 1;  (I3)
- I2 depends on the result of I1 - a true data dependency
- If I3 completes before I1, the result left in R3 will be wrong - an output (write-write) dependency

In-Order Issue Out-of-Order Completion (Diagram)

Out-of-Order Issue, Out-of-Order Completion
- Decouple the decode pipeline from the execution pipeline
- Can continue to fetch and decode until the decode pipeline is full
- When a functional unit becomes available, an instruction can be executed
- Since instructions have been decoded, the processor can look ahead

Out-of-Order Issue Out-of-Order Completion (Diagram)

Antidependency
Read-write (write-after-read) dependency:
R3 := R3 + R5; (I1)
R4 := R3 + 1;  (I2)
R3 := R5 + 1;  (I3)
R7 := R3 + R4; (I4)
- I3 cannot complete before I2 starts, as I2 needs the value in R3 and I3 changes R3
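All three dependency types between a pair of instructions can be classified mechanically. A sketch assuming an illustrative (destination, source-set) encoding per instruction:

```python
# Sketch: classify hazards between an earlier and a later instruction.
# RAW = true data dependency, WAR = antidependency, WAW = output dependency.
# The (destination, source-set) encoding is illustrative, not from the slides.

def hazards(earlier, later):
    d1, s1 = earlier
    d2, s2 = later
    found = set()
    if d1 in s2:
        found.add("RAW")   # later reads what earlier writes
    if d2 in s1:
        found.add("WAR")   # later overwrites what earlier still reads
    if d1 == d2:
        found.add("WAW")   # both write the same register
    return found

i1 = ("R3", {"R3", "R5"})   # I1: R3 := R3 + R5
i2 = ("R4", {"R3"})         # I2: R4 := R3 + 1
i3 = ("R3", {"R5"})         # I3: R3 := R5 + 1

print(hazards(i2, i3))              # {'WAR'}: the antidependency above
print("WAW" in hazards(i1, i3))     # True: I1 and I3 both write R3
```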

Reorder Buffer
- Temporary storage for results
- Commit to register file in program order
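The in-order commit rule can be sketched in a few lines; the class and method names below are invented for illustration, not taken from the text:

```python
# Sketch of a reorder buffer: results may arrive out of order, but they
# retire to the "register file" strictly in program order.
# Names (ReorderBuffer, finish, commit) are illustrative.

class ReorderBuffer:
    def __init__(self, program):
        self.order = list(program)   # instruction tags in program order
        self.results = {}            # tag -> result, filled out of order
        self.committed = []          # what has reached architectural state

    def finish(self, tag, result):
        self.results[tag] = result   # execution completed (any order)

    def commit(self):
        # retire from the head only, and only while the head is ready
        while self.order and self.order[0] in self.results:
            tag = self.order.pop(0)
            self.committed.append((tag, self.results.pop(tag)))

rob = ReorderBuffer(["I1", "I2", "I3"])
rob.finish("I3", 6)      # I3 completes first, out of order
rob.commit()             # nothing retires: I1 is still pending
rob.finish("I1", 8)
rob.finish("I2", 9)
rob.commit()
print(rob.committed)     # [('I1', 8), ('I2', 9), ('I3', 6)] - program order
```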

Register Renaming
- Output dependencies and antidependencies occur because register contents may not reflect the correct ordering from the program
- May result in a pipeline stall
- Solution: registers are allocated dynamically, i.e. not tied to the specifically named architectural registers

Register Renaming Example
R3b := R3a + R5a  (I1)
R4b := R3b + 1    (I2)
R3c := R5a + 1    (I3)
R7b := R3c + R4b  (I4)
- A name without a subscript refers to the logical register in the instruction
- A name with a subscript is the hardware register actually allocated
- Note the three versions R3a, R3b, R3c
Alternative: Scoreboarding
- Bookkeeping technique
- Allows an instruction to execute whenever it is not dependent on previous instructions and there are no structural hazards
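The renaming in the example above can be reproduced mechanically. In the sketch below the version-letter scheme (a = initial value, then b, c, ... for successive writes) follows the slide, while the instruction encoding is invented for illustration:

```python
# Sketch of register renaming: every write to a logical register allocates
# a fresh physical version, eliminating WAR and WAW hazards.
# Letter suffixes follow the slide (a = initial, then b, c, ...).

LETTERS = "abcdefgh"

def rename(program):
    version = {}   # logical register -> index of latest version (0 = 'a')
    renamed = []
    for dest, srcs in program:
        # sources read the most recently written version (initially 'a')
        new_srcs = [s + LETTERS[version.get(s, 0)] for s in srcs]
        version[dest] = version.get(dest, 0) + 1   # allocate a new version
        renamed.append((dest + LETTERS[version[dest]], new_srcs))
    return renamed

prog = [("R3", ["R3", "R5"]),   # I1: R3 := R3 + R5
        ("R4", ["R3"]),         # I2: R4 := R3 + 1
        ("R3", ["R5"]),         # I3: R3 := R5 + 1
        ("R7", ["R3", "R4"])]   # I4: R7 := R3 + R4

for dest, srcs in rename(prog):
    print(dest, ":=", srcs)     # destinations R3b, R4b, R3c, R7b, as above
```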

Machine Parallelism
- Duplication of resources, out-of-order issue, renaming
- Duplicating functional units is not worthwhile without register renaming
- The instruction window needs to be large enough (more than 8 instructions)

Speedups of Machine Organizations Without Procedural Dependencies

Branch Prediction
- The 80486 fetches both the next sequential instruction after a branch and the branch target instruction
- Gives a two-cycle delay if the branch is taken

RISC - Delayed Branch
- Calculate the result of the branch before unusable instructions are pre-fetched
- Always execute the single instruction immediately following the branch
- Keeps the pipeline full while fetching a new instruction stream
- Not as good for superscalar:
  - Multiple instructions would need to execute in the delay slot
  - Instruction dependence problems
- So superscalar machines revert to branch prediction

Superscalar Execution

Superscalar Implementation
- Simultaneously fetch multiple instructions
- Logic to determine true dependencies involving register values
- Mechanisms to communicate these values
- Mechanisms to initiate multiple instructions in parallel
- Resources for parallel execution of multiple instructions
- Mechanisms for committing process state in correct order

Pentium 4
- 80486 - plain CISC
- Pentium - some superscalar components: two separate integer execution units
- Pentium Pro - full-blown superscalar
- Subsequent models refine and enhance the superscalar design

Pentium 4 Block Diagram

Pentium 4 Operation
- Fetch instructions from memory in the order of the static program
- Translate each instruction into one or more fixed-length RISC instructions (micro-operations)
- Execute micro-ops on a superscalar pipeline; micro-ops may be executed out of order
- Commit results of micro-ops to the register set in original program-flow order
- Outer CISC shell with an inner RISC core
- Inner RISC core pipeline has at least 20 stages; some micro-ops require multiple execution stages
- Much longer than the five-stage pipeline on x86 processors up to the Pentium
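The CISC-to-micro-op translation step can be illustrated with a toy decoder. The tuple format and micro-op names below are invented for illustration; they are not Intel's actual encodings:

```python
# Toy sketch of the outer CISC shell: one variable-complexity instruction
# is translated into fixed-length, RISC-like micro-operations.
# The tuple format and micro-op names are invented for illustration.

def to_micro_ops(instr):
    op, dst, src = instr
    if src.startswith("["):           # memory operand: split the load out
        addr = src.strip("[]")
        return [("LOAD", "tmp0", addr),
                (op, dst, "tmp0")]    # the ALU op now works on registers
    return [instr]                    # register-only: already one micro-op

# ADD eax, [mem] becomes a load micro-op plus a register-register add
print(to_micro_ops(("ADD", "eax", "[mem]")))
# -> [('LOAD', 'tmp0', 'mem'), ('ADD', 'eax', 'tmp0')]
```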

Pentium 4 Pipeline

Pentium 4 Pipeline Operation (1)

Pentium 4 Pipeline Operation (2)

Pentium 4 Pipeline Operation (3)

Pentium 4 Pipeline Operation (4)

Pentium 4 Pipeline Operation (5)

Pentium 4 Pipeline Operation (6)

ARM CORTEX-A8
- ARM refers to the Cortex-A8 as an application processor: an embedded processor running a complex operating system
- Wireless, consumer and imaging applications: mobile phones, set-top boxes, gaming consoles, automotive navigation/entertainment systems
- Three functional units
- Dual, in-order-issue, 13-stage pipeline
  - Keeps the power required to a minimum: out-of-order issue needs extra logic, consuming extra power
- Figure 14.11 shows the details of the main Cortex-A8 pipeline
- Separate SIMD (single-instruction-multiple-data) unit with a 10-stage pipeline

ARM Cortex-A8 Block Diagram

Instruction Fetch Unit
- Predicts the instruction stream and fetches instructions from the L1 instruction cache
- Up to four instructions fetched per cycle, into a buffer for the decode pipeline
- The fetch unit includes the L1 instruction cache
- Instruction fetches are speculative: a branch or exceptional instruction causes a pipeline flush
- Stages:
  - F0: the address generation unit generates the virtual address - normally the next sequential address, but it can also be a branch target address
  - F1: the address is used to fetch instructions from the L1 instruction cache; in parallel, the fetch address is used to access the branch prediction arrays
  - F2: instruction data are placed in the instruction queue; if a branch is predicted, the new target address is sent to the address generation unit
- Two-level global-history branch predictor: Branch Target Buffer (BTB) and Global History Buffer (GHB)
- Return stack to predict subroutine return addresses
- Can fetch and queue up to 12 instructions; issues instructions two at a time
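A GHB-style predictor can be sketched as a table of 2-bit saturating counters indexed by a global history register. The history length and table size below are illustrative, not the actual Cortex-A8 parameters:

```python
# Sketch of a two-level global-history branch predictor: a shift register
# of recent branch outcomes indexes a table of 2-bit saturating counters.
# History length and table size are illustrative, not Cortex-A8's.

class GlobalHistoryPredictor:
    def __init__(self, history_bits=4):
        self.history = 0
        self.mask = (1 << history_bits) - 1
        # 2-bit counters, initialized to 1 ("weakly not taken")
        self.counters = [1] * (1 << history_bits)

    def predict(self):
        return self.counters[self.history] >= 2   # True = predict taken

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        # shift the actual outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) & self.mask

bp = GlobalHistoryPredictor()
for _ in range(8):        # train on a branch that is always taken
    bp.update(True)
print(bp.predict())       # True: the predictor has learned "taken"
```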

Instruction Decode Unit
- Decodes and sequences all instructions
- Dual pipeline structure, pipe0 and pipe1, so two instructions can progress at a time
- Pipe0 contains the older instruction in program order; if the instruction in pipe0 cannot issue, pipe1 will not issue
- Instructions progress in order
- Results are written back to the register file at the end of the execution pipeline
  - Prevents WAR hazards
  - Makes tracking WAW hazards and recovering from flush conditions straightforward
- The main concern of the decode pipeline is prevention of RAW hazards

Instruction Processing Stages
- D0: Thumb instructions are decompressed and a preliminary decode is performed
- D1: instruction decode is completed
- D2: instructions are written to and read from the pending/replay queue
- D3: contains the instruction scheduling logic
  - A scoreboard predicts register availability using static scheduling
  - Hazard checking
- D4: final decode for the control signals for the integer execute and load/store units

Integer Execution Unit
- Two symmetric ALU pipelines, an address generator for load and store instructions, and a multiply pipeline
- Pipeline stages:
  - E0: access the register file - up to six registers read for two instructions
  - E1: barrel shifter, if needed
  - E2: ALU function
  - E3: completes saturation arithmetic, if needed
  - E4: a change in control flow is prioritized and processed
  - E5: results are written back to the register file
- Multiply-unit instructions are routed to pipe0
  - Performed in stages E1 through E3; multiply accumulate operation in E4

Load/Store Pipeline
- Runs parallel to the integer pipeline
- E1: memory address generated from base and index register
- E2: address applied to the cache arrays
- E3: on a load, data are returned and formatted; on a store, data are formatted and ready to be written to the cache
- E4: updates the L2 cache, if required
- E5: results are written to the register file

ARM Cortex-A8 Integer Pipeline

SIMD and Floating-Point Pipeline
- SIMD and floating-point instructions pass through the integer pipeline and are processed in a separate 10-stage pipeline: the NEON unit
- Handles packed SIMD instructions
- Provides two types of floating-point support:
  - If implemented, a vector floating-point (VFP) coprocessor performs IEEE 754 floating-point operations
  - If not, separate multiply and add pipelines implement the floating-point operations

ARM Cortex-A8 NEON & Floating Point Pipeline

Required Reading
- Stallings, chapter 14
- Manufacturers' web sites
- IMPACT web site: research on predicated execution