Multiple Instruction Issue: Superscalars

Multiple Instruction Issue

- Multiple instructions issued each cycle:
  - better performance: increased instruction throughput, CPI decreased below 1
  - greater hardware complexity, potentially longer wire lengths
  - a harder code-scheduling job for the compiler
- Superscalar processors:
  - instructions are scheduled for execution by the hardware
  - different numbers of instructions may be issued simultaneously
- VLIW (very long instruction word) processors:
  - instructions are scheduled for execution by the compiler
  - a fixed number of operations are formatted as one big instruction
  - usually LIW (3 operations) today

Superscalars

- Definition: a processor that can execute more than one instruction per cycle, for example an integer computation, a floating-point computation, a data transfer, and a transfer of control
- issue width = the number of issue slots, 1 slot per instruction
- not all types of instructions can be issued together (see the sketch below)
- hardware decides which instructions to issue
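A minimal MIPS sketch of the issue-slot restriction (the two-slot machine and the instruction choices are mine, not the slides'):

    addu R1, R2, R3     # ALU slot
    lw   R4, 0(R5)      # load/store slot: different slot types,
                        # so this pair can issue in the same cycle

    addu R1, R2, R3     # ALU slot
    subu R6, R7, R8     # also needs the ALU slot: structural
                        # conflict, so it waits for the next cycle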

2-way Superscalar

[Figure: 2-way superscalar datapath]

Superscalars Require

- instruction fetch:
  - fetching of multiple instructions at once
  - dynamic branch prediction, and fetching speculatively beyond conditional branches
- instruction issue:
  - methods for determining which instructions can be issued next
  - the ability to issue multiple instructions in parallel
- instruction commit:
  - methods for committing several instructions in fetch order
- duplicate and more complex hardware

In-order vs. Out-of-order Execution

In-order instruction execution:
- instructions are fetched, executed, and committed in compiler-generated order
- if one instruction stalls, all instructions behind it stall
- instructions are statically scheduled by the hardware: in compiler-generated order, the hardware decides how many of the next n instructions can be issued, where n is the superscalar issue width
- superscalars can have structural and data hazards within the n instructions
- 2 styles of static instruction scheduling:
  - dispatch buffer and instruction slotting (Alpha 21164)
  - shift register model (UltraSPARC-1)
- advantages of in-order instruction scheduling: a simpler implementation, a faster clock cycle, fewer transistors

Out-of-order instruction execution:
- instructions are fetched in compiler-generated order
- instruction completion may be in-order (today) or out-of-order (older computers)
- in between, instructions may be executed in some other order
- instructions are dynamically scheduled by the hardware: the hardware decides in what order instructions can be executed
- instructions behind a stalled instruction can pass it (see the sketch below)
- advantages: higher performance, better at hiding latencies (less processor stalling), higher utilization of functional units
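A minimal MIPS sketch of instructions passing a stalled one (register choices are mine, not the slides'):

    lw   R1, 0(R5)      # suppose this load misses in the cache
    addu R2, R1, R6     # depends on R1: must wait in either model
    addu R3, R4, R7     # independent: a dynamically scheduled machine
                        # can execute it while the lw is outstanding;
                        # an in-order machine stalls it behind the addu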

In-order Instruction Issue: Alpha 21164

Instruction slotting:
- after decode, instructions are issued to functional units
- constraints on functional unit capabilities, and therefore on the types of instructions that can issue together
- an example: 2 ALUs, 1 load/store unit, 1 FPU; 1 ALU does shifts and integer multiplies, the other executes branches
- can issue up to 4 instructions
- the instruction buffer is completely emptied before it is filled again; the compiler can pad with nops so that the second (conflicting) instruction is issued with the following instructions, not alone (see the sketch after this list)
- no data dependences in the same issue cycle (some exceptions)
- hardware to detect data hazards and control the bypass logic

21164 Instruction Unit Pipeline (fetch & issue):
- S0: instruction fetch; branch prediction bits read
- S1: opcode decode; target address calculation; if predicted taken, redirect the fetch; instruction TLB check
- S2: instruction slotting: decide which of the next 4 instructions can be issued; intra-cycle structural hazard check; intra-cycle data hazard check
- S3: instruction dispatch; inter-cycle load-use and WAW data hazard checks; inter-cycle structural hazard check; register read
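A hedged sketch of the nop-padding idea, in MIPS-style syntax (the instruction mix, the fetch-group boundaries, and the assumption that shifts and multiplies compete for one ALU are mine):

    # Without padding; a fetch-group boundary falls after the mul.
    # Buffer load 1: { sll, mul }   Buffer load 2: { addu, lw }
    sll  R1, R2, 2      # cycle 1
    mul  R3, R4, R5     # needs the same ALU as sll: issues alone in
                        # cycle 2, and the buffer must empty before
                        # addu and lw can enter it
    addu R6, R7, R8     # cycle 3
    lw   R9, 0(R10)     # cycle 3

    # With padding, the mul lands in the second buffer load and
    # issues with the following instructions instead of alone.
    # Buffer load 1: { sll, nop }   Buffer load 2: { mul, addu, lw }
    sll  R1, R2, 2      # cycle 1
    nop                 # cycle 1
    mul  R3, R4, R5     # cycle 2 (shift/multiply ALU)
    addu R6, R7, R8     # cycle 2 (the other ALU)
    lw   R9, 0(R10)     # cycle 2 (load/store unit)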

21164 Integer Pipeline

Execute (2 pipelines):
- S4: integer execution; effective address calculation
- S5: conditional move & branch execution; data cache access
- S6: register write

There is also a 9-stage FP pipeline.

In-order Instruction Issue: UltraSPARC-1

Shift register model:
- can issue up to 4 instructions per cycle
- any instruction in (almost) any slot: choose among 2 integer, 2 FP/graphics, 1 load/store, and 1 branch unit
- shift in new instructions after every group of instructions is issued
- some data-dependent instructions can issue in the same cycle

[Figure: UltraSPARC-1 datapath]

Superscalars: Performance Impact

- increased performance, because multiple instructions execute in parallel rather than just overlapped:
  - CPI potentially < 1 (0.5 on our R3000 example)
  - IPC (instructions/cycle) potentially > 1 (2 on our R3000 example)
  - better functional unit utilization
- but:
  - need to fetch more instructions (how many?)
  - need independent instructions, i.e., good ILP (why?)
  - need a good local mix of instructions (why?)
  - need more instructions to hide load delays (why?)
  - need to make better branch predictions (why?)
  (see the worked check below)
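A quick worked check of those numbers (the fractions in the second sentence are mine, not the slides'): a 2-way superscalar that fills both issue slots every cycle sustains IPC = 2, so CPI = 1/IPC = 0.5. If dependences, a poor instruction mix, unhidden load delays, or branch mispredictions leave half the slots empty, IPC falls back to 1 and CPI to 1, no better than a scalar pipeline; the questions above ask exactly what it takes to keep the slots full.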

Code Scheduling on Superscalars

Original code:

    Loop: lw   R1, 0(R5)
          addu R1, R1, R6
          sw   R1, 0(R5)
          addi R5, R5, -4
          bne  R5, R0, Loop

With latency-hiding code scheduling (the addi is moved above the sw, so the sw offset changes from 0 to 4):

    Loop: lw   R1, 0(R5)
          addi R5, R5, -4
          addu R1, R1, R6
          sw   R1, 4(R5)
          bne  R5, R0, Loop

Slotted into a 2-way superscalar pipeline (one ALU/branch slot, one data transfer slot). In cycle 1 the lw and the addi issue together, and the lw reads the old value of R5, so its 0 offset is still correct; after the decrement, the sw must use offset 4.

          ALU/branch instruction   Data transfer instruction   clock cycle
    Loop: addi R5, R5, -4          lw  R1, 0(R5)               1
                                                               2
          addu R1, R1, R6                                      3
          bne  R5, R0, Loop        sw  R1, 4(R5)               4

Code Scheduling on Superscalars: Loop Unrolling

Unrolled four times and scheduled for the same 2-way pipeline:

          ALU/branch instruction   Data transfer instruction   clock cycle
    Loop: addi R5, R5, -16         lw R1, 0(R5)                1
                                   lw R2, 12(R5)               2
          addu R1, R1, R6          lw R3, 8(R5)                3
          addu R2, R2, R6          lw R4, 4(R5)                4
          addu R3, R3, R6          sw R1, 16(R5)               5
          addu R4, R4, R6          sw R2, 12(R5)               6
                                   sw R3, 8(R5)                7
          bne  R5, R0, Loop        sw R4, 4(R5)                8

What are the cycles per iteration? What is the IPC? (See the worked answer below.)

Loop unrolling provides:
+ fewer instructions that cause hazards (i.e., branches)
+ more independent instructions (from different iterations)
+ an increase in throughput
- increased register pressure
- offsets must be changed

Superscalars: Hardware Impact

- more functional units, and pipelined ones
- multi-ported registers for multiple simultaneous register accesses
- more buses from the register file to the additional functional units
- multiple decoders
- more hazard detection logic
- more bypass logic
- wider instruction fetch
- a multi-banked L1 data cache

Otherwise the processor has structural hazards (due to an unbalanced design) and stalls. There are restrictions on the instruction types that can be issued together, to reduce the amount of hardware; static (compiler) scheduling helps.
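A worked answer to the questions above: the unrolled schedule issues 14 instructions (1 addi + 4 lw + 4 addu + 4 sw + 1 bne) in 8 cycles, so IPC = 14/8 = 1.75, and the 8 cycles cover 4 original iterations, so cycles per iteration = 8/4 = 2. Compare the non-unrolled 2-way schedule: 5 instructions in 4 cycles, IPC = 1.25, at 4 cycles per iteration.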

Modern Superscalars

- Alpha 21364: 4 instructions
- Pentium 4: 5 RISC-like operations dispatched to functional units
- R12000: 4 instructions
- UltraSPARC-3: 6 instructions dispatched