CS 320 Ch. 16 Superscalar Machines

A superscalar machine is one in which multiple independent instruction pipelines allow completion of more than one instruction per cycle. A superpipelined machine is one in which a single pipeline is double-clocked so that two instructions are executed per clock cycle. Instruction-level parallelism is a measure of the degree to which multiple instructions can be executed in parallel.

Superscalar machines can execute instructions out of order. Either the compiler or the CPU hardware recognizes instruction sequences, determines dependencies, and rearranges the instructions accordingly.

The author lists five limitations on instruction-level parallelism:

    1) True data dependency (RAW)
    2) Procedural dependency
    3) Resource conflicts
    4) Output dependency (WAW)
    5) Antidependency (WAR)

Here is an example of a true data dependency:

    add r1, r2
    mov r3, r1    ; can't do this until after the add

In the pipeline, a true data dependency forces a NOP to be inserted so that the first instruction can complete. A procedural dependency is the dependency of the instructions following a branch on the branch itself: without executing the branch, you don't know what instruction to execute next.

Instruction-level parallelism vs. machine-level parallelism: instruction-level parallelism exists when instructions have no dependencies and thus can be executed in parallel. Machine-level parallelism is the ability of a machine to exploit instruction-level parallelism. Instructions come out of memory in program order. The processor can then reorder these instructions and issue them to the pipeline in a different order to achieve some parallelism and efficiency. The rules governing this process are called the instruction issue policy. There are three instruction issue policies in use:

    1) In-order issue with in-order completion
    2) In-order issue with out-of-order completion
    3) Out-of-order issue with out-of-order completion
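Since the three register dependencies (RAW, WAW, WAR) reduce to how instructions read and write registers, they can be detected mechanically from each instruction's read set and write set. The following is a minimal Python sketch of that check; the classify helper and the dictionary format are invented for illustration, not from the text:

    # Classify the register dependencies of `second` on `first`.
    # Each instruction is described by the sets of registers it
    # reads and writes.
    def classify(first, second):
        deps = []
        if first["writes"] & second["reads"]:
            deps.append("RAW (true data dependency)")
        if first["writes"] & second["writes"]:
            deps.append("WAW (output dependency)")
        if first["reads"] & second["writes"]:
            deps.append("WAR (antidependency)")
        return deps

    # The true-data-dependency example above:
    add_instr = {"reads": {"r1", "r2"}, "writes": {"r1"}}  # add r1, r2
    mov_instr = {"reads": {"r1"}, "writes": {"r3"}}        # mov r3, r1
    print(classify(add_instr, mov_instr))  # ['RAW (true data dependency)']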

The following figure illustrates the three issue policies: A) in-order issue with in-order completion, B) in-order issue with out-of-order completion, C) out-of-order issue with out-of-order completion. In the figure: I1 requires 2 cycles to execute; I3 and I4 conflict for the same functional unit; I5 depends on the value produced by I4; I5 and I6 conflict for a functional unit. The instruction window that sits between the decode unit and the execute unit is a buffer that holds decoded instructions that are ready for execution. This buffer is significant because it is what allows the processor to look ahead and determine which instructions can be done in parallel and in what order.
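A minimal Python sketch of out-of-order issue from such a window: each cycle, any decoded instruction whose operands are ready and whose functional unit is free may issue, regardless of program order. The unit names, latencies, and the simplified timing model are assumptions for illustration, not from the text:

    # Toy out-of-order issue. Each instruction names its functional unit,
    # its latency in cycles, and the instructions whose results it needs.
    instrs = {
        "I1": {"unit": "ALU1", "latency": 2, "deps": []},
        "I2": {"unit": "ALU2", "latency": 1, "deps": []},
        "I3": {"unit": "MUL",  "latency": 1, "deps": []},
        "I4": {"unit": "MUL",  "latency": 1, "deps": []},     # conflicts with I3
        "I5": {"unit": "ADD",  "latency": 1, "deps": ["I4"]},
        "I6": {"unit": "ADD",  "latency": 1, "deps": []},     # conflicts with I5
    }

    done_at = {}            # instruction -> cycle its result is ready
    window = list(instrs)   # decoded instructions awaiting issue
    cycle = 0
    while window:
        busy = set()        # functional units claimed this cycle
        for name in list(window):
            ins = instrs[name]
            ready = all(done_at.get(d, float("inf")) <= cycle
                        for d in ins["deps"])
            if ready and ins["unit"] not in busy:
                busy.add(ins["unit"])
                done_at[name] = cycle + ins["latency"]
                window.remove(name)
                print(f"cycle {cycle}: issue {name}")
        cycle += 1

Running it issues I6 ahead of I4 and I5, which is exactly what distinguishes policy C from the two in-order policies.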

The following code is an example of an output dependency (WAW):

    I1: R3 ← R3 op R5
    I2: R4 ← R3 + 1
    I3: R3 ← R5 + 1
    I4: R7 ← R3 op R4

I2 cannot execute before I1 since it needs the result I1 leaves in R3. This is a data dependency. I4 cannot execute before I3 for similar reasons. But I1 and I3 have an output dependency: if I3 executes before I1, then the wrong value will be left in R3.

The same code also illustrates an antidependency (WAR):

    I1: R3 ← R3 op R5
    I2: R4 ← R3 + 1
    I3: R3 ← R5 + 1
    I4: R7 ← R3 op R4

I3 cannot complete before I2, since I2 uses the old value of R3 and not the value created by I3. This is like a data dependency, only reversed in time.

Register renaming: to avoid antidependencies and output dependencies, a register may be allocated in two forms: one form keeps the old value and one form gets the new value. This allows the old value to be used while the new value is calculated. This renaming is done in hardware. The author cites simulation studies indicating that without register renaming, the gains from added functional units are not significant.
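A minimal Python sketch of that renaming idea (the rename function, the initial mapping, and the (dest, sources) encoding are assumptions for illustration, not the actual hardware algorithm), applied to the four instructions above:

    # Toy register renaming: every write to an architectural register
    # allocates a fresh physical register; every read is redirected to
    # the most recent physical copy of its source.
    def rename(instructions, num_arch_regs=8):
        mapping = {f"R{i}": f"P{i}" for i in range(num_arch_regs)}
        next_phys = num_arch_regs
        renamed = []
        for dest, srcs in instructions:
            new_srcs = [mapping[s] for s in srcs]   # read current copies
            mapping[dest] = f"P{next_phys}"         # fresh copy for the write
            next_phys += 1
            renamed.append((mapping[dest], new_srcs))
        return renamed

    prog = [
        ("R3", ["R3", "R5"]),   # I1: R3 <- R3 op R5
        ("R4", ["R3"]),         # I2: R4 <- R3 + 1
        ("R3", ["R5"]),         # I3: R3 <- R5 + 1
        ("R7", ["R3", "R4"]),   # I4: R7 <- R3 op R4
    ]
    for dest, srcs in rename(prog):
        print(dest, "<-", srcs)
    # I1 now writes P8 and I3 writes P10, so the WAW on R3 is gone, and
    # I3 no longer has to wait for I2's read of the old R3 (WAR gone too).
    # The RAW chains (I1->I2, I3->I4) remain, as they must.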

*********************************************

PENTIUM 4

The Pentium 4 architecture is said to be a CISC shell around a RISC core. The inherited CISC instruction set is converted to micro-operations that are more RISC-like. These are fed into a superscalar pipeline.

The figure below shows the Pentium 4 block diagram. Answer the questions below related to this diagram.

A) The P4 is said to have an in-order front end. Instructions are fetched from memory in static program order.

B) The diagram shows three caches. The L2 cache holds data and instructions. The trace cache is the L1 instruction cache. The L1 D cache is the L1 data cache.

C) The P4 instructions are of variable length (an inheritance that goes all the way back to the 8080). The trace cache gets 64 bytes at a time from the L2 cache. The fetch/decode unit scans the bytes and finds instruction boundaries. Each instruction is broken up into from one to four micro-ops, each of which is 118 bits long. These RISC-like instructions are then used in the instruction pipeline.

D) The 118-bit micro-ops are needed to accommodate the CISC instructions of the P4.

E) After being generated, the micro-ops go into the L1 instruction cache (the trace cache).

F) Branch history is kept at the trace cache and the branch target buffer (BTB) logic. The BTB keeps 4 bits of branch history; since the pipe is 20 stages long, accurate branch prediction is essential. If a branch has no history, static branch prediction is used.

G) Static branch prediction uses three rules for predicting conditional branches with no history (sketched below): 1) if the branch is not IP-relative, predict taken if it is a return and not taken otherwise; 2) if it is IP-relative and backward, predict taken; 3) if it is IP-relative and forward, predict not taken.

H) The micro-ops are executed out of order. The order depends on their dependencies.

I) Some instructions may require the execution of hundreds of micro-instructions. An instruction that requires more than four micro-ops is handed off to a microprogram ROM, which in turn generates the hundreds of micro-instructions if necessary.
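Those three static rules are simple enough to express directly. Here is a minimal Python sketch; the function name and its arguments are hypothetical, but the rule logic follows the text:

    # Static prediction for a conditional branch with no BTB history.
    # Returns True for "predict taken".
    def predict_static(ip_relative, backward=False, is_return=False):
        if not ip_relative:
            return is_return      # rule 1: returns taken, others not taken
        if backward:
            return True           # rule 2: backward (loop-like) -> taken
        return False              # rule 3: forward -> not taken

    # A backward IP-relative branch (e.g., a loop bottom) is predicted taken:
    print(predict_static(ip_relative=True, backward=True))   # True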

J) The P4 does register renaming. It remaps the 16 architectural registers onto a set of 128 physical registers, thereby removing antidependencies and output dependencies.

K) Note that register renaming does not eliminate true data dependencies. That is not possible.

L) If a branch prediction is wrong, the micro-ops are flushed from the pipe and a correction is made to the branch predictor logic.

M) It is critical that the P4 pipeline do a good job of branch prediction, because the pipe is so long that it is expensive to flush.

ARM Cortex-M3

The figure below is an architectural diagram of the ARM Cortex-M3 processor. In the center of the diagram is a Memory Protection Unit. This unit restricts access to data within a process so that two processes don't interfere with one another. It is used in multiprocess applications.

This processor has a three-stage pipeline: Fetch, Decode, Execute.

There is also a Wake-up Interrupt Controller, which allows the processor to go to sleep in a very low-power mode and wake up in response to an external stimulus.

This processor uses the Thumb-2 instruction set, which consists mostly of 16-bit Thumb-1 instructions plus a few (5) 32-bit instructions.

In the instruction decode stage of the pipeline the hardware decodes the instruction, generates addresses for the load/store unit, and handles branches based on an offset or on the link register. The link register is a one-level return-address stack built into the register set (a sketch of this appears below).

This processor does not do branch prediction; that would be too complicated for a microcontroller. To deal with branches the processor instead uses branch forwarding and branch speculation. In branch forwarding, some branches are pushed ahead by one cycle in the pipeline to speed things up. In branch speculation, the target of a conditional branch is fetched speculatively before it is known whether the branch will be taken.
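The link register being only one level deep matters for nested calls, as the following minimal Python model shows. The Core class and the addresses are invented for illustration; on the real M3, software must push LR before making a nested call:

    # Toy model of a link register. A branch-with-link (bl) saves the
    # return address in LR; a return (bx lr) jumps back through it.
    # There is no hardware stack behind LR, so a nested bl clobbers it.
    class Core:
        def __init__(self):
            self.lr = None

        def bl(self, return_addr):
            if self.lr is not None:
                print(f"nested call: saved LR {self.lr:#x} is lost "
                      "(software must push it first)")
            self.lr = return_addr

        def ret(self):
            addr, self.lr = self.lr, None
            return addr

    core = Core()
    core.bl(0x1000)          # outer call
    core.bl(0x2000)          # nested call: warns that 0x1000 is lost
    print(hex(core.ret()))   # 0x2000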