ECE 341. Lecture # 15

Similar documents
Lecture Topics ECE 341. Lecture # 14. Unconditional Branches. Instruction Hazards. Pipelining

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

Pipelining. CSC Friday, November 6, 2015

Complex Pipelines and Branch Prediction

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

Chapter 4. The Processor

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

ECE 486/586. Computer Architecture. Lecture # 7

Lecture Topics ECE 341. Lecture # 8. Functional Units. Information Processed by a Computer

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr

What is Pipelining? RISC remainder (our assumptions)

ECE 341 Final Exam Solution

ECE260: Fundamentals of Computer Engineering

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Advanced Computer Architecture

Chapter 4. The Processor

LECTURE 3: THE PROCESSOR

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07

ECE 587 Advanced Computer Architecture I

Lecture-13 (ROB and Multi-threading) CS422-Spring

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory

COMPUTER ORGANIZATION AND DESI

More advanced CPUs. August 4, Howard Huang 1

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

ECE 154A Introduction to. Fall 2012

Instruction Level Parallelism (ILP)

ECE 341. Lecture # 10

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Week 11: Assignment Solutions

ECE 486/586. Computer Architecture. Lecture # 12

Control Hazards. Branch Prediction

Instr. execution impl. view

SUPERSCALAR AND VLIW PROCESSORS

COMPUTER ORGANIZATION AND DESIGN

The Processor: Instruction-Level Parallelism

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Control Hazards. Prediction

COMPUTER ORGANIZATION AND DESIGN

What is Superscalar? CSCI 4717 Computer Architecture. Why the drive toward Superscalar? What is Superscalar? (continued) In class exercise

Advanced processor designs

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

Instruction Pipelining Review

Lecture 15: Pipelining. Spring 2018 Jason Tang

Processor Architecture V! Wrap-Up!

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards

LSU EE 4720 Dynamic Scheduling Study Guide Fall David M. Koppelman. 1.1 Introduction. 1.2 Summary of Dynamic Scheduling Method 3

Lecture Topics ECE 341. Lecture # 10. Register File. Hardware Components of a Processor

Lecture 26: Parallel Processing. Spring 2018 Jason Tang

Lecture: Pipelining Basics

Instruction Level Parallelism

In embedded systems there is a trade off between performance and power consumption. Using ILP saves power and leads to DECREASING clock frequency.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Advanced issues in pipelining

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

COSC 6385 Computer Architecture - Pipelining

Lecture Topics. Why Predict Branches? ECE 486/586. Computer Architecture. Lecture # 17. Basic Branch Prediction. Branch Prediction

Superscalar Processors

Portland State University ECE 587/687. Superscalar Issue Logic

CENG 3531 Computer Architecture Spring a. T / F A processor can have different CPIs for different programs.

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

Dynamic Scheduling. CSE471 Susan Eggers 1

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

Processor (IV) - advanced ILP. Hwansoo Han

Portland State University ECE 587/687. Memory Ordering

Pipelining and Vector Processing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

CSEE 3827: Fundamentals of Computer Systems

Full Datapath. Chapter 4 The Processor 2

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

Chapter 04: Instruction Sets and the Processor organizations. Lesson 20: RISC and converged Architecture

T T T T T T N T T T T T T T T N T T T T T T T T T N T T T T T T T T T T T N.

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

Chapter 9 Pipelining. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Computer Hardware Engineering

Preventing Stalls: 1

Lecture 1 An Overview of High-Performance Computer Architecture. Automobile Factory (note: non-animated version)

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Basic concepts UNIT III PIPELINING. Data hazards. Instruction hazards. Influence on instruction sets. Data path and control considerations

Pipeline Review. Review

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Cycle Time for Non-pipelined & Pipelined processors

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Copyright 2012, Elsevier Inc. All rights reserved.

Modern Processors. RISC Architectures

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Limitations of Scalar Pipelines

Transcription:

ECE 341 Lecture # 15 Instructor: Zeshan Chishti zeshan@ece.pdx.edu November 19, 2014 Portland State University

Pipelining Structural Hazards Pipeline Performance Lecture Topics Effects of Stalls and Penalties Superscalar Processors CISC Processors Reference: Chapter 6: Sections 6.7, 6.8, 6.9 and 6.10 (Pages 209 214, 218-220)

Structural Hazards A structural hazard occurs when there are insufficient hardware resources to permit all actions to proceed correctly In a pipelined processor, if two (or more) instructions require the use of a given resource in the same clock cycle, one instruction must be stalled to allow the other instruction to use the resource For example: Consider a processor that has a single cache that supports only one access per cycle An instruction fetch operation may need to be stalled to allow a load instruction in its memory stage to access the cache Structural hazards are prevented by providing additional hardware In the above example, using separate caches for instructions and data allows the Fetch and Memory stages to proceed simultaneously without stalling

Pipeline Performance

Performance Evaluation Recall that the execution time (T) of a program is given by: T = (N * CPI) / R Instruction throughput (P) is the rate of instruction execution: P = N / T = R / CPI For a non-pipelined processor with n stages, each instruction is completed in n cycles (assuming single-cycle memory access) and instructions are executed one after the other. Therefore throughput of a non-pipelined processor (P np ) is given by: P np = R / n

Performance Evaluation (cont.) For a pipelined processor, throughput depends on how frequently instructions can complete and come out of the pipeline In the ideal case of no stalls, one instruction is completed every cycle. Therefore the ideal throughput with pipelining (P p ) is given by: P p = R / 1 = R Clock rate R is determined by the delay of the longest pipeline stage For example, if the longest pipeline stage requires a delay of 2 ns, then R = 1/2 ns = 500 MHz & P p = 500 MIPS (million instructions per second)

Performance Evaluation (cont.) An n-stage pipeline has the potential to increase throughput by n times Two questions regarding pipeline performance: How much of the potential increase in instruction throughput can actually be realized in the presence of stalls? What is a good value for n (number of pipeline stages)?

Impact of Pipeline Stalls on Throughput In the presence of stalls, there are cycles in which no instruction is completed in a pipelined processor The impact of pipeline stalls on throughout depends on: How frequently do stalls occur? How much is the penalty for each stall in terms of number of cycles? To calculate the impact of stalls on performance, we need to first calculate the number of stall cycles per instruction (δ) δ is the product of stall frequency and stall penalty For example if 40% of the instructions stall for 1 cycle each, δ = 0.4 All the different sources of stalls are added to compute total δ Throughput in the presence of stalls is given by: P p = R / (1 + δ)

Impact of Data Hazards Recall that with operand forwarding, data hazards have no penalties except a 1-cycle penalty when a Load instruction is immediately followed by a dependent instruction Example: Assume that load instructions constitute 25% of dynamic instruction count and assume that 40% of Load instructions are followed by a dependent instruction. What is the reduction in throughput compared to ideal case? Solution: δ = Stall frequency * stall penalty = (0.25 * 0.4) * 1 = 0.1 P p = R / (1+ δ) = R / (1 + 0.1) = 0.91R Throughput is reduced by 9%

Impact of Branch Penalties Recall that branch prediction incurs a penalty whenever the branch predictor mispredicts the outcome of a branch instruction Branch penalty depends on when the actual branch decision and target address are computed If the branch computation is done in Decode stage, penalty is one cycle Example: Assume that branches constitute 20% of dynamic instruction count and the average branch prediction accuracy is 90%. What is the impact of branch misprediction on throughout? Solution: δ = Stall frequency * stall penalty = (0.2 * (1-0.9)) * 1 = 0.02 P p = R / (1+ δ) = R / (1 + 0.02) = 0.98R Throughput is reduced by 2%

Impact of Cache Misses Cache misses cause pipeline stalls in two scenarios: Cache miss during instruction fetch operation for any instruction Cache miss during data read/write operation for a Load/Store instruction These two components of pipeline stall are additive Example: Assume that 5% of all instruction fetch operations incur a cache miss, 30% of all instructions executed are Load/Store, and 10% of data accesses incur a cache miss. Assume that the penalty to access the main memory for a cache miss is 10 cycles? What is the impact of cache misses on throughput? Solution: δ = Stall frequency * stall penalty = (0.05 + (0.3 * 0.1)) * 10 = 0.8 P p = R / (1+ δ) = R / (1 + 0.8) = 0.56R Throughput is reduced by 44%

Number of Pipeline Stages Since an n-stage pipeline has the potential to increase the throughput by n times, how about we use a 10,000-stage pipeline? As the number of stages increase, the number of pipeline stalls also increase because: More instructions being overlapped => more potential data dependencies => higher stall probability Branch penalty may be larger than one cycle, if a longer pipeline moves the branch decision to a later stage There is also an inherent limit to pipelining, because it is impractical to place pipeline registers at arbitrarily fine granularities Some recent processor implementations have used ~ 20 pipeline stages, enabling clock rates of several GHz However further increases are unlikely, due to power/area constraints and limited performance benefits

Superscalar Processors

Overview The max. throughput of a pipelined processor is 1 instr. per cycle If we equip the processor with multiple execution units, each capable of executing an independent instruction, several instructions start execution in the same clock cycle multiple-issue Such processors are capable of achieving a throughput of more than one instruction per cycle superscalar processors Most modern high performance processors are superscalar processors To enable multiple-issue execution, a pipelined processor needs modification to all the pipeline stages Multiple instruction fetched per cycle and placed in an instruction queue Multiple instruction decoded in each cycle More register ports to read register operands for multiple instrs. in parallel

Multiple-Issue Execution To enable multiple-issue execution, a pipelined processor needs modification to all the pipeline stages Fetch stage: Multiple instruction fetched from the instruction cache per cycle and placed in an instruction queue Decode stage: Multiple instruction decoded in each cycle and sent to the appropriate execution units. More register read ports are needed to enable source register reads for multiple instructions Compute Stage: Multiple execution units (e.g., separate units for arithmetic instructions and load/store instructions) Writeback Stage: Multiple instructions write their results to register file in parallel => need more register write ports

Superscalar Processor Fetch Unit... Instruction queue Dispatch Unit Arithmetic Unit Load/Store Unit Write results

Example of Superscalar Execution Clock Cycle 1 2 3 4 5 6 Time Add R2, R3, #100 Load R5, 16(R6) Subtract R7, R8, R9 F D C M W F D C M W F D C M W Store R10, 24(R11) F D C M W Throughput = 2 instructions per clock cycle

Stalls in Superscalar Processors Even though superscalar processors use multiple execution units, pipeline hazards can still hurt their throughput. Examples: Structural Hazard: The execution units may not necessarily match the available instructions. For example, two arithmetic instruction may be ready for issue, but there is only arithmetic unit and one load/store unit Data Hazards: The instructions that are ready for issue may have a data dependency. The second instruction needs to stall, even if there is an idle execution unit (operand forwarding does not solve this problem) Instruction Hazards: All the instructions fetched and executed after a mispredicted branch need to be discarded and their results annulled

Example of Structural Hazard Time Clock Cycle 1 2 3 4 5 6 7 Add R2, R3, #100 Subtract R7, R8, R9 F D C M W F D C M W Stall due to unavailability of arithmetic unit Load R5, 16(R6) F D C M W Store R10, 24(R11) F D C M W

CISC Processors

Pipelining in CISC Processors CISC-style instructions complicate pipelining because of their irregular encoding formats complex behaviors Some of the common reasons why CISC pipelining is complex: Instructions have variable sizes => complicates fetching and decoding Instructions have multiple memory operands => some instructions require more time while others require less time in Memory stage; complicates the pipeline stalling logic Instructions use complex addressing modes which sometimes cause both the source and destination registers to change values => hard to track instruction dependences and complicates operand forwarding Complex pipelining in CISC processors was one of the key reasons for developing the RISC approach

Pipelining in Intel Processors Intel processors achieve high performance with superscalar execution and deep pipelines Intel Core-i7 uses issue width of 4 instructions and a 14-stage pipeline Even though Intel processors use CISC instruction sets, the execution hardware is designed like a RISC processor to reduce hardware complexity CISC-style instructions are dynamically converted into simpler RISCstyle micro-operations (micro-ops) Micro-ops are issued to the execution units to complete the tasks specified by the original CISC instructions This approach preserves (backward) code compatibility while enabling the advantages of simpler RISC-style pipelining