Computer Hardware Engineering


Computer Hardware Engineering (IS1200), spring 2015. Lecture 6: Pipelined Processors. Associate Professor, KTH Royal Institute of Technology; Assistant Research Engineer, University of California, Berkeley.

Course structure: Module 1: Logic Design (lab: dicom); Module 2: C and Assembly Programming (home lab: C); Module 3: Processor Design; Module 4: I/O Systems (labs: nios2io, nios2int, nios2time); Module 5: Memory Hierarchy (home lab: cache); Module 6: Parallel Processors and Programs.

Abstractions in Computer Systems. The layers, from top to bottom: Networked Systems and Systems of Systems; Computer System; Application Software and Operating System (Software); Instruction Set Architecture (Hardware/Software Interface); Microarchitecture, Logic and Building Blocks (Digital Hardware Design); Digital Circuits and Analog Circuits; Devices and Physics (Analog Design and Physics). Agenda. (Photo by Raysonho @ Open Grid Scheduler / Grid Engine, own work; licensed under Creative Commons Zero, Public Domain.)

Acknowledgement: The structure and several of the good examples are derived from the book Digital Design and Computer Architecture (2012) by D. M. Harris and S. L. Harris.

Jump Path (Revisited). [Single-cycle datapath figure: PC, instruction memory, register file, Sign Extend, ALU, data memory; instruction fields 25:21, 20:16, 15:11, 15:0, 25:0, and PC+4[31:28]; control signals RegWrite, RegDst, ALUSrc, ALUControl, Zero, Branch, MemWrite, MemToReg, Jump.]
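The control signals shown in the datapath figure are produced by the control unit. As a minimal sketch, the main decoder can be written as a lookup table; the signal values below follow the standard single-cycle MIPS design in Harris & Harris, with opcode names used in place of binary opcodes for readability.

```python
# Main decoder sketch: opcode -> control signals, in the style of the
# single-cycle MIPS design from Harris & Harris.

MAIN_DECODER = {
    # op      (RegWrite, RegDst, ALUSrc, Branch, MemWrite, MemToReg, Jump, ALUOp)
    "R-type": (1, 1, 0, 0, 0, 0, 0, 0b10),  # e.g. add, sub, and, or, slt
    "lw":     (1, 0, 1, 0, 0, 1, 0, 0b00),
    "sw":     (0, 0, 1, 0, 1, 0, 0, 0b00),  # RegDst/MemToReg are don't-cares
    "beq":    (0, 0, 0, 1, 0, 0, 0, 0b01),
    "j":      (0, 0, 0, 0, 0, 0, 1, 0b00),
}

def control_signals(op):
    """Return the named control signals for one instruction class."""
    names = ["RegWrite", "RegDst", "ALUSrc", "Branch",
             "MemWrite", "MemToReg", "Jump", "ALUOp"]
    return dict(zip(names, MAIN_DECODER[op]))

print(control_signals("lw"))
# {'RegWrite': 1, 'RegDst': 0, 'ALUSrc': 1, 'Branch': 0,
#  'MemWrite': 0, 'MemToReg': 1, 'Jump': 0, 'ALUOp': 0}
```

The ALU decoder (not sketched here) would then combine ALUOp with the funct field to produce ALUControl.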

Control Unit (Revisited). Decoding the instruction: the op field drives the Main Decoder, which produces the control signals to the datapath (RegWrite, RegDst, ALUSrc, Branch, MemWrite, MemToReg, Jump) plus a 2-bit ALUOp signal; the 6-bit funct field drives the ALU Decoder, which produces ALUControl.

Performance Analysis (Revisited). Execution time (in seconds) = (# instructions) × (clock cycles / instruction) × (seconds / clock cycle). The number of instructions in a program (# = number of) is determined by the programmer, the compiler, or both. The average cycles per instruction (CPI) is determined by the microarchitecture implementation. Seconds per cycle is the clock period T_C, determined by the critical path in the logic. For the single-cycle processor, each instruction takes one clock cycle; that is, CPI = 1. The main problem with the single-cycle processor design (last lecture) is the long critical path. Solution: pipelining.
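The execution-time equation can be turned into a two-line calculation. The instruction count, CPI values, and clock periods below are made-up numbers for illustration, not measurements from the lecture:

```python
# Execution time model from the slide:
#   time = (# instructions) x (cycles / instruction) x (seconds / cycle)

def execution_time(num_instructions, cpi, clock_period_s):
    """Return program execution time in seconds."""
    return num_instructions * cpi * clock_period_s

# Single-cycle processor: CPI = 1, but the long critical path forces a long clock period.
single_cycle = execution_time(100_000_000, cpi=1.0, clock_period_s=5e-9)  # 5 ns cycle
# Pipelined processor: CPI slightly above 1, but a much shorter clock period.
pipelined = execution_time(100_000_000, cpi=1.2, clock_period_s=1e-9)     # 1 ns cycle

print(round(single_cycle, 3), round(pipelined, 3))  # 0.5 0.12
```

Even with a slightly worse CPI, the shorter clock period makes the pipelined machine faster on the same instruction count.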

Parallelism and Pipelining (1/6): Definitions. Processing system: a system that takes inputs and produces outputs. Token: an input that is processed by the processing system and results in an output. Latency: the time it takes for the system to process one token. Throughput: the number of tokens that can be processed per time unit.

Parallelism and Pipelining (2/6): Sequential Processing. Example: assume we have a Christmas card factory with two machines, M1 and M2. M1 prints the card (takes 6 s); M2 puts on a stamp (takes 4 s). Approach 1: process tokens sequentially (in this case a token is a card). The latency is 6 + 4 = 10 s. The throughput is 1/10 = 0.1 tokens per second, or 6 tokens per minute.

Parallelism and Pipelining (3/6): Parallel Processing (Spatial Parallelism). Example: assume the Christmas card factory now has four machines. Approach 2: process tokens in parallel using more machines. M1 and M3 print cards (6 s each); M2 and M4 put on stamps (4 s each). The latency is still 6 + 4 = 10 s, but the throughput is 2 × 1/10 = 0.2 tokens per second, or 12 tokens per minute.

Parallelism and Pipelining (4/6): Pipelining (Temporal Parallelism). Example: assume the factory has only the two machines. Approach 3: process tokens by pipelining, using only M1 and M2: the factory starts the production of a new card every 6 seconds. The latency is still 6 + 4 = 10 s, but the throughput is 1/6 ≈ 0.1666 tokens per second, or 10 tokens per minute.

Parallelism and Pipelining (5/6): Summary. Approach 1, sequential with two machines: latency 10 s, throughput 6 tokens/min. Approach 2, parallel with four machines: latency 10 s, throughput 12 tokens/min. Approach 3, pipelining with only two machines: latency 10 s, throughput 10 tokens/min. We improve throughput, but not latency. Spatial parallelism adds extra machines, but pipelining does not. Throughput improvements are limited by the slowest machine (in this case M1).

Parallelism and Pipelining (6/6): Performance Analysis for Pipelining. Idea: we introduce a pipeline in the processor. How does this affect the execution time? Execution time (in seconds) = (# instructions) × (clock cycles / instruction) × (seconds / clock cycle). Pipelining does not change the number of instructions. Pipelining will not improve the CPI (actually, it makes it slightly worse). Pipelining will improve the clock period (make the critical path shorter).
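The factory arithmetic can be sanity-checked in a few lines, assuming (as the slides state) that printing takes 6 s and stamping takes 4 s:

```python
# Latency and throughput of the Christmas card factory for the three approaches.
T_M1, T_M2 = 6.0, 4.0  # printing and stamping times in seconds

# Approach 1: sequential -- one card at a time through both machines.
seq_latency = T_M1 + T_M2                # 10 s per card
seq_throughput = 60 / seq_latency        # cards per minute

# Approach 2: spatial parallelism -- two identical production lines.
par_latency = T_M1 + T_M2                # still 10 s per card
par_throughput = 2 * 60 / par_latency    # twice the sequential rate

# Approach 3: pipelining -- start a new card every 6 s (the slowest machine).
pipe_latency = T_M1 + T_M2               # still 10 s per card
pipe_throughput = 60 / max(T_M1, T_M2)   # limited by the slowest machine

print(seq_throughput, par_throughput, pipe_throughput)  # 6.0 12.0 10.0
```

Note that all three latencies are identical: both forms of parallelism improve throughput only.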

Towards a Pipelined Datapath (1/8). Recall the single-cycle datapath (the logic for the j and beq instructions is hidden): PC, instruction memory, register file, Sign Extend, ALU, data memory.

Towards a Pipelined Datapath (2/8): Fetch Stage. A register splits the datapath into stages, forming a pipeline. First, we introduce an instruction fetch stage (F).

Towards a Pipelined Datapath (3/8): Decode Stage. A decode stage (D) decodes an instruction and reads out values from the register file.

Towards a Pipelined Datapath (4/8): Execute Stage. An execute stage (E) performs the computation using the ALU.

Towards a Pipelined Datapath (5/8): Memory Stage. Reading and writing to memory is done in the memory stage (M).

Towards a Pipelined Datapath (6/8): Writeback Stage. The results are written back to the register file in the writeback stage (W). Can you see a problem with the writeback?

Towards a Pipelined Datapath (7/8): Writeback Stage. Note that the register file is read in the decode stage, but written to in the writeback stage. The write register address must be forwarded along the pipeline to the correct stage!

Towards a Pipelined Datapath (8/8): Another Issue. Can you see another issue? The program counter can be updated in the wrong stage (the PC is incremented by 4, or changed when branching). Solution not shown in the slides.

Five-Stage Pipeline. In each cycle a new instruction is fetched, but it takes 5 cycles to complete an instruction. In each cycle, all stages are handling different instructions in parallel. Program:

add $s0, $s1, $s2 (cycles 1-5: F D E M W)
sub $t0, $t1, $t2 (cycles 2-6: F D E M W)
addi $t3, $0, 55 (cycles 3-7: F D E M W)
xori $t4, $t5, … (cycles 4-8: F D E M W)
and $t6, $s0, $s1 (cycles 5-9: F D E M W)

Example: in cycle 6, the result of the sub instruction is written back to register $t0. We can fill the pipeline because there are no dependencies between the instructions. Exercise: what is the ALU doing in cycle 5? Answer: adding together the values 0 and 55.
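The timing rule behind the five-stage diagram (with no stalls, instruction i occupies stage s in cycle i + s + 1, counting cycles from 1) can be sketched as follows; the mnemonics are the ones from the example program above:

```python
# Stage occupancy in an ideal (stall-free) five-stage pipeline.
STAGES = ["F", "D", "E", "M", "W"]
PROGRAM = ["add", "sub", "addi", "xori", "and"]

def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based) in a 1-based cycle,
    or None if the instruction is not in the pipeline that cycle."""
    s = cycle - 1 - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None

# In cycle 5 the third instruction (addi) is in the execute stage,
# so the ALU is adding 0 and 55, as in the slide's exercise.
print(stage_in_cycle(2, 5))  # E
# The sub instruction (index 1) writes back in cycle 6.
print(stage_in_cycle(1, 6))  # W
```

Printing `stage_in_cycle(i, c)` for all i and c reproduces the diagonal pattern of the slide's pipeline diagram.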

Data Hazards (1/4): Read After Write (RAW). Program:

add $s0, $s1, $s2
sub $t0, $s0, $t1
and $t2, $t0, $s0
xori $t3, $s0, 2

The add instruction writes back the value of $s0 in cycle 5, but $s0 is used in the decode stage of sub already in cycle 3. A data hazard occurs when an instruction reads a register that has not yet been written. This kind of data hazard is called a read-after-write (RAW) hazard. The and instruction will also use the wrong value for $s0. Exercise: for MIPS, will the xori instruction result in a hazard? Stand for yes, sleep for no. Answer: no. xori is OK for MIPS, because the register file writes in the first part of the cycle (falling edge) and reads in the second part (rising edge).

Data Hazards (2/4): Solution 1, Forwarding. The result from the execute stage of add can be forwarded (also called bypassing) to the execute stage of sub. Hazard detection is implemented using a hazard detection unit that gives control signals to the datapath when data should be forwarded. The and instruction's hazard is solved by forwarding as well. Can all data hazards be solved using forwarding?

Data Hazards (3/4): Solution 1, Forwarding (partially). Program:

lw $s0, 2($s2)
sub $t0, $s0, $t1
and $t2, $t0, $s0
xori $t3, $s0, 2

Exercise: which of the instructions sub, and, and xori have data hazards? Which can be solved using forwarding? Answer: sub and and have hazards; and can use forwarding. The sub hazard cannot be solved using forwarding, because the memory access result is available at the end of cycle 4 but is needed at the beginning of cycle 4. The and instruction's memory result can be forwarded from the memory stage to the execute stage. xori can read the data from the writeback stage (the register file writes in the first part of the cycle and reads in the second part).

Data Hazards (4/4): Solution 2, Stalling. When forwarding does not work, we need to stall the pipeline: stages are repeated and the fetch of xori is delayed. After stalling, the result can be forwarded to the execute stage. The unused stage is called a bubble. Stalling results in more than one cycle per instruction.
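The forwarding-versus-stalling decision can be sketched as a small classifier. This is a simplification of the lecture's hazard detection unit: it assumes results can be forwarded from the execute or memory stage to a later execute stage, and that a producer three or more instructions earlier needs nothing, because the register file writes in the first half of a cycle and reads in the second:

```python
# RAW hazard classifier for a 5-stage MIPS-like pipeline (simplified sketch).

def raw_action(producer, consumer, distance):
    """Classify the RAW dependence between two instructions.

    producer / consumer: (mnemonic, dest_reg, src_regs)
    distance: how many instructions later the consumer appears (1, 2, ...).
    """
    _, dest, _ = producer
    _, _, srcs = consumer
    if dest not in srcs or distance > 2:
        return "no hazard"          # register-file write-then-read handles it
    if producer[0] == "lw" and distance == 1:
        return "stall, then forward"  # loaded value arrives too late for E
    return "forward"

lw = ("lw", "$s0", ["$s2"])
add = ("add", "$s0", ["$s1", "$s2"])
sub = ("sub", "$t0", ["$s0", "$t1"])

print(raw_action(lw, sub, 1))   # stall, then forward
print(raw_action(add, sub, 1))  # forward
print(raw_action(lw, sub, 3))   # no hazard
```

A real hazard unit compares register numbers between pipeline registers every cycle; this sketch makes the same comparison once, offline.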

Control Hazards (1/5): Assume Branch Not Taken. Program:

20: beq $s0, $s2, 10
24: sub $t0, $s0, $t1
28: and $t2, $t0, $s0
2C: xori $t3, $s0, 2
...
64: addi $t4, $s0, …

The processor computes the branch target address and compares for equality in the execute (E) stage. If the branch is taken, the PC is updated in the memory (M) stage. If the branch is taken, we need to flush the pipeline: we have a branch misprediction penalty of 3 cycles. Can we improve this?

Control Hazards (2/5): Improving the Pipeline. Add an equality comparison for beq in the decode stage (not shown here), and move the branch target address calculation to the decode stage. Right now, the branch comparison is done in the execute stage.

Control Hazards (3/5): Assume Branch Not Taken. With the comparison and target calculation moved earlier, the decode stage can change the next PC, so that the instruction at the branch-taken address is fetched directly. The branch misprediction penalty is now reduced to 1 cycle. Note that we may now introduce another data hazard (if the operands are not available in the decode stage); this can be solved with forwarding or stalling.

Control Hazards (4/5): Deeper Pipelines. Why do we sometimes want more than 5 stages? The critical path can be shorter with less logic in the slowest stage, so the processor can have a higher clock frequency. For instance, Intel's Core 2 Duo has more than 10 pipeline stages. Why not always have more pipeline stages? It adds hardware (registers), and the branch misprediction penalty increases!
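The cost of control hazards can be folded into the CPI term of the execution-time equation. The branch frequency and misprediction rate below are hypothetical numbers for illustration; the 3-cycle and 1-cycle penalties are the ones derived in the slides:

```python
# Effective CPI with control hazards: every mispredicted branch flushes
# the pipeline and costs a fixed penalty in cycles.

def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """CPI including the average per-instruction cost of branch flushes."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Branch resolved in the memory stage: 3-cycle penalty.
print(round(effective_cpi(1.0, 0.20, 0.5, 3), 2))  # 1.3
# Branch resolved in the decode stage: 1-cycle penalty.
print(round(effective_cpi(1.0, 0.20, 0.5, 1), 2))  # 1.1
```

The same formula shows why deeper pipelines hurt: a larger penalty multiplies the same branch and misprediction rates.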

Control Hazards (5/5): Deeper Pipelines. How can we handle deep pipelines and minimize misprediction? Static branch predictors statically (at compile time) determine whether a branch is taken or not; for instance, predict branch not taken. Dynamic branch predictors dynamically (at runtime) predict whether a branch will be taken or not. A dynamic predictor operates in the fetch stage. It maintains a table, called the branch target buffer, that contains hundreds or thousands of executed branch instructions, their destinations, and information about whether the branches were taken or not.

ARMv7. The most popular ISA for embedded devices (9 billion devices, growing by 2 billion a year). More complex addressing modes than MIPS (can do shift and add of addresses in registers in one instruction). Condition results are saved in special flags: negative, zero, carry, overflow. 16 registers, each 32-bit (integers). 32-bit instruction size (Thumb mode: 16-bit encoding). Conditional execution of instructions, depending on condition codes. Example: the ARM Cortex-A8, a processor at 1 GHz with a 13-stage pipeline and a branch predictor.

x86. Standard in laptops, PCs, and in the cloud. CISC instructions are more powerful than those of ARM and MIPS, but require more complex hardware. The architecture has evolved over the last 35 years; there are 16-, 32-, and 64-bit variants. 8 general-purpose registers (eax, ebx, ecx, edx, esp, ebp, esi, edi). Variable-length instruction encoding (between 1 and 15 bytes). Arithmetic operations allow the destination operand to be in memory. Major manufacturers are Intel and AMD.

Summary. Some key take-away points: Pipelining is a temporal way of achieving parallelism. Pipelined processors improve performance by reducing the clock period (shorter critical path). Pipelining introduces pipeline hazards; there are two main kinds of hazards: data hazards and control hazards. Data hazards are solved by forwarding or stalling. Control hazards are solved by flushing the pipeline, and mitigated by branch prediction. Thanks for listening!