These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions.

Similar documents
COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. The Processor

Pipelining. CSC Friday, November 6, 2015

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESI

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Design of the MIPS Processor

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 4: Datapath and Control

Processor: Multi- Cycle Datapath & Control

Chapter 4. The Processor

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: MIPS Instruction Set Architecture

Design of the MIPS Processor (contd)

CPE 335. Basic MIPS Architecture Part II

ECE260: Fundamentals of Computer Engineering

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: MIPS Instruction Set Architecture

ECE369. Chapter 5 ECE369

LECTURE 3: THE PROCESSOR

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

COMPUTER ORGANIZATION AND DESIGN

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

Topic #6. Processor Design

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

TDT4255 Computer Design. Lecture 4. Magnus Jahre. TDT4255 Computer Design

Topic Notes: MIPS Instruction Set Architecture

Chapter 4. The Processor

Multicycle Approach. Designing MIPS Processor

BASIC COMPUTER ORGANIZATION. Operating System Concepts 8 th Edition

Processor design - MIPS

ECE 486/586. Computer Architecture. Lecture # 7

Lecture 4: Instruction Set Architecture

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Instruction Set Principles. (Appendix B)

Processor (IV) - advanced ILP. Hwansoo Han

The Processor: Instruction-Level Parallelism

COSC 6385 Computer Architecture - Pipelining

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

Pipelining. Maurizio Palesi

ECEC 355: Pipelining

Preventing Stalls: 1

CS/COE1541: Introduction to Computer Architecture

Instruction Pipelining

Chapter 5: The Processor: Datapath and Control

What is Pipelining? RISC remainder (our assumptions)

Instruction Pipelining

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

CENG 3531 Computer Architecture Spring a. T / F A processor can have different CPIs for different programs.

Computer Architecture. Lecture 6.1: Fundamentals of

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

Processor (II) - pipelining. Hwansoo Han

Lecture 5 and 6. ICS 152 Computer Systems Architecture. Prof. Juan Luis Aragón

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Data Paths and Microprogramming

Chapter 4 The Processor 1. Chapter 4A. The Processor

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

ENE 334 Microprocessors

Chapter 4 The Processor (Part 2)

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

CISC / RISC. Complex / Reduced Instruction Set Computers

Advanced Computer Architecture

Full Datapath. Chapter 4 The Processor 2

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation

The Processor: Datapath & Control

MIPS ISA AND PIPELINING OVERVIEW Appendix A and C

CC 311- Computer Architecture. The Processor - Control

5008: Computer Architecture HW#2

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Implementing the Control. Simple Questions

Full Datapath. Chapter 4 The Processor 2

CS 61C: Great Ideas in Computer Architecture. Lecture 13: Pipelining. Krste Asanović & Randy Katz

CSEE 3827: Fundamentals of Computer Systems

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Chapter 4. The Processor

Lecture 8: Control COS / ELE 375. Computer Architecture and Organization. Princeton University Fall Prof. David August

Processor (I) - datapath & control. Hwansoo Han

Final Exam Fall 2007

C.1 Introduction. What Is Pipelining? C-2 Appendix C Pipelining: Basic and Intermediate Concepts

CS 230 Practice Final Exam & Actual Take-home Question. Part I: Assembly and Machine Languages (22 pts)

EIE/ENE 334 Microprocessors

ECE 154A Introduction to. Fall 2012

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

CSE 2021 COMPUTER ORGANIZATION

Parallelism. Execution Cycle. Dual Bus Simple CPU. Pipelining COMP375 1

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design

Computer Architecture

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Comprehensive Exams COMPUTER ARCHITECTURE. Spring April 3, 2006

Thomas Polzer Institut für Technische Informatik

Where Does The Cpu Store The Address Of The

Pipelined CPUs. Study Chapter 4 of Text. Where are the registers?

CSEE 3827: Fundamentals of Computer Systems

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

A Model RISC Processor. DLX Architecture

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

October 24. Five Execution Steps

Transcription:

MIPS Pipe Line

2 Introduction Pipelining To complete an instruction a computer needs to perform a number of actions. These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions. It is a vital technique in the quest for more powerful computers. Clock rate is technology Pipelining is the clever use of that technology. Pipelining predates the retreat from speed by Intel and AMD Assembly line: different stages are completing different steps on different objects. Each stage is a pipe stage or segment The pipeline connects them all. Pipelining does not increase the speed at which the first instruction completes. Pipelining increases the number of instructions which finish per second (in the steady state)

3 Principles Design overview is the core of the design Only use hardware, where there are net performance gains. Principles: Simple instructions and few addressing modes: CISC includes many ways to address memory. May need several parameters, uses microcode and may need several cycles just to calculate an address. RISC has simple address modes. Every instruction needs only one cycle per pipeline stage. Direct, pointers, offset Register-Register (Load/Store)design: The only way registers and memory interact is via a load or store operation. All other operations involve only registers. CISC supports arithmeticologic operations on memory

4 Principles Principles: Pipelining: Multistage pipeline which allows the CPU to perform more than one instruction at a time. The predictability (and similarity) of the time for all instructions aids in creating an efficient pipeline. Hardware control no (or a little) microcode: No micro-coded ROM to execute complex instructions. All instructions directly in hardware for speed (and simplicity) Reliance on optimising compilers: Optimising Compilers don t just create low level instructions to implement the high level constructs. Reorder instructions, use of the registers to minimise memory accesses. Simplicity of opcodes, consistency of timings and absence of complex addressing modes. Not true of VAX, nor Inte All ease problem of compiler writing

5 Principles Principles: High performance Memory Hierarchy: Need to keep pace with CPU. Introduce memory/cache hierarchy including large number of registers Fast static RAM split cache. D-cache data cache I-cache instruction cache Write buffers on chip memory management.

6 Simple RISC Notes ALU connects to the buses and thence to the register file of 32 GPRs Only load/store connect registers with the D-Cache Instruction fetch: fetches a single instruction per cycle from the I-Cache at an address given by the PC. Instruction fetch is controlled by the pipeline decode and control unit. The PC is incremented by 4 after each fetch. (Byte addressable and 32 bit words). PC can be loaded by a jump or branch target address. Aim: 1 instruction per clock cycle. Achievement depends on cache and pipelining.

7 Clock Clock Assembly line all items pass onto the next stage at the same time. all instructions pass onto the next stage at the same time. The time that each stage takes is a Processor cycle The cycle must leave time for the slowest stage to complete. Need to balance the work done on each cycle. Processor cycle is usually one clock cycle of the machine, sometimes two. Not more on RISC machines In a perfectly balanced pipeline the instruction throughput is just p times the unpipelined machine. Where p is the number of stages. overhead CPI measure of performance. Decrease in average time per completed instruction. Often measure as Clock Cycles per Instruction Needs no input from the programmer to work

8 Basic RISC RISC architecture Not comprehensive review Data Operations only on registers Only load and store operations on memory half and double word options Few instructions all 1 word 32 Integer General Purpose Registers (GPR) Instruction Set The value will fill 32 or 64 bits. ALU : Two registers to a third Register & signed extended immediate Ops include ADD, SUB, AND, OR. arithmetic ADDU Immediate versions of the ops Signed and unsigned Immediate Actual value #3 Load/ Store address. or sink. Register source (base register) and an immediate field (offset). Sum makes the effective Second register is the source Branch/ Conditional transfer of control Jumps Unconditional transfer. Destination is a signed extended offset added to the PC Will consider register comparisons BNE

9 Instruction Simple Execution Cycle 1. Instruction Fetch (IF). Send the Program Counter (PC) to memory and fetch next instruction. Update PC by adding 4. Instructions are 4 bytes Things that can be done without penalty Write to cache 2. Instruction Decode (ID) / Register Fetch a.decode the instruction b.read the registers c.equality test on registers as read d.sign extend the offset field e.compute possible branch target address by adding offset to PC Decode in parallel with Register, because the register specifiers are in a fixed place in the word fixed-field decoding May not need it, but it takes no extra time. Also calculate sign extended immediate. 3. Execution/effective address cycle (EX) ALU operates on operands prepared in 2 a.memory ref: Base register + offset to give effective address b.register-register execute op code c.register-immediate execute op code 4. Memory Access (MEM): Load, read using effective address, write the data from the second register using the first effective address from the first register 5. Write-back cycle (WB): Write the result into the register whether from the memory or the ALU In case required In case required Internal parallel Branch: 2 cycles Store: 4 cycles Others; 5 cycles

10 Simple RISC Block diagram ALU only talks to the register buses decode and control ALU Register File PC Instruction Fetch Result Bus Operand Bus A Operand Bus B MMU MMU Data and instructions with separate paths and cache D I-cache A D D-cache A A: Address Line D: Data Line PC: Program Counter Main Memory

11 Overview This pipeline has five stages. n stage pipeline Instruction Fetch Instruction Decode/Register fetch Execute/ Address Calculation Memory Access Each stage should take one cycle. While the IF is fetching an instruction then Decode is decoding the previous instruction; and execute is doing the one before. No pipeline time for 1 instruction is k cycles. For n instructions non pipeline = n*k = k (for first instruction) +(n-1) (for the other n-1) = k+n-1 Speed-up = (n*k)/(k+n+1) = k/(k/n + (1+1/n) = k (as n goes to infinity) Speed-up = number of stages (ignoring stalls) Write Back

12 Stages Instruction stages IF ID EX MEM WB MIPS. See H&P, P&H and various other text books Look at an instruction architecture from the pipeline It has 5 stages each one takes one clock cycle. They are IF ID EX MEM WB Instruction Fetch Instruction Decode Execute Memory Write Back

13 IF Instruction Fetch IF ID EX MEM WB IF Instruction Fetch Get the instruction from memory (or more usually cache) Instructions often in I-cache. The cache for instructions (D-cache for data) PC incremented by 4 points to next instruction. If there has been a branch or jump set the PC from that instruction. Jump non sequential alteration of the PC. Always taken. Branch non sequential alteration of the PC, conditionally taken on the basis of values in the registers.

14 ID Instruction decode & register fetch IF ID EX MEM WB ID Instruction Decode Different actions depending on the sort of instruction. Register-Register Memory reference Control transfer Register-Register: modify the values of a register depending on values in other registers. and, or, add, sub Memory reference: commonly in RISC machines and certainly here. lw load from memory to registers or sw store from register back to memory Control transfer: jump or branch

15 ID Instruction decode & register fetch IF ID EX MEM WB Register-Register 31 26 25 2120 16 15 1110 6 5 0 Op Rs1 Rs2 Rd Opx Registers always in the same place. Opcode always the same length. Can set up access to the registers, while decoding the instruction. If registers not needed no penalty. If needed already there. Repeated theme, if you can do something without penalty, do it, even if not needed. Note there is nearly always some penalty, even if it is only power

16 EX Execute IF ID EX MEM WB Here the ALU operates on the operands which have been prepared in the decode cycle Memory reference: Calculates effective address by taking Base register and adding offset Arithmetic For register register op codes execute op code perform the arithmetic operation Similarly for register

17 MEM Memory Reference IF ID EX MEM WB Load or store accesses memory. A-L writes result from ALUout register

18 WB Write Back IF ID EX MEM WB Write from place the memory reference placed the data, into the register Step R-Type Mem Ref Branches Jumps IF Instruction Decode/ Memory fetch Instruction RegisterÜ Memory(PC) PC Ü PC + 4 A Ü Reg[IR(25-21)] B Ü Reg[IR(20-16)] ALUOut Ü PC + signextend (IR(15:0)) Execution ALUOut Ü A op B ALUOut Ü A + signextend IR(15:0) Mem access R type comp Mem read completion Reg(IR(15:11) Ü ALUout Load MDRÜ Mem[ALUout] Store memory[aluout] ) Ü B Load Reg(IR(20:16) Ü MDR if (A==B) PC Ü ALUout PC Ü PC + IR(25:0) shift

19 Five stage pipe Clock speed Look at MIPS used in Hennessey & Patterson (as well as Patterson & Hennessey) and other architecture books. Clean architecture not surprising designed by an academic Easy to pipeline. Start a new instruction on each clock cycle Each cycle becomes a pipe stage. Stage IF Register Read Each stage must take the same time. So each stage must go as slow as the slowest. Clock must be 200ps. (5 GHz) ALU Data Register Write Total Time 200ps 100ps 200ps 200ps 100ps 800ps 60mm light travel

20 Five stage pipe Instruction time Many instructions follow H&P in looking at five types Instruction IF Register Read Each stage must take the same time. So each stage must go as slow as the slowest. Clock must be 200ps. (5 GHz) All instructions need to take the same time, (single cycle) so the instructions with missing stages do nothing at that point in the pipeline. All instructions take 800ps. ALU Data Register Write Total Load Word lw 200ps 100ps 200ps 200ps 100ps 800ps Store Word sw 200ps 100ps 200ps 200ps 700ps Arithmetic add,sub, 200ps 100ps 200ps 100ps 600ps Branch beq 200ps 100ps 200ps 500ps Improves throughput but not latency Non-pipelined instructions take 800ps each. d instructions finish every 200ps. The speed up is approximately 1/stages. Assuming enough instructions to render the start up cost negligible (and no pipeline stalls).

21 Pipe timing Time flow IF Reg ALU Data Access Reg write IF Reg ALU Data Access Reg write IF Reg ALU Data Access Ignoring stalls In this case the first instruction takes 900ps, but the instruction rate is still one every 200ps. n instructions take 800*n ps sequential. 900 + 200*n ps pipelined Speed up=900 + 200 =1,125 + 0.25 0.25 n 800n n If the pipeline was perfectly balanced then the speed up for an k stage pipeline executing n instructions is k as n

22 Design Designing for pipeline MIPS all instructions the same length. c.f. IA-32. Instructions vary in length. Would make pipelining very hard. But instructions translated to microcode. Microcode is MIPS like. Microcode is executed in a pipeline. Complex to preserve backward compatibility Source register in the same place in all instructions. Register file can be accessed as instruction is decoded. Called uniform decode If register position depends on instruction. Must decode first Memory operands only appear in loads or stores. Can calculate the memory address here and access in following stage. Memory operands introduce an extra stage in the pipeline. Operands must be aligned transfers can always be completed in a single stage.