Introduction to Pipelined Datapath


14:332:331 Computer Architecture and Assembly Language Week 12 Introduction to Pipelined Datapath [Adapted from Dave Patterson's UCB CS152 slides and Mary Jane Irwin's PSU CSE331 slides] 331 W12.1

Review: Multicycle Data and Control Path [Figure: the multicycle datapath with a control FSM driving PCWrite, PCWriteCond, IorD, MemRead, MemWrite, MemtoReg, IRWrite, RegWrite, RegDst, PCSource, Op, SrcA, and SrcB; a single memory holds both instructions and data, and the IR, MDR, A, B, and ALUOut registers hold intermediate values between cycles.] 331 W12.2

Review: RTL Summary
Instr fetch (all): IR = Memory[PC]; PC = PC + 4;
Decode (all): A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2);
Execute - R-type: ALUOut = A op B; Mem ref: ALUOut = A + sign-extend(IR[15-0]); Branch: if (A == B) PC = ALUOut; Jump: PC = PC[31-28] || (IR[25-0] << 2);
Memory access - Load: MDR = Memory[ALUOut]; Store: Memory[ALUOut] = B;
Writeback - R-type: Reg[IR[15-11]] = ALUOut; Load: Reg[IR[20-16]] = MDR;
331 W12.3
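
To make the step-by-step RTL concrete, here is a minimal Python sketch of the multicycle steps for an R-type add and a lw; the decoded-dictionary instruction encoding, the register/memory contents, and the variable names are illustrative assumptions, not part of the original slides.

# Minimal sketch of the multicycle RTL for two instructions (add, lw).
# Instructions are kept as decoded dicts instead of 32-bit words to keep
# the example short; field names mirror the RTL above (rs, rt, rd, imm).
regs = {i: 0 for i in range(32)}        # register file
mem = {}                                # unified instruction/data memory
regs[2], regs[3] = 5, 7
mem[100] = {"op": "add", "rs": 2, "rt": 3, "rd": 1}       # add r1,r2,r3
mem[104] = {"op": "lw",  "rs": 0, "rt": 4, "imm": 200}    # lw  r4,200(r0)
mem[200] = 42                                             # data word

pc = 100
for _ in range(2):
    ir = mem[pc]; pc = pc + 4                        # Step 1: instruction fetch
    a, b = regs[ir["rs"]], regs[ir.get("rt", 0)]     # Step 2: decode / register fetch
    if ir["op"] == "add":
        alu_out = a + b                              # Step 3: execute
        regs[ir["rd"]] = alu_out                     # Step 4 (R-type): write back
    elif ir["op"] == "lw":
        alu_out = a + ir["imm"]                      # Step 3: address calculation
        mdr = mem[alu_out]                           # Step 4: memory access
        regs[ir["rt"]] = mdr                         # Step 5: write back

print(regs[1], regs[4])   # expected: 12 42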

Review: Multicycle Datapath FSM [State diagram: unless otherwise assigned, PCWrite, IRWrite, MemWrite, and RegWrite are 0 and the other signals are don't cares. State 0 (Instr Fetch): IorD=0, MemRead, IRWrite, SrcA=0, SrcB=01, Op=00, PCSource=00, PCWrite. State 1 (Decode): SrcA=0, SrcB=11, Op=00. State 2 (Execute, lw/sw): SrcA=1, SrcB=10, Op=00. State 3 (Memory Access, lw): MemRead, IorD=1. State 4 (Write Back, lw): RegDst=0, RegWrite, MemtoReg=1. State 5 (Memory Access, sw): MemWrite, IorD=1. State 6 (Execute, R-type): SrcA=1, SrcB=00, Op=10. State 7 (Write Back, R-type): RegDst=1, RegWrite, MemtoReg=0. State 8 (beq): SrcA=1, SrcB=00, Op=01, PCSource=01, PCWriteCond. State 9 (j): PCSource=10, PCWrite.] 331 W12.4
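
One compact way to capture this controller in software is a per-state table of asserted control signals plus a next-state function keyed on the opcode. The sketch below is only an illustration of the FSM above; the Python dictionary encoding and the opcode strings are assumptions.

# Illustrative FSM table for the multicycle control above: each state maps to
# the control signals asserted in that state; anything not listed defaults to
# 0 (or "don't care"), as the slide notes.
CONTROL = {
    0: {"MemRead": 1, "IRWrite": 1, "IorD": 0, "SrcA": 0, "SrcB": 0b01,
        "Op": 0b00, "PCSource": 0b00, "PCWrite": 1},            # Instr Fetch
    1: {"SrcA": 0, "SrcB": 0b11, "Op": 0b00},                   # Decode
    2: {"SrcA": 1, "SrcB": 0b10, "Op": 0b00},                   # lw/sw address calc
    3: {"MemRead": 1, "IorD": 1},                               # lw memory access
    4: {"RegDst": 0, "RegWrite": 1, "MemtoReg": 1},             # lw write back
    5: {"MemWrite": 1, "IorD": 1},                              # sw memory access
    6: {"SrcA": 1, "SrcB": 0b00, "Op": 0b10},                   # R-type execute
    7: {"RegDst": 1, "RegWrite": 1, "MemtoReg": 0},             # R-type write back
    8: {"SrcA": 1, "SrcB": 0b00, "Op": 0b01, "PCSource": 0b01,
        "PCWriteCond": 1},                                      # beq
    9: {"PCSource": 0b10, "PCWrite": 1},                        # jump
}

def next_state(state, opcode):
    """Next-state logic keyed on the current state and the instruction opcode."""
    if state == 0: return 1
    if state == 1: return {"lw": 2, "sw": 2, "rtype": 6, "beq": 8, "j": 9}[opcode]
    if state == 2: return 3 if opcode == "lw" else 5
    if state == 3: return 4
    if state == 6: return 7
    return 0                    # states 4, 5, 7, 8, 9 return to instruction fetch

# Example: a lw walks through states 0 -> 1 -> 2 -> 3 -> 4 and back to 0.
state, trace = 0, []
for _ in range(5):
    trace.append(state)
    state = next_state(state, "lw")
print(trace)   # [0, 1, 2, 3, 4]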

Review: FSM Implementation [Figure: the control FSM is implemented as combinational control logic whose inputs are the opcode bits Op5-Op0 (Inst[31-26]) and the current state held in a state register clocked by the system clock; its outputs are the datapath control signals (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, Op, SourceB, SourceA, RegWrite, RegDst) and the next state.] 331 W12.5

Single Cycle Disadvantages & Advantages: Uses the clock cycle inefficiently - the clock cycle must be timed to accommodate the slowest instruction, so a short instruction such as sw wastes much of the cycle that a lw needs. Is wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared within a clock cycle. But it is simple and easy to understand. 331 W12.6

Multicycle Advantages & Disadvantages: Uses the clock cycle efficiently - the clock cycle is timed to accommodate the slowest instruction step; balance the amount of work to be done in each step and restrict each step to use only one major functional unit. Multicycle implementations allow functional units to be used more than once per instruction (as long as they are used on different clock cycles), faster clock rates, and different instructions to take a different number of clock cycles. But they require additional internal state registers, more muxes, and more complicated (FSM) control. 331 W12.7

The Five Stages of a Load Instruction: lw passes through IFetch, Dec, Exec, Mem, and WB in cycles 1 through 5. IFetch: instruction fetch and update PC. Dec: register fetch and instruction decode. Exec: execute R-type; calculate memory address. Mem: read/write the data from/to the data memory. WB: write the data back to the register file. 331 W12.8

Single Cycle vs. Multiple Cycle Timing: In the single cycle implementation, one long clock cycle covers an entire lw, and a following sw wastes the unused part of its cycle. In the multiple cycle implementation, lw takes five short cycles (IFetch, Dec, Exec, Mem, WB), sw takes four, and the next instruction begins in cycle 10; the multicycle clock is slower than 1/5th of the single cycle clock due to stage flipflop overhead. 331 W12.9

Pipelined MIPS Processor: Start the next instruction while still working on the current one. This improves throughput - the total amount of work done in a given time. In the pipeline diagram, lw, sw, and an R-type instruction each flow through IFetch, Dec, Exec, Mem, and WB, offset by one cycle, so one instruction completes per cycle once the pipeline is full. Instruction latency (execution time, delay time, response time) - the time from the start of an instruction to its completion - is not reduced. 331 W12.10

Single Cycle, Multiple Cycle, vs. Pipeline: The single cycle implementation finishes a load and a store in two long cycles (with wasted time in the store's cycle). The multiple cycle implementation takes ten short cycles to get through lw (5 cycles), sw (4 cycles), and the start of an R-type instruction. The pipeline implementation overlaps the same three instructions - lw, sw, and R-type each going through IFetch, Dec, Exec, Mem, WB - so one completes every cycle once the pipeline is full (every instruction occupies all five stages, so sw's unused WB slot is a wasted cycle for that unit). 331 W12.11
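
As a rough sanity check of this comparison, the following Python sketch counts cycles for the three organizations on the same lw/sw/R-type sequence. The cycle times (an 800 ps single-cycle clock and a 200 ps per-stage clock) are assumed for illustration and are not from the slides.

# Rough cycle/time accounting for a lw + sw + R-type sequence.
# Cycle times are assumed for illustration only.
single_cycle_time = 800e-12     # one long cycle fits the slowest instruction (lw)
short_cycle_time  = 200e-12     # multicycle / pipeline clock (per stage)

cycles_per_instr = {"lw": 5, "sw": 4, "rtype": 4}   # multicycle step counts
program = ["lw", "sw", "rtype"]

t_single    = len(program) * single_cycle_time
t_multi     = sum(cycles_per_instr[i] for i in program) * short_cycle_time
stages      = 5
t_pipelined = (stages + len(program) - 1) * short_cycle_time  # fill, then 1 per cycle

print(f"single cycle: {t_single*1e12:.0f} ps")     # 2400 ps
print(f"multicycle:   {t_multi*1e12:.0f} ps")      # 2600 ps
print(f"pipelined:    {t_pipelined*1e12:.0f} ps")  # 1400 ps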

Pipelining the MIPS ISA. What makes it easy: all instructions are the same length (32 bits); few instruction formats (three) with symmetry across formats; memory operations can occur only in loads and stores; operands must be aligned in memory, so a single data transfer requires only one memory access. What makes it hard: structural hazards - what if we had only one memory? control hazards - what about branches? data hazards - what if an instruction's input operands depend on the output of a previous instruction? 331 W12.12

MIPS Pipeline Datapath Modifications: What do we need to add/modify in our MIPS datapath? State registers between the pipeline stages to isolate them: the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB registers separate the instruction memory, register file, ALU, data memory, and write-back mux, all clocked by the system clock. 331 W12.13

MIPS Pipeline Control Path Modifications: All control signals are determined during Decode and held in the state registers between pipeline stages, travelling down the pipeline with the instruction they control. 331 W12.14

Graphically Representing MIPS Pipeline: Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? is there a hazard, why does it occur, and how can it be fixed? 331 W12.15

Why Pipeline? For Throughput! With instructions Inst 0 through Inst 4 flowing down the pipeline (time in clock cycles on one axis, instruction order on the other), the first few cycles are spent filling the pipeline; once the pipeline is full, one instruction is completed every cycle. 331 W12.16

Can pipelining get us into trouble? Yes: pipeline hazards. Structural hazards: an attempt to use the same resource by two different instructions at the same time. Data hazards: an attempt to use an item before it is ready - an instruction depends on the result of a prior instruction still in the pipeline. Control hazards: an attempt to make a decision before the condition is evaluated - branch instructions. Can always resolve hazards by waiting: pipeline control must detect the hazard and take action (or delay action) to resolve it. 331 W12.17

A Unified Memory Would Be a Structural Hazard: With a single memory for instructions and data, when the lw reads its data from memory (its Mem stage), a later instruction is reading its instruction from the same memory (its IFetch stage) in the same cycle - two instructions need the one memory at once. 331 W12.18

How About Register File Access? The first add writes its result to the register file in its WB stage during the same cycle in which a later add is reading its registers in its Dec stage. Can fix this register file access hazard by doing reads in the second half of the cycle and writes in the first half. 331 W12.19

Register Usage Can Cause Data Hazards: Dependencies backward in time cause hazards. Consider the sequence add r1,r2,r3; sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5 - several of the later instructions read r1 before the add has written it back to the register file. 331 W12.20
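
A simple way to see which of these dependences actually become hazards is to check, for each instruction, whether a source register is written by one of the instructions immediately ahead of it that is still in the pipeline. The sketch below does this for the sequence above; the 2-instruction hazard window is an assumption that matches a 5-stage pipeline whose register file writes in the first half cycle and reads in the second.

# Detect read-after-write (RAW) hazards in a short instruction sequence.
# Each instruction is (dest, src1, src2).  Instruction i hazards with an
# earlier instruction j if j writes a register that i reads and j is still
# in the pipeline (here: within the previous 2 instructions).
program = [
    ("r1", "r2", "r3"),   # add r1,r2,r3
    ("r4", "r1", "r5"),   # sub r4,r1,r5
    ("r6", "r1", "r7"),   # and r6,r1,r7
    ("r8", "r1", "r9"),   # or  r8,r1,r9
    ("r4", "r1", "r5"),   # xor r4,r1,r5
]

HAZARD_WINDOW = 2   # assumed distance over which a result is not yet readable

for i, (dest, src1, src2) in enumerate(program):
    for back in range(1, HAZARD_WINDOW + 1):
        j = i - back
        if j >= 0 and program[j][0] in (src1, src2):
            print(f"RAW hazard: instruction {i} reads {program[j][0]} "
                  f"written by instruction {j}")
# Prints hazards for the sub and the and only; the or and xor read r1 late
# enough that the half-cycle register file fix covers them.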

One Way to Fix a Data Hazard: Can fix the data hazard by waiting - stall the dependent instructions (two stall cycles before the sub and the and in the example) - but stalling affects throughput. 331 W12.21

Another Way to Fix a Data Hazard: Can fix the data hazard by forwarding results as soon as they are available to where they are needed - the add's ALU result is routed directly to the sub and the and, so the sequence add r1,r2,r3; sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5 runs without stalls. 331 W12.22

Loads Can Cause Data Hazards: Dependencies backward in time cause hazards. In the sequence lw r1,100(r2); sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5, the loaded value is not available until the end of the lw's Mem stage, so even forwarding cannot deliver it to the sub in time. 331 W12.23

Stores Can Cause Data Hazards: Dependencies backward in time cause hazards. In the sequence add r1,r2,r3; sw r1,100(r5); and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5, the sw needs r1 (the value to store) before the add has written it back to the register file. 331 W12.24

Forwarding with Load-use Data Hazards 331 W12.25
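
As a rough illustration of the load-use case (the standard result for a 5-stage pipeline, not spelled out in the transcribed text): even with forwarding, a loaded value exists only at the end of the Mem stage, so an immediately following consumer needs one stall cycle. The sketch below computes the minimum stalls under those assumed stage positions (IF=1, Dec=2, Exec=3, Mem=4, WB=5).

# Minimum stall cycles between a producer and a dependent consumer in a
# classic 5-stage pipeline with full forwarding into the Exec stage.
def stalls_needed(producer, distance):
    """producer: 'alu' or 'load'; distance: how many instructions later the
    consumer appears (1 = the very next instruction)."""
    result_ready = {"alu": 3, "load": 4}[producer]   # stage whose end produces the value
    # Without stalls the consumer enters Exec in cycle 3 + distance, so the
    # result must exist by the end of cycle 2 + distance to be forwarded.
    return max(0, result_ready - (2 + distance))

print(stalls_needed("alu", 1))    # 0: ALU result forwards to the very next instruction
print(stalls_needed("load", 1))   # 1: a load-use needs one stall even with forwarding
print(stalls_needed("load", 2))   # 0: one independent instruction in between is enough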

Branch Instructions Cause Control Hazards: Dependencies backward in time cause hazards. After the add and the beq enter the pipeline, the following instructions (lw, Inst 3, Inst 4) are fetched before the branch outcome is known. 331 W12.26

One Way to Fix a Control Hazard: Can fix the branch hazard by waiting - insert stall cycles after the beq until the branch decision is known, before fetching the lw and Inst 3 - but stalling affects throughput. 331 W12.27

Other Pipeline Structures Are Possible: What about a (slow) multiply operation? Let it take two cycles (an extra MUL stage). What if the data memory access is twice as slow as the instruction memory? Either make the clock twice as slow, or let the data memory access take two cycles (DM1 and DM2) and keep the same clock rate. 331 W12.28

Sample Pipeline Alternatives [Figure: pipeline stage diagrams for the ARM7 (three stages: IM - PC update, IM access; Reg - decode, reg access; EX - op, shift/rotate, DM access, commit result/write back), the StrongARM-1, and the XScale (a deeper pipeline whose stages include BTB access, IM1/IM2 instruction fetch, register access, shift/rotate (SHFT), a DM access split into DM1/DM2, DM write, and register write, with exception handling).] 331 W12.29

Summary: All modern day processors use pipelining. Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload - multiple tasks operate simultaneously using different resources. Potential speedup = number of pipe stages, but the pipeline rate is limited by the slowest pipeline stage, unbalanced lengths of pipe stages reduce the speedup, and the time to fill the pipeline and the time to drain it reduce the speedup. Hazards must be detected and resolved, and stalling negatively affects throughput. 331 W12.30
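
To make the "potential speedup = number of pipe stages" claim and its caveats concrete, here is a small Python sketch comparing the ideal speedup with one limited by the slowest stage and the fill time; the stage latencies and instruction count are made-up illustrative values.

# Pipeline speedup: ideal vs. limited by the slowest stage and fill time.
# Stage latencies (ps) and the instruction count are assumed for illustration.
stage_latencies = [200, 150, 250, 300, 150]   # unbalanced 5-stage pipeline
n_instructions = 1_000_000

unpipelined_cycle = sum(stage_latencies)      # one long cycle does everything
pipelined_cycle   = max(stage_latencies)      # clock set by the slowest stage

t_unpipelined = n_instructions * unpipelined_cycle
t_pipelined   = (len(stage_latencies) + n_instructions - 1) * pipelined_cycle

print("ideal speedup: ", len(stage_latencies))                   # 5
print("actual speedup:", round(t_unpipelined / t_pipelined, 2))  # about 3.5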

Performance. Purchasing perspective: given a collection of machines, which has the best performance? the least cost? the best performance / cost? Design perspective: faced with design options, which has the best performance improvement? the least cost? the best performance / cost? Both require a basis for comparison and a metric for evaluation. Our goal is to understand the cost and performance implications of architectural choices. 331 W12.31

Two notions of performance
Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200
Which has higher performance? Time to do the task (Execution Time): execution time, response time, latency. Tasks per day, hour, week, sec, ns... (Performance): throughput, bandwidth. Response time and throughput are often in opposition. 331 W12.32

Definitions. Performance is in units of things-per-second - bigger is better. If we are primarily concerned with response time: performance(X) = 1 / execution_time(X). "X is n times faster than Y" means n = Performance(X) / Performance(Y). 331 W12.33

Example. Flying time of Concorde vs. Boeing 747? The Concorde is 1350 mph / 610 mph = 2.2 times faster (equivalently, 6.5 hours / 3 hours = 2.2). Throughput of Concorde vs. Boeing 747? The Concorde is 178,200 pmph / 286,700 pmph = 0.62 times "faster", i.e. the Boeing is 286,700 pmph / 178,200 pmph = 1.6 times faster. So the Boeing is 1.6 times (60%) faster in terms of throughput, while the Concorde is 2.2 times (120%) faster in terms of flying time. We will focus primarily on execution time for a single job. 331 W12.34
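
A quick Python check of the arithmetic in this example, using the values from the table two slides back:

# Verify the Concorde vs. Boeing 747 comparison from the table above.
boeing   = {"time_hr": 6.5, "speed_mph": 610,  "throughput_pmph": 286_700}
concorde = {"time_hr": 3.0, "speed_mph": 1350, "throughput_pmph": 178_200}

# "n times faster" = ratio of performances = inverse ratio of execution times
speed_ratio = boeing["time_hr"] / concorde["time_hr"]
print(f"Concorde is {speed_ratio:.1f}x faster in flying time")    # ~2.2x

throughput_ratio = boeing["throughput_pmph"] / concorde["throughput_pmph"]
print(f"Boeing is {throughput_ratio:.1f}x faster in throughput")  # ~1.6x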

Basis of Evaluation
Actual Target Workload - Pros: representative. Cons: very specific, non-portable, difficult to run or measure, hard to identify cause.
Full Application Benchmarks - Pros: portable, widely used, improvements useful in reality. Cons: less representative.
Small Kernel Benchmarks - Pros: easy to run, early in the design cycle. Cons: easy to fool.
Microbenchmarks - Pros: identify peak capability and potential bottlenecks. Cons: peak may be a long way from application performance.
331 W12.35

SPEC95: Eighteen application benchmarks (with inputs) reflecting a technical computing workload. Eight integer: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex. Ten floating-point intensive: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5. Must be run with standard compiler flags - this eliminates special undocumented incantations that may not even generate working code for real programs. 331 W12.36

Metrics of performance: Each level of the system has its natural metric. Application: answers per month, useful operations per second. Programming Language / Compiler / ISA: (millions of) instructions per second (MIPS), (millions of) floating-point operations per second (MFLOP/s). Datapath / Control: megabytes per second. Function Units / Transistors / Wires / Pins: cycles per second (clock rate). Each metric has a place and a purpose, and each can be misused. 331 W12.37

Aspects of CPU Performance
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
The three factors - instruction count, CPI, and clock rate - are influenced by the program, the compiler, the instruction set architecture, the organization, and the technology. 331 W12.38
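
A minimal Python sketch of this equation; the instruction count, CPI, and clock rate below are made-up illustrative values.

# CPU time = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
instruction_count = 50_000_000
cpi = 2.2
clock_rate_hz = 500e6              # 500 MHz -> cycle time of 2 ns

cpu_time = instruction_count * cpi * (1.0 / clock_rate_hz)
print(f"CPU time = {cpu_time:.3f} s")   # 0.220 s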

CPI - average cycles per instruction
CPI = (CPU Time x Clock Rate) / Instruction Count = Clock Cycles / Instruction Count
CPU time = Clock Cycle Time x SUM over i = 1..n of (CPI_i x I_i)
CPI = SUM over i = 1..n of (CPI_i x F_i), where F_i = I_i / Instruction Count is the "instruction frequency"
Invest resources where time is spent! 331 W12.39

Example (RISC processor) - Base Machine (Reg / Reg)
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
Typical Mix              2.2
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? How does this compare with using branch prediction to shave a cycle off the branch time? What if two instructions could be executed at once? 331 W12.40
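
The following Python sketch works these questions out from the table above; the speedups it prints are derived here from the table's frequencies and cycle counts rather than stated on the slide.

# Work out the CPI questions from the table above.
freq   = {"ALU": 0.50, "Load": 0.20, "Store": 0.10, "Branch": 0.20}
cycles = {"ALU": 1,    "Load": 5,    "Store": 3,    "Branch": 2}

def cpi(cyc):
    return sum(freq[op] * cyc[op] for op in freq)

base = cpi(cycles)                            # 2.2
better_cache = cpi({**cycles, "Load": 2})     # loads now take 2 cycles -> CPI 1.6
branch_pred  = cpi({**cycles, "Branch": 1})   # shave a cycle off branches -> CPI 2.0

print(f"base CPI          = {base:.2f}")
print(f"better data cache : speedup = {base / better_cache:.2f}")  # 2.2 / 1.6, about 1.37
print(f"branch prediction : speedup = {base / branch_pred:.2f}")   # 2.2 / 2.0, about 1.10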

Amdahl's Law. Speedup due to enhancement E:
Speedup(E) = ExTime without E / ExTime with E = Performance with E / Performance without E
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
ExTime(with E) = ((1 - F) + F/S) x ExTime(without E)
Speedup(with E) = 1 / ((1 - F) + F/S)
331 W12.41
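
A one-function Python sketch of Amdahl's Law. The example values are illustrative: F = 0.45 (the fraction of time the earlier CPI table attributes to loads) and S = 2.5 (loads going from 5 cycles to 2).

# Amdahl's Law: speedup when a fraction F of the task is sped up by a factor S.
def amdahl_speedup(F, S):
    return 1.0 / ((1.0 - F) + F / S)

print(round(amdahl_speedup(0.45, 2.5), 2))           # ~1.37, matching the CPI calculation
print(round(amdahl_speedup(0.45, float("inf")), 2))  # ~1.82: the limit even if loads were free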

Summary: Evaluating Instruction Sets? Design-time metrics: can it be implemented, in how long, at what cost? can it be programmed? ease of compilation? Static metrics: how many bytes does the program occupy in memory? Dynamic metrics: how many instructions are executed? how many bytes does the processor fetch to execute the program? how many clocks are required per instruction (CPI)? how "lean" a clock is practical? Best metric: time to execute the program! Time = Inst. Count x CPI x Cycle Time - note that this depends on the instruction set, the processor organization, and the compilation techniques. 331 W12.42