CS:APP Chapter 4 Computer Architecture Wrap-Up Randal E. Bryant Carnegie Mellon University

Similar documents
CS:APP Chapter 4 Computer Architecture Wrap-Up Randal E. Bryant Carnegie Mellon University

Processor Architecture V! Wrap-Up!

Wrap-Up. CS:APP Chapter 4 Computer Architecture. Overview. Performance Metrics. CPI for PIPE. Randal E. Bryant. Carnegie Mellon University

Wrap-Up. Lecture 10 Computer Architecture V. Performance Metrics. Overview. CPI for PIPE (Cont.) CPI for PIPE. Clock rate

Wrap-Up. Lecture 10 Computer Architecture V. Performance Metrics. Overview. CPI for PIPE (Cont.) CPI for PIPE. Clock rate

Giving credit where credit is due

For Tuesday. Finish Chapter 4. Also, Project 2 starts today

Systems I. Pipelining II. Topics Pipelining hardware: registers and feedback paths Difficulties with pipelines: hazards Method of mitigating hazards

Where Have We Been? Logic Design in HCL. Assembly Language Instruction Set Architecture (Y86) Finite State Machines

CS:APP Chapter 4 Computer Architecture Pipelined Implementation

CS:APP Chapter 4 Computer Architecture Sequential Implementation

CS429: Computer Organization and Architecture

Data Hazard vs. Control Hazard. CS429: Computer Organization and Architecture. How Do We Fix the Pipeline? Possibilities: How Do We Fix the Pipeline?

Systems I. Pipelining IV

CS:APP Chapter 4 Computer Architecture Sequential Implementation

Sequential Implementation

Computer Science 104:! Y86 & Single Cycle Processor Design!

Michela Taufer CS:APP

Computer Science 104:! Y86 & Single Cycle Processor Design!

Computer Science 104:! Y86 & Single Cycle Processor Design!

Systems I. Datapath Design II. Topics Control flow instructions Hardware for sequential machine (SEQ)

Sequential CPU Implementation.

Giving credit where credit is due

Processor Architecture II! Sequential! Implementation!

CS:APP Chapter 4! Computer Architecture! Sequential! Implementation!

Chapter 4! Processor Architecture!!

Computer Science 104:! Y86 & Single Cycle Processor Design!

Stage Computation: Arith/Log. Ops

Systems I. Datapath Design II. Topics Control flow instructions Hardware for sequential machine (SEQ)

CS:APP Chapter 4 Computer Architecture Sequential Implementation

God created the integers, all else is the work of man Leopold Kronecker

Systems I. Pipelining IV

CISC Fall 2009

Overview. CS429: Computer Organization and Architecture. Y86 Instruction Set. Building Blocks

Background: Sequential processor design. Computer Architecture and Systems Programming , Herbstsemester 2013 Timothy Roscoe

CS429: Computer Organization and Architecture

CS:APP Guide to Y86 Processor Simulators

The ISA. Fetch Logic

Giving credit where credit is due

Foundations of Computer Systems

Computer Organization Chapter 4. Prof. Qi Tian Fall 2013

CSC 252: Computer Organization Spring 2018: Lecture 11

Review. Computer Science 104 Machine Organization & Programming

Giving credit where credit is due

Midterm Exam CSC February 2009

CISC: Stack-intensive procedure linkage. [Early] RISC: Register-intensive procedure linkage.

CPSC 313, Summer 2013 Final Exam Solution Date: June 24, 2013; Instructor: Mike Feeley

CPSC 313, 04w Term 2 Midterm Exam 2 Solutions

Pipelining 4 / Caches 1

Chapter 4. The Processor

Term-Level Verification of a Pipelined CISC Microprocessor

cs281: Introduction to Computer Systems CPUlab Datapath Assigned: Oct. 29, Due: Nov. 3

Y86 Instruction Set. SEQ Hardware Structure. SEQ Stages. Sequential Implementation. Lecture 8 Computer Architecture III.

CPSC 313, Summer 2013 Final Exam Date: June 24, 2013; Instructor: Mike Feeley

Foundations of Computer Systems

cs281: Computer Systems CPUlab ALU and Datapath Assigned: Oct. 30, Due: Nov. 8 at 11:59 pm

Instruction Set Architecture

Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN

cmovxx ra, rb 2 fn ra rb irmovq V, rb 3 0 F rb V rmmovq ra, D(rB) 4 0 ra rb D mrmovq D(rB), ra 5 0 ra rb D OPq ra, rb 6 fn ra rb jxx Dest 7 fn Dest

Meltdown and Spectre: Complexity and the death of security

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

COMPUTER ORGANIZATION AND DESI

CPSC 313, Winter 2016 Term 1 Sample Date: October 2016; Instructor: Mike Feeley

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

Superscalar Processors Ch 14

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

COSC 6385 Computer Architecture - Pipelining

LECTURE 3: THE PROCESSOR

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

cmovxx ra, rb 2 fn ra rb irmovq V, rb 3 0 F rb V rmmovq ra, D(rB) 4 0 ra rb D mrmovq D(rB), ra 5 0 ra rb D OPq ra, rb 6 fn ra rb jxx Dest 7 fn Dest

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

CPSC 121: Models of Computation. Module 10: A Working Computer

CS 261 Fall Mike Lam, Professor. CPU architecture

Changelog. Changes made in this version not seen in first lecture: 13 Feb 2018: add slide on constants and width

EITF20: Computer Architecture Part2.2.1: Pipeline-1

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Chapter 4 Processor Architecture: Y86 (Sections 4.1 & 4.3) with material from Dr. Bin Ren, College of William & Mary

Full Datapath. Chapter 4 The Processor 2

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

Instruction Set Architecture

Write only as much as necessary. Be brief!

Instruction Set Architecture

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

F28HS Hardware-Software Interface: Systems Programming

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

CPSC 121: Models of Computation. Module 10: A Working Computer

Suggested Readings! Recap: Pipelining improves throughput! Processor comparison! Lecture 17" Short Pipelining Review! ! Readings!

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

SEQ part 3 / HCLRS 1

HW1 Solutions. Type Old Mix New Mix Cost CPI

1 Hazards COMP2611 Fall 2015 Pipelined Processor

EITF20: Computer Architecture Part2.2.1: Pipeline-1

CS 201. Code Optimization, Part 2

Transcription:

CS:APP Chapter 4 Computer Architecture Wrap-Up Randal E. Bryant Carnegie Mellon University http://csapp.cs.cmu.edu CS:APP

Overview Wrap-Up of PIPE Design Performance analysis Fetch stage design Exceptional conditions Modern High-Performance Processors Out-of-order execution 2 CS:APP

Performance Metrics Clock rate Measured in Megahertz or Gigahertz Function of stage partitioning and circuit design Keep amount of work per stage small Rate at which instructions executed CPI: cycles per instruction On average, how many clock cycles does each instruction require? Function of pipeline design and benchmark programs E.g., how frequently are branches mispredicted? 3 CS:APP

CPI for PIPE CPI 1.0 Fetch instruction each clock cycle Effectively process new instruction almost every cycle Although each individual instruction has latency of 5 cycles CPI > 1.0 Sometimes must stall or cancel branches Computing CPI C clock cycles I instructions executed to completion B bubbles injected (C = I + B) CPI = C/I = (I+B)/I = 1.0 + B/I Factor B/I represents average penalty due to bubbles 4 CS:APP

CPI for PIPE (Cont.) B/I = LP + MP + RP LP: Penalty due to load/use hazard stalling Typical Values Fraction of instructions that are loads 0.25 Fraction of load instructions requiring stall 0.20 Number of bubbles injected each time 1 LP = 0.25 * 0.20 * 1 = 0.05 MP: Penalty due to mispredicted branches Fraction of instructions that are cond. jumps 0.20 Fraction of cond. jumps mispredicted 0.40 Number of bubbles injected each time 2 MP = 0.20 * 0.40 * 2 = 0.16 RP: Penalty due to ret instructions Fraction of instructions that are returns 0.02 Number of bubbles injected each time 3 RP = 0.02 * 3 = 0.06 Net effect of penalties 0.05 + 0.16 + 0.06 = 0.27 CPI = 1.27 (Not bad!) 5 CS:APP

Fetch Logic Revisited During Fetch Cycle 1. Select PC D icode ifun ra rb valc valp M_icode M_Bch M_valA W_icode W_valM 2. Read bytes from instruction memory 3. Examine icode to determine instruction length 4. Increment PC Instr valid Split Byte 0 Instruction memory Align Bytes 1-5 Need valc Need regids PC increment Predict PC Timing Steps 2 & 4 require significant amount of time F Select PC predpc 6 CS:APP

Standard Fetch Timing Select PC need_regids, need_valc Mem. Read Increment 1 clock cycle Must Perform Everything in Sequence Can t compute incremented PC until know how much to increment it by 7 CS:APP

A Fast PC Increment Circuit incrpc High-order 29 bits MUX 0 1 carry Low-order 3 bits Slow 29-bit incrementer 3-bit adder Fast High-order 29 bits Low-order 3 bits need_regids 0 need_valc PC 8 CS:APP

Modified Fetch Timing Select PC need_regids, need_valc 3-bit add Mem. Read MUX Incrementer Standard cycle 29-Bit Incrementer 1 clock cycle Acts as soon as PC selected Output not needed until final MUX Works in parallel with memory read 9 CS:APP

More Realistic Fetch Logic Other PC Controls Fetch Control Instr. Length Byte 0 Current Instruction Bytes 1-5 Fetch Box Instruction Cache Integrated into instruction cache Fetches entire cache block (16 or 32 bytes) Selects current instruction from current block Works ahead to fetch next block As reaches end of current block At branch target Current Block Next Block 10 CS:APP

Exceptions Conditions under which pipeline cannot continue normal operation Causes Halt instruction Bad address for instruction or data Invalid instruction Pipeline control error Desired Action Complete some instructions (Current) (Previous) (Previous) (Previous) Either current or previous (depends on exception type) Discard others Call exception handler Like an unexpected procedure call 11 CS:APP

Exception Examples Detect in Fetch Stage jmp $-1 # Invalid jump target.byte 0xFF # Invalid instruction code halt # Halt instruction Detect in Memory Stage irmovl $100,%eax rmmovl %eax,0x10000(%eax) # invalid address 12 CS:APP

Exceptions in Pipeline Processor #1 # demo-exc1.ys irmovl $100,%eax rmmovl %eax,0x10000(%eax) # Invalid address nop.byte 0xFF # Invalid instruction code 1 2 3 4 0x000: irmovl $100,%eax F D E M 0x006: rmmovl %eax,0x1000(%eax) F D E 0x00c: nop F D 0x00d:.byte 0xFF F Exception detected 5 W M E D Exception detected Desired Behavior rmmovl should cause exception 13 CS:APP

Exceptions in Pipeline Processor #2 # demo-exc2.ys 0x000: xorl %eax,%eax # Set condition codes 0x002: jne t # Not taken 0x007: irmovl $1,%eax 0x00d: irmovl $2,%edx 0x013: halt 0x014: t:.byte 0xFF # Target 1 2 3 4 5 6 7 8 9 0x000: xorl %eax,%eax 0x002: jne t 0x014: t:.byte 0xFF 0x???: (I m lost!) 0x007: irmovl $1,%eax F D E F D F M E D F W M E D F M E D W M E W M W Desired Behavior No exception should occur Exception detected 14 CS:APP

Maintaining Exception Ordering W exc icode vale valm dste dstm M exc icode Bch vale vala dste dstm E exc icode ifun valc vala valb dste dstm srca srcb D exc icode ifun ra rb valc valp F predpc Add exception status field to pipeline registers Fetch stage sets to either AOK, ADR (when bad fetch address), or INS (illegal instruction) Decode & execute pass values through Memory either passes through or sets to ADR Exception triggered only when instruction hits write back 15 CS:APP

Side Effects in Pipeline Processor # demo-exc3.ys irmovl $100,%eax rmmovl %eax,0x10000(%eax) # invalid address addl %eax,%eax # Sets condition codes 1 2 3 4 0x000: irmovl $100,%eax F D E M 0x006: rmmovl %eax,0x1000(%eax) F D E 0x00c: addl %eax,%eax F D 5 W M E Exception detected Condition code set Desired Behavior rmmovl should cause exception No following instruction should have any effect 16 CS:APP

Avoiding Side Effects Presence of Exception Should Disable State Update When detect exception in memory stage Disable condition code setting in execute Must happen in same clock cycle When exception passes to write-back stage Disable memory write in memory stage Disable condition code setting in execute stage Implementation Hardwired into the design of the PIPE simulator You have no control over this 17 CS:APP

Rest of Exception Handling Calling Exception Handler Push PC onto stack Either PC of faulting instruction or of next instruction Usually pass through pipeline along with exception status Jump to handler address Usually fixed address Defined as part of ISA Implementation Haven t tried it yet! 18 CS:APP

Modern CPU Design Register Updates Retirement Unit Register File Prediction OK? Instruction Control Fetch Control Instruction Decode Address Instruction Instructions Cache Operations Integer/ Branch General Integer FP Add FP Mult/Div Load Store Functional Units Operation Results Addr. Addr. Data Data Execution Data Cache 19 CS:APP

Instruction Control Retirement Unit Register File Instruction Control Address Fetch Control Instruction Instructions Cache Instruction Decode Operations Grabs Instruction Bytes From Memory Based on Current PC + Predicted Targets for Predicted Branches Hardware dynamically guesses whether branches taken/not taken and (possibly) branch target Translates Instructions Into Operations Primitive steps required to perform instruction Typical instruction requires 1 3 operations Converts Register References Into Tags Abstract identifier linking destination of one operation with sources of later operations 20 CS:APP

Execution Unit Register Updates Prediction OK? Operations Integer/ Branch General Integer FP Add FP Mult/Div Load Store Functional Units Operation Results Addr. Addr. Data Data Execution Execution Data Cache Multiple functional units Each can operate in independently Operations performed as soon as operands available Not necessarily in program order Within limits of functional units Control logic Ensures behavior equivalent to sequential program execution 21 CS:APP

CPU Capabilities of Pentium III Multiple Instructions Can Execute in Parallel 1 load 1 store 2 integer (one may be branch) 1 FP Addition 1 FP Multiplication or Division Some Instructions Take > 1 Cycle, but Can be Pipelined Instruction Latency Cycles/Issue Load / Store 3 1 Integer Multiply 4 1 Integer Divide 36 36 Double/Single FP Multiply 5 2 Double/Single FP Add 3 1 Double/Single FP Divide 38 38 22 CS:APP

PentiumPro Block Diagram P6 Microarchitecture PentiumPro Pentium II Pentium III Microprocessor Report 2/16/95

PentiumPro Operation Translates instructions dynamically into Uops 118 bits wide Holds operation, two sources, and destination Executes Uops with Out of Order engine Uop executed when Operands available Functional unit available Execution controlled by Reservation Stations Keeps track of data dependencies between uops Allocates resources 24 CS:APP

PentiumPro Branch Prediction Critical to Performance 11 15 cycle penalty for misprediction Branch Target Buffer 512 entries 4 bits of history Adaptive algorithm Can recognize repeated patterns, e.g., alternating taken not taken Handling BTB misses Detect in cycle 6 Predict taken for negative offset, not taken for positive Loops vs. conditionals 25 CS:APP

Example Branch Prediction Branch History Encode information about prior history of branch instructions Predict whether or not branch will be taken NT NT NT T Yes! Yes? No? No! NT State Machine T T T Each time branch taken, transition to right When not taken, transition to left Predict branch taken when in state Yes! or Yes? 26 CS:APP

Pentium 4 Block Diagram Intel Tech. Journal Q1, 2001 Next generation microarchitecture 27 CS:APP

Pentium 4 Features Trace Cache L2 Cache IA32 Instrs. Instruct. Decoder uops Trace Cache Operations Replaces traditional instruction cache Caches instructions in decoded form Reduces required rate for instruction decoder Double-Pumped ALUs Simple instructions (add) run at 2X clock rate Very Deep Pipeline 20+ cycle branch penalty Enables very high clock rates Slower than Pentium III for a given clock rate 28 CS:APP

Processor Summary Design Technique Create uniform framework for all instructions Want to share hardware among instructions Connect standard logic blocks with bits of control logic Operation State held in memories and clocked registers Computation done by combinational logic Clocking of registers/memories sufficient to control overall behavior Enhancing Performance Pipelining increases throughput and improves resource utilization Must make sure maintains ISA behavior 29 CS:APP