Agenda. Recap: Adding branches to datapath. Adding jalr to datapath. CS 61C: Great Ideas in Computer Architecture

Similar documents
CS 61C: Great Ideas in Computer Architecture. Lecture 13: Pipelining. Krste Asanović & Randy Katz

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 1

10/11/17. New-School Machine Structures. Review: Single Cycle Instruction Timing. Review: Single-Cycle RISC-V RV32I Datapath. Components of a Computer

Programmable Machines

CS 61C: Great Ideas in Computer Architecture RISC-V Instruction Formats

9/14/17. Levels of Representation/Interpretation. Big Idea: Stored-Program Computer

Programmable Machines

Department of Electrical Engineering and Computer Sciences Fall 2017 Instructors: Randy Katz, Krste Asanovic CS61C MIDTERM 2

Lecture 4 - Pipelining

Computer Architecture. Lecture 6.1: Fundamentals of

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

Pipelining. CSC Friday, November 6, 2015


Chapter 4. The Processor

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

L19 Pipelined CPU I 1. Where are the registers? Study Chapter 6 of Text. Pipelined CPUs. Comp 411 Fall /07/07

Pipelined CPUs. Study Chapter 4 of Text. Where are the registers?

CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards

Lecture 10: Simple Data Path

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19

EECS 151/251A: SPRING 17 MIDTERM 2 SOLUTIONS

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 13 EE141

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

UC Berkeley CS61C : Machine Structures

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

Full Name: NetID: Midterm Summer 2017

Chapter 4. The Processor

Chapter 4. The Processor

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions.

CPU Performance Pipelined CPU

Advanced processor designs

COSC3330 Computer Architecture Lecture 7. Datapath and Performance

Chapter 4 The Processor 1. Chapter 4A. The Processor

CS3350B Computer Architecture Quiz 3 March 15, 2018

Truth Table! Review! Use this table and techniques we learned to transform from 1 to another! ! Gate Diagram! Today!

CSE Computer Architecture I Fall 2009 Lecture 13 In Class Notes and Problems October 6, 2009

Concocting an Instruction Set

CS 61C: Great Ideas in Computer Architecture Datapath. Instructors: John Wawrzynek & Vladimir Stojanovic

Lecture 3: Single Cycle Microarchitecture. James C. Hoe Department of ECE Carnegie Mellon University

Chapter 4 The Processor 1. Chapter 4B. The Processor

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Computer Systems Architecture Spring 2016

CS61C : Machine Structures

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

COSC 6385 Computer Architecture - Pipelining

Lecture 08: RISC-V Single-Cycle Implementa9on CSE 564 Computer Architecture Summer 2017

Computer Organization and Structure

COMPUTER ORGANIZATION AND DESIGN

ECE 2300 Digital Logic & Computer Organization. More Single Cycle Microprocessor

ISA and RISCV. CASS 2018 Lavanya Ramapantulu

EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution

Processor Architecture

ECE331: Hardware Organization and Design

Review. N-bit adder-subtractor done using N 1- bit adders with XOR gates on input. Lecture #19 Designing a Single-Cycle CPU

CS/COE1541: Introduction to Computer Architecture

Review: Abstract Implementation View

CS 61C: Great Ideas in Computer Architecture. MIPS CPU Datapath, Control Introduction

LECTURE 3: THE PROCESSOR

Clever Signed Adder/Subtractor. Five Components of a Computer. The CPU. Stages of the Datapath (1/5) Stages of the Datapath : Overview

CS3350B Computer Architecture Winter 2015

COMPUTER ORGANIZATION AND DESIGN

10/19/17. You Are Here! Review: Direct-Mapped Cache. Typical Memory Hierarchy

COMPUTER ORGANIZATION AND DESI

RISC Pipeline. Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. See: P&H Chapter 4.6

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Lecture 2: RISC V Instruction Set Architecture. Housekeeping

Concocting an Instruction Set

CPS104 Computer Organization and Programming Lecture 19: Pipelining. Robert Wagner

Processor (I) - datapath & control. Hwansoo Han

Multicore and Parallel Processing

CS 230 Practice Final Exam & Actual Take-home Question. Part I: Assembly and Machine Languages (22 pts)

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Lecture 2: RISC V Instruction Set Architecture. James C. Hoe Department of ECE Carnegie Mellon University

CS 61C: Great Ideas in Computer Architecture Introduction to Assembly Language and RISC-V Instruction Set Architecture

TDT4255 Computer Design. Lecture 4. Magnus Jahre. TDT4255 Computer Design

Midterm. Sticker winners: if you got >= 50 / 67

CS 152, Spring 2011 Section 2

Computer Architecture

CISC 662 Graduate Computer Architecture. Lecture 4 - ISA

EECS150 - Digital Design Lecture 10- CPU Microarchitecture. Processor Microarchitecture Introduction

ELEC / Computer Architecture and Design Fall 2013 Instruction Set Architecture (Chapter 2)

The Processor: Instruction-Level Parallelism

CS 152 Computer Architecture and Engineering

Modern Computer Architecture

CS 61C: Great Ideas in Computer Architecture Control and Pipelining

Chapter 4. The Processor

A General-Purpose Computer The von Neumann Model. Concocting an Instruction Set. Meaning of an Instruction. Anatomy of an Instruction

Lecture 4: MIPS Instruction Set

ECE 4750 Computer Architecture, Fall 2014 T01 Single-Cycle Processors

6.823 Computer System Architecture Datapath for DLX Problem Set #2

are Softw Instruction Set Architecture Microarchitecture are rdw

Chapter 4. The Processor. Computer Architecture and IC Design Lab

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: MIPS Instruction Set Architecture

CS61C : Machine Structures

CSEN 601: Computer System Architecture Summer 2014

Design of Digital Circuits 2017 Srdjan Capkun Onur Mutlu (Guest starring: Frank K. Gürkaynak and Aanjhan Ranganathan)

Transcription:

/5/7 CS 6C: Great Ideas in Computer Architecture Lecture : Control & Operating Speed Krste Asanović & Randy Katz http://insteecsberkeleyedu/~cs6c/fa7 CS 6c Lecture : Control & Performance Recap: Adding branches to datapath Implementing JALR Instruction (I-Format) inst[:7] inst[9:5] A DataA inst[4:] B DataB inst[3:7] Imm imm[3:] Reg[rs] Reg[rs] Branch DataR DataW mem inst[3:] ImmSel RegWEn BrUnBrEq BrLT ASel BSel Sel MemRW WBSel JALR rd, rs, immediate Writes PC to Reg[rd] (return address) Sets PC = Reg[rs] + immediate Uses same immediates as arithmetic and loads no multiplication by bytes CS 6c 3 4 Adding jalr to datapath Adding jalr to datapath inst[:7] inst[9:5] inst[4:] A DataA B DataB Reg[rs] Reg[rs] Branch DataR inst[:7] inst[9:5] inst[4:] A DataA B DataB Reg[rs] Reg[rs] Branch DataR inst[3:7] Imm imm[3:] inst[3:7] Imm imm[3:] inst[3:] ImmSel RegWEn BrUnBrEq BrLT ASel BSel Sel MemRW WBSel inst[3:] ImmSel=I RegWEn= Bsel= Asel= MemRW=Read WBSel= BrUn=* BrEq=* BrLT=* Sel=Add CS 6c 5 CS 6c 6

/5/7 Implementing jal Instruction Adding jal to datapath JAL saves PC in Reg[rd] (the return address) Set PC = PC + offset (PC-relative jump) Target somewhere within ± 9 locations, bytes apart ± 8 3-bit instructions Immediate encoding optimized similarly to branch instruction to reduce hardware cost inst[:7] inst[9:5] A DataA inst[4:] B DataB inst[3:7] Imm imm[3:] Reg[rs] Reg[rs] Branch DataR DataW mem inst[3:] ImmSel RegWEn BrUnBrEq BrLT ASel BSel Sel MemRW WBSel 7 CS 6c 8 Adding jal to datapath Upper Immediate instructions inst[:7] inst[9:5] A DataA inst[4:] B DataB inst[3:7] Imm imm[3:] inst[3:] ImmSel=J RegWEn= Reg[rs] Reg[rs] Branch Bsel= BrUn=* BrEq=* BrLT=* Asel= Sel=Add DataR DataW MemRW=Read mem WBSel= Has -bit immediate in upper bits of 3-bit instruction word One destination register, rd Used for two instructions LUI Load Upper Immediate (add to zero) AUIPC Add Upper Immediate to PC CS 6c 9 Implementing lui Implementing aui inst[:7] inst[9:5] inst[4:] A DataA B DataB Reg[rs] Reg[rs] Branch DataR inst[:7] inst[9:5] inst[4:] A DataA B DataB Reg[rs] Reg[rs] Branch DataR inst[3:7] Imm imm[3:] inst[3:7] Imm imm[3:] = inst[3:] ImmSel=U RegWEn= Bsel=Asel=* Sel=B MemRW=Read WBSel= = inst[3:] ImmSel=U RegWEn= Bsel=Asel=Sel=Add MemRW= WBSel= BrUn=* BrE=* BrLT=* BrUn=* BrE=* BrLT=* CS 6c CS 6c

/5/7 Recap: Complete RV3I ISA Single-Cycle RISC-V RV3I Datapath inst[:7] inst[9:5] A DataA inst[4:] B DataB Reg[rs] Reg[rs] Branch DataR DataW mem Not in CS6C inst[3:7] Imm imm[3:] RV3I has 47 instructions total 37 instructions covered in CS6C 3 inst[3:] ImmSel RegWEn BrUnBrEq BrLT ASel BSel Sel MemRW WBSel CS 6c 4 CS 6c Lecture : Control & Performance 5 Processor Control Datapath PC Registers Arithmetic & Logic Unit () Processor Enable? Read/Write ess Write Data Read Data Memory Program Bytes Processor-Memory Interface CS 6c Lecture : Control & Performance 6 Data Single-Cycle RISC-V RV3I Datapath Control Logic Truth Table (incomplete) Inst[3:] BrEq BrLT ImmSel BrUn ASel BSel Sel MemRW RegWEn WBSel add * * * * Reg Reg Add Read sub * * * * Reg Reg Sub Read inst[:7] inst[9:5] inst[4:] A DataA B DataB Reg[rs] Reg[rs] Branch DataR (R-R Op) * * * * Reg Reg (Op) Read addi * * I * Reg Imm Add Read lw * * I * Reg Imm Add Read Mem sw * * S * Reg Imm Add Write * beq * B * PC Imm Add Read * beq * B * PC Imm Add Read * inst[3:7] Imm imm[3:] bne * B * PC Imm Add Read * bne * B * PC Imm Add Read * blt * B PC Imm Add Read * bltu * B PC Imm Add Read * inst[3:] ImmSel RegWEn BrUnBrEq BrLT ASel BSel Sel MemRW WBSel jalr * * I * Reg Imm Add Read PC Control Logic CS 6c 7 jal * * J * PC Imm Add Read PC aui * * U * PC Imm Add Read CS 6c Lecture : Control & Performance 8 3

/5/7 Control Realization Options ROM Read-Only Memory Regular structure Can be easily reprogrammed fix errors add instructions Popular when designing control logic manually Combinatorial Logic Today, chip designers use logic synthesis tools to convert truth tables to networks of gates CS 6c Lecture : Control & Performance 9 RV3I, a nine-bit ISA! Instruction type encoded using only 9 bits inst[3],inst[4:], inst[6:] inst[3] inst[4:] inst[6:] Not in CS6C ROM-based Control -bit address (inputs) Inst[3,4:,6:] BrEq BrLT 9 ROM ImmSel[:] BrUn ASel BSel Sel[3:] MemRW RegWEn WBSel[:] CS 6c Lecture : Control & Performance 3 4 5 data bits (outputs) Inst[] BrEQ BrLT ROM Controller Implementation ess Decoder add sub or jal Control Word for add Control Word for sub Control Word for or CS 6c Lecture : Control & Performance Controller output (, ImmSel, ) Administrivia Homework Due tomorrow :59 pm Project Part Due Monday Oct 9 Part due Monday Oct 6 Midterm Regrades due next Tuesday Talk to a TA if you don t understand a midterm question or are unsure of a regrade Break! CS 6c Lecture : Control & Performance 3 /5/7 4 4

/5/7 CS 6c Lecture : Control & Performance 5 Instruction Timing IF ID EX MEM WB Total I-MEM Reg Read D-MEM Reg W ps ps ps ps ps 8 ps CS 6c Lecture : Control & Performance 6 Maximum clock frequency f max = /8ps = 5 GHz Instruction Timing Instr IF = ps ID = ps = ps MEM=ps WB = ps Total add X X X X 6ps beq X X X 5ps jal X X X 5ps lw X X X X X 8ps sw X X X X 7ps Most blocks idle most of the time Eg f max, = /ps = 5 GHz! How can we keep busy all the time? 5 billion adds/sec, rather than just 5 billion? Idea: Factories use three employee shifts - equipment is always busy! CS 6c Lecture : Control & Performance 8 Performance Measures Our RISC-V executes instructions at 5 GHz instruction every 8 ps Can we improve its performance? What do we mean with this statement? Not so obvious: Quicker response time, so one job finishes faster? More jobs per unit time (eg web server returning pages)? Longer battery life? CS 6c Lecture : Control & Performance 9 Transportation Analogy Sports Car Bus Passenger Capacity 5 Travel Speed mph 5 mph Gas Mileage 5 mpg mpg 5 Mile trip: Sports Car Bus Travel Time 5 min 6 min Time for passengers 75 min min Gallons per passenger 5 gallons 5 gallons CS 6c Lecture : Control & Performance 3 5

/5/7 Transportation Trip Time Time for passengers Gallons per passenger Computer Analogy Computer Program execution time: eg time to update display Throughput: eg number of server requests handled per hour Energy per task*: eg how many movies you can watch per battery charge or energy bill for datacenter * Note: power is not a good measure, since low-power CPU might run for a long time to complete one task consuming more energy than faster computer running at higher power for a shorter time 3 Iron Law of Processor Performance Time = Instructions Cycles Time Program Program * Instruction * Cycle 3 Instructions per Program Determined by Task Algorithm, eg O(N ) vs O(N) Programming language Compiler Instruction Set Architecture (ISA) (Average) Clock cycles per Instruction Determined by ISA Processor implementation (or microarchitecture) Eg for our single-cycle RISC-V design, CPI = Complex instructions (eg strcpy), CPI >> Superscalar processors, CPI < (next lecture) CS 6c Lecture : Control & Performance 33 CS 6c Lecture : Control & Performance 34 Time per Cycle (/Frequency) Determined by Processor microarchitecture (determines critical path through logic gates) Technology (eg 4nm versus 8nm) Power budget (lower voltages reduce transistor speed) CS 6c Lecture : Control & Performance 35 Speed Tradeoff Example For some task (eg image compression) Processor A Processor B # Instructions Million 5 Million Average CPI 5 Clock rate f 5 GHz GHz Execution time ms 75 ms Processor B is faster for this task, despite executing more instructions and having a lower clock rate! CS 6c Lecture : Control & Performance 36 6

/5/7 Energy per Task Energy = Instructions Energy Program Program * Instruction Energy α Instructions * C V Program Program Energy Tradeoff Example Next-generation processor C (Moore s Law): -5 % Supply voltage, V sup : -5 % Energy consumption: - (-85) 3 = -39 % Capacitance depends on technology, processor features eg # of cores Supply voltage, eg V Significantly improved energy efficiency thanks to Moore s Law AND Reduced supply voltage Want to reduce capacitance and voltage to reduce energy/task 37 CS 6c Lecture : Control & Performance 38 39 Energy Iron Law Performance = Power * Energy Efficiency (Tasks/Second) (Joules/Second) (Tasks/Joule) Energy efficiency (eg, instructions/joule) is key metric in all computing devices For power-constrained systems (eg, MW datacenter), need better energy efficiency to get more performance at same power For energy-constrained systems (eg, W phone), need better energy efficiency to prolong battery life End of Scaling In recent years, industry has not been able to reduce supply voltage much, as reducing it further would mean increasing leakage power where transistor switches don t fully turn off (more like dimmer switch than on-off switch) Also, size of transistors and hence capacitance, not shrinking as much as before between transistor generations Power becomes a growing concern the power wall Cost-effective air-cooled chip limit around ~5W CS 6c Lecture : Control & Performance 4 Processor Trends Break! Transistors (Thousands) Frequency (MHz) Power (W) Cores 97 975 98 985 99 995 5 [Olukotun, Hammond,Sutter,Smith,Batten] 4 /5/7 4 7

/5/7 CS 6c Lecture : Control & Performance 43 A familiar example: Getting a university degree Pipelining Year Year Year 3 Year 4 Shortage of Computer scientists (your startup is growing): How long does it take to educate 6,? CS 6c Lecture : Control & Performance 44 Computer Scientist Education Option : serial 4 enter Option : pipelining year 4 graduate 4 graduate 4 graduate 6, in 6 years, average throughput is /year 4 6, in 7 years Steady state throughput is 4/year 4 graduate Resources used efficiently 4 graduate 4-fold improvement over serial education 4 graduate 4 graduate 7 years CS 6c Lecture : Control & Performance 45 Latency versus Throughput Latency Time from entering college to graduation Serial Pipelining Throughput Average number of students graduating each year Serial Pipelining 4 Pipelining Increases throughput (4x in this example) But does nothing to latency sometimes worse (additional overhead eg for shift transition) CS 6c Lecture : Control & Performance 46 Simultaneous versus Sequential What happens sequentially? What happens simultaneously? 4 graduate 4 graduate 4 graduate 4 graduate 4 graduate 4 graduate 4 graduate 4 graduate 4 graduate 4 graduate CS 6c Lecture : Control & Performance 47 CS 6c Lecture : Control & Performance 48 8

/5/7 instruction sequence Pipelining with RISC-V Phase Pictogram tstep Serial tcycle Pipelined Instruction Fetch ps ps Reg Read ps ps ps ps Memory ps ps Register Write ps ps tinstruction 8 ps ps t instruction add t, t, t or t3, t4, t5 sll t6, t, t3 CS 6c t cycle 49 instruction sequence Pipelining with RISC-V add t, t, t or t3, t4, t5 sll t6, t, t3 t instruction t cycle Single Cycle Pipelining Timing tstep = ps tcycle = ps Register access only ps All cycles same length Instruction time, tinstruction = tcycle = 8 ps ps Clock rate, fs /8 ps = 5 GHz / ps = 5 GHz Relative speed x 4 x CS 6c 5 instruction sequence Sequential vs Simultaneous add t, t, t or t3, t4, t5 sll t6, t, t3 sw t, 4(t3) lw t, 8(t3) addi t, t, What happens sequentially, what happens simultaneously? t instruction = ps t cycle = ps CS 6c Lecture : Control & Performance 5 And in Conclusion, Tells universal datapath how to execute each instruction Instruction timing Set by instruction complexity, architecture, technology Pipelining increases clock frequency, instructions per second But does not reduce time to complete instruction Performance measures Different measures depending on objective Response time Jobs / second Energy per task CS 6c Lecture : Control & Performance 53 9