Agenda. Recap: Adding branches to datapath. Adding jalr to datapath. CS 61C: Great Ideas in Computer Architecture

/5/7 CS 6C: Great Ideas in Computer Architecture Lecture : Control & Operating Speed Krste Asanović & Randy Katz http://insteecsberkeleyedu/~cs6c/fa7 CS 6c Lecture : Control & Performance Recap: Adding branches to datapath Implementing JALR Instruction (I-Format) inst[:7] inst[9:5] A DataA inst[4:] B DataB inst[3:7] Imm imm[3:] Reg[rs] Reg[rs] Branch DataR DataW mem inst[3:] ImmSel RegWEn BrUnBrEq BrLT ASel BSel Sel MemRW WBSel JALR rd, rs, immediate Writes PC to Reg[rd] (return address) Sets PC = Reg[rs] + immediate Uses same immediates as arithmetic and loads no multiplication by bytes CS 6c 3 4 Adding jalr to datapath Adding jalr to datapath inst[:7] inst[9:5] inst[4:] A DataA B DataB Reg[rs] Reg[rs] Branch DataR inst[:7] inst[9:5] inst[4:] A DataA B DataB Reg[rs] Reg[rs] Branch DataR inst[3:7] Imm imm[3:] inst[3:7] Imm imm[3:] inst[3:] ImmSel RegWEn BrUnBrEq BrLT ASel BSel Sel MemRW WBSel inst[3:] ImmSel=I RegWEn= Bsel= Asel= MemRW=Read WBSel= BrUn=* BrEq=* BrLT=* Sel=Add CS 6c 5 CS 6c 6

/5/7 Implementing jal Instruction Adding jal to datapath JAL saves PC in Reg[rd] (the return address) Set PC = PC + offset (PC-relative jump) Target somewhere within ± 9 locations, bytes apart ± 8 3-bit instructions Immediate encoding optimized similarly to branch instruction to reduce hardware cost inst[:7] inst[9:5] A DataA inst[4:] B DataB inst[3:7] Imm imm[3:] Reg[rs] Reg[rs] Branch DataR DataW mem inst[3:] ImmSel RegWEn BrUnBrEq BrLT ASel BSel Sel MemRW WBSel 7 CS 6c 8 Adding jal to datapath Upper Immediate instructions inst[:7] inst[9:5] A DataA inst[4:] B DataB inst[3:7] Imm imm[3:] inst[3:] ImmSel=J RegWEn= Reg[rs] Reg[rs] Branch Bsel= BrUn=* BrEq=* BrLT=* Asel= Sel=Add DataR DataW MemRW=Read mem WBSel= Has -bit immediate in upper bits of 3-bit instruction word One destination register, rd Used for two instructions LUI Load Upper Immediate (add to zero) AUIPC Add Upper Immediate to PC CS 6c 9 Implementing lui Implementing aui inst[:7] inst[9:5] inst[4:] A DataA B DataB Reg[rs] Reg[rs] Branch DataR inst[:7] inst[9:5] inst[4:] A DataA B DataB Reg[rs] Reg[rs] Branch DataR inst[3:7] Imm imm[3:] inst[3:7] Imm imm[3:] = inst[3:] ImmSel=U RegWEn= Bsel=Asel=* Sel=B MemRW=Read WBSel= = inst[3:] ImmSel=U RegWEn= Bsel=Asel=Sel=Add MemRW= WBSel= BrUn=* BrE=* BrLT=* BrUn=* BrE=* BrLT=* CS 6c CS 6c

/5/7 Recap: Complete RV3I ISA Single-Cycle RISC-V RV3I Datapath inst[:7] inst[9:5] A DataA inst[4:] B DataB Reg[rs] Reg[rs] Branch DataR DataW mem Not in CS6C inst[3:7] Imm imm[3:] RV3I has 47 instructions total 37 instructions covered in CS6C 3 inst[3:] ImmSel RegWEn BrUnBrEq BrLT ASel BSel Sel MemRW WBSel CS 6c 4 CS 6c Lecture : Control & Performance 5 Processor Control Datapath PC Registers Arithmetic & Logic Unit () Processor Enable? Read/Write ess Write Data Read Data Memory Program Bytes Processor-Memory Interface CS 6c Lecture : Control & Performance 6 Data Single-Cycle RISC-V RV3I Datapath Control Logic Truth Table (incomplete) Inst[3:] BrEq BrLT ImmSel BrUn ASel BSel Sel MemRW RegWEn WBSel add * * * * Reg Reg Add Read sub * * * * Reg Reg Sub Read inst[:7] inst[9:5] inst[4:] A DataA B DataB Reg[rs] Reg[rs] Branch DataR (R-R Op) * * * * Reg Reg (Op) Read addi * * I * Reg Imm Add Read lw * * I * Reg Imm Add Read Mem sw * * S * Reg Imm Add Write * beq * B * PC Imm Add Read * beq * B * PC Imm Add Read * inst[3:7] Imm imm[3:] bne * B * PC Imm Add Read * bne * B * PC Imm Add Read * blt * B PC Imm Add Read * bltu * B PC Imm Add Read * inst[3:] ImmSel RegWEn BrUnBrEq BrLT ASel BSel Sel MemRW WBSel jalr * * I * Reg Imm Add Read PC Control Logic CS 6c 7 jal * * J * PC Imm Add Read PC aui * * U * PC Imm Add Read CS 6c Lecture : Control & Performance 8 3

/5/7 Control Realization Options ROM Read-Only Memory Regular structure Can be easily reprogrammed fix errors add instructions Popular when designing control logic manually Combinatorial Logic Today, chip designers use logic synthesis tools to convert truth tables to networks of gates CS 6c Lecture : Control & Performance 9 RV3I, a nine-bit ISA! Instruction type encoded using only 9 bits inst[3],inst[4:], inst[6:] inst[3] inst[4:] inst[6:] Not in CS6C ROM-based Control -bit address (inputs) Inst[3,4:,6:] BrEq BrLT 9 ROM ImmSel[:] BrUn ASel BSel Sel[3:] MemRW RegWEn WBSel[:] CS 6c Lecture : Control & Performance 3 4 5 data bits (outputs) Inst[] BrEQ BrLT ROM Controller Implementation ess Decoder add sub or jal Control Word for add Control Word for sub Control Word for or CS 6c Lecture : Control & Performance Controller output (, ImmSel, ) Administrivia Homework Due tomorrow :59 pm Project Part Due Monday Oct 9 Part due Monday Oct 6 Midterm Regrades due next Tuesday Talk to a TA if you don t understand a midterm question or are unsure of a regrade Break! CS 6c Lecture : Control & Performance 3 /5/7 4 4

/5/7 CS 6c Lecture : Control & Performance 5 Instruction Timing IF ID EX MEM WB Total I-MEM Reg Read D-MEM Reg W ps ps ps ps ps 8 ps CS 6c Lecture : Control & Performance 6 Maximum clock frequency f max = /8ps = 5 GHz Instruction Timing Instr IF = ps ID = ps = ps MEM=ps WB = ps Total add X X X X 6ps beq X X X 5ps jal X X X 5ps lw X X X X X 8ps sw X X X X 7ps Most blocks idle most of the time Eg f max, = /ps = 5 GHz! How can we keep busy all the time? 5 billion adds/sec, rather than just 5 billion? Idea: Factories use three employee shifts - equipment is always busy! CS 6c Lecture : Control & Performance 8 Performance Measures Our RISC-V executes instructions at 5 GHz instruction every 8 ps Can we improve its performance? What do we mean with this statement? Not so obvious: Quicker response time, so one job finishes faster? More jobs per unit time (eg web server returning pages)? Longer battery life? CS 6c Lecture : Control & Performance 9 Transportation Analogy Sports Car Bus Passenger Capacity 5 Travel Speed mph 5 mph Gas Mileage 5 mpg mpg 5 Mile trip: Sports Car Bus Travel Time 5 min 6 min Time for passengers 75 min min Gallons per passenger 5 gallons 5 gallons CS 6c Lecture : Control & Performance 3 5

/5/7 Transportation Trip Time Time for passengers Gallons per passenger Computer Analogy Computer Program execution time: eg time to update display Throughput: eg number of server requests handled per hour Energy per task*: eg how many movies you can watch per battery charge or energy bill for datacenter * Note: power is not a good measure, since low-power CPU might run for a long time to complete one task consuming more energy than faster computer running at higher power for a shorter time 3 Iron Law of Processor Performance Time = Instructions Cycles Time Program Program * Instruction * Cycle 3 Instructions per Program Determined by Task Algorithm, eg O(N ) vs O(N) Programming language Compiler Instruction Set Architecture (ISA) (Average) Clock cycles per Instruction Determined by ISA Processor implementation (or microarchitecture) Eg for our single-cycle RISC-V design, CPI = Complex instructions (eg strcpy), CPI >> Superscalar processors, CPI < (next lecture) CS 6c Lecture : Control & Performance 33 CS 6c Lecture : Control & Performance 34 Time per Cycle (/Frequency) Determined by Processor microarchitecture (determines critical path through logic gates) Technology (eg 4nm versus 8nm) Power budget (lower voltages reduce transistor speed) CS 6c Lecture : Control & Performance 35 Speed Tradeoff Example For some task (eg image compression) Processor A Processor B # Instructions Million 5 Million Average CPI 5 Clock rate f 5 GHz GHz Execution time ms 75 ms Processor B is faster for this task, despite executing more instructions and having a lower clock rate! CS 6c Lecture : Control & Performance 36 6

/5/7 Energy per Task Energy = Instructions Energy Program Program * Instruction Energy α Instructions * C V Program Program Energy Tradeoff Example Next-generation processor C (Moore s Law): -5 % Supply voltage, V sup : -5 % Energy consumption: - (-85) 3 = -39 % Capacitance depends on technology, processor features eg # of cores Supply voltage, eg V Significantly improved energy efficiency thanks to Moore s Law AND Reduced supply voltage Want to reduce capacitance and voltage to reduce energy/task 37 CS 6c Lecture : Control & Performance 38 39 Energy Iron Law Performance = Power * Energy Efficiency (Tasks/Second) (Joules/Second) (Tasks/Joule) Energy efficiency (eg, instructions/joule) is key metric in all computing devices For power-constrained systems (eg, MW datacenter), need better energy efficiency to get more performance at same power For energy-constrained systems (eg, W phone), need better energy efficiency to prolong battery life End of Scaling In recent years, industry has not been able to reduce supply voltage much, as reducing it further would mean increasing leakage power where transistor switches don t fully turn off (more like dimmer switch than on-off switch) Also, size of transistors and hence capacitance, not shrinking as much as before between transistor generations Power becomes a growing concern the power wall Cost-effective air-cooled chip limit around ~5W CS 6c Lecture : Control & Performance 4 Processor Trends Break! Transistors (Thousands) Frequency (MHz) Power (W) Cores 97 975 98 985 99 995 5 [Olukotun, Hammond,Sutter,Smith,Batten] 4 /5/7 4 7

/5/7 CS 6c Lecture : Control & Performance 43 A familiar example: Getting a university degree Pipelining Year Year Year 3 Year 4 Shortage of Computer scientists (your startup is growing): How long does it take to educate 6,? CS 6c Lecture : Control & Performance 44 Computer Scientist Education Option : serial 4 enter Option : pipelining year 4 graduate 4 graduate 4 graduate 6, in 6 years, average throughput is /year 4 6, in 7 years Steady state throughput is 4/year 4 graduate Resources used efficiently 4 graduate 4-fold improvement over serial education 4 graduate 4 graduate 7 years CS 6c Lecture : Control & Performance 45 Latency versus Throughput Latency Time from entering college to graduation Serial Pipelining Throughput Average number of students graduating each year Serial Pipelining 4 Pipelining Increases throughput (4x in this example) But does nothing to latency sometimes worse (additional overhead eg for shift transition) CS 6c Lecture : Control & Performance 46 Simultaneous versus Sequential What happens sequentially? What happens simultaneously? 4 graduate 4 graduate 4 graduate 4 graduate 4 graduate 4 graduate 4 graduate 4 graduate 4 graduate 4 graduate CS 6c Lecture : Control & Performance 47 CS 6c Lecture : Control & Performance 48 8

/5/7 instruction sequence Pipelining with RISC-V Phase Pictogram tstep Serial tcycle Pipelined Instruction Fetch ps ps Reg Read ps ps ps ps Memory ps ps Register Write ps ps tinstruction 8 ps ps t instruction add t, t, t or t3, t4, t5 sll t6, t, t3 CS 6c t cycle 49 instruction sequence Pipelining with RISC-V add t, t, t or t3, t4, t5 sll t6, t, t3 t instruction t cycle Single Cycle Pipelining Timing tstep = ps tcycle = ps Register access only ps All cycles same length Instruction time, tinstruction = tcycle = 8 ps ps Clock rate, fs /8 ps = 5 GHz / ps = 5 GHz Relative speed x 4 x CS 6c 5 instruction sequence Sequential vs Simultaneous add t, t, t or t3, t4, t5 sll t6, t, t3 sw t, 4(t3) lw t, 8(t3) addi t, t, What happens sequentially, what happens simultaneously? t instruction = ps t cycle = ps CS 6c Lecture : Control & Performance 5 And in Conclusion, Tells universal datapath how to execute each instruction Instruction timing Set by instruction complexity, architecture, technology Pipelining increases clock frequency, instructions per second But does not reduce time to complete instruction Performance measures Different measures depending on objective Response time Jobs / second Energy per task CS 6c Lecture : Control & Performance 53 9