CS 194-6 Digital Systems Project Laboratory. Lecture 9: Advanced Processors I


1 CS 194-6 Digital Systems Project Laboratory. Lecture 9: Advanced Processors I. John Lazzaro. TA: Greg Gibeling. www-inst.eecs.berkeley.edu/~cs194-6/

2 Today: Beyond the 5-stage pipeline
Amdahl's Law.
Taxonomy of advanced processing.
Superpipelining: Increasing the number of pipeline stages.
Superscalar: Issuing several instructions in a single cycle.
Hardware support for Virtual Memory and Virtual Machines.

3 Invented the "one ISA, many implementations" business model.

4 Amdahl's Law (of Diminishing Returns)
[Pie chart: where the program spends its time. Multiply accounts for 52%; the other slices (8%, 16%, 16%, 8%) include load and branch instructions.]
If enhancement E makes multiply infinitely fast, but other instructions are unchanged, what is the maximum speedup S?
$S = \dfrac{1}{(\text{post-enhancement \%}) / 100\%} = \dfrac{1}{48\% / 100\%} = 2.08$
Attributed to Gene Amdahl: "Amdahl's Law".
What is the lesson of Amdahl's Law? Must enhance computers in a balanced way!

5 Amdahl's Law in Action
Program we wish to run on N CPUs: Serial 30%, Parallel 70%.
The program spends 30% of its time running code that cannot be recoded to run in parallel.
$S = \dfrac{1}{(30\% + 70\%/N) / 100\%}$, so $S(\infty) = \dfrac{1}{30\% / 100\%} \approx 3.33$
[Plot: speedup versus number of CPUs.]
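The two Amdahl's Law slides can be checked numerically. The sketch below is a minimal illustration, not part of the lecture: the function name `amdahl_speedup` and its parameters are made up for this example. It reproduces the 2.08 limit for an infinitely fast multiply and tabulates the speedup of the 30% serial / 70% parallel program.

```python
def amdahl_speedup(enhanced_fraction, enhancement_factor):
    """Overall speedup when `enhanced_fraction` of the run time is sped up
    by `enhancement_factor` (hypothetical helper, not from the slides).
    Use float("inf") to model an "infinitely fast" enhancement."""
    remaining_time = (1.0 - enhanced_fraction) + enhanced_fraction / enhancement_factor
    return 1.0 / remaining_time


# Slide 4: multiply is 52% of run time and becomes infinitely fast.
print("multiply limit:", round(amdahl_speedup(0.52, float("inf")), 2))

# Slide 5: 70% of the program parallelizes perfectly across N CPUs.
for n_cpus in (1, 2, 4, 8, 16):
    print(n_cpus, "CPUs:", round(amdahl_speedup(0.70, n_cpus), 2))
```

However many CPUs are added, the speedup in the second case stays below 1 / 0.30, because the serial 30% is never shortened.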

6 Real-world 2006: 2 CPUs vs 4 CPUs
20-inch iMac: Core 2 Duo, 2.16 GHz, $1500.
Mac Pro: 2 dual-core Xeons, 2.66 GHz, $3200 with 20-inch display.

7 Real-world 2006: 2 CPUs vs 4 CPUs
iMac: 2 cores on one die. Mac Pro: 4 cores on two dies. Source: MACWORLD.
Simple video task: easier to parallelize. ZIPing a file: very difficult to parallelize.
Caveat: Mac Pro CPUs are server-class and have architectural advantages (better I/O, ECC DRAM, etc.).

8 Taxonomy

9 5-Stage Pipeline: A point of departure
$\dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
[Figure: pipeline datapath: IM, Reg, ALU, DM, Reg.]
At best, the 5-stage pipeline executes one instruction per clock, with a clock period determined by the slowest stage. This best case assumes:
Perfect caching.
Filling all delay slots (branch, load).
Processor has no multi-cycle instructions (ex: multiply with an accumulate register).

10 Superpipelining: Add more stages (Today!)
$\dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
Also, power!
Goal: Reduce critical path by adding more pipeline stages.
Example: 8-stage ARM XScale: extra IF, ID, data cache stages.
Difficulties: Added penalties for load delays and branch misses.
Ultimate Limiter: As logic delay goes to 0, flip-flop clk-to-q and setup times remain.

11 Superscalar: Multiple issues per cycle (Today!)
$\dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
Goal: Improve CPI by issuing several instructions per cycle.
Example: CPU with floating point ALUs: Issue 1 FP + 1 Integer instruction per cycle.
Difficulties: Load and branch delays affect more instructions.
Ultimate Limiter: Programs may be a poor match to issue rules.

12 Out of Order: Going around stalls (Next week)
$\dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
Goal: Issue instructions out of program order.
Example: MULTD waiting on F4 to load... so let ADDD go first!
[Figure: example instruction sequence; operands not recoverable.]
Difficulties: Bookkeeping is highly complex. A poor fit for lockstep instruction scheduling.
Ultimate Limiter: The amount of instruction level parallelism present in an application.

13 Dynamic Scheduling: End lockstep (Next week)
Goal: Enable out-of-order by breaking pipeline in two: Fetch and Execution.
Example: IBM Power 5.
[Figure: IBM Power 5 pipeline. Instruction fetch (IF, IC, BP); group formation and instruction decode (D0, D1, D2, D3, Xfer, GD); out-of-order processing in four pipelines: branch (MP, ISS, RF, EX, WB, Xfer), load/store (MP, ISS, RF, EA, DC, Fmt, WB, Xfer), fixed-point (MP, ISS, RF, EX, WB, Xfer), and floating-point (MP, ISS, RF, F6, WB, Xfer); then CP. Also labeled: branch redirects, interrupts and flushes.]
Limiters: Design complexity, instruction level parallelism.

14 Throughput and multiple threads (Next week)
Goal: Use multiple CPUs (real and virtual) to improve (1) throughput of machines that run many programs, (2) execution time of multithreaded programs.
Example: Sun Niagara (8 SPARCs on one chip).
Difficulties: Gaining full advantage requires rewriting applications, OS, libraries.
Ultimate limiter: Amdahl's law, memory system performance.

15 Superpipelining

16 5 Stage: IF, ID+RF, EX, MEM, WB (datapath: IM, Reg, ALU, DM, Reg).
8 Stage:
IF now takes 2 stages (pipelined I-cache).
ID and RF each get a stage.
ALU split over 3 stages.
MEM takes 2 stages (pipelined D-cache).
Note: Some stages now overlap, some instructions take extra stages.

17 Superpipelining techniques...
Split ALU and decode logic over several pipeline stages.
Pipeline memory: Use more banks of smaller arrays, add pipeline stages between decoders, muxes.
Remove "rarely-used" forwarding networks that are on critical path. Creates stalls, affects CPI.
Pipeline the wires of frequently used forwarding networks.
Also: Clocking tricks (example: negedge register file in COD3e pipeline).

18 Recall: IBM Power Timing Closure
"Pipeline engineering" happens here...
From "The circuit and physical design of the POWER4 microprocessor", IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

19 Recall: Pipelining SRAM memories...
Architects specify number of rows and columns. Word and bit lines slow down as array grows larger!
[Figure: 16-word by 4-bit SRAM array. An address decoder (A0 to A3) drives word lines Word 0 to Word 15; Din 0 to Din 3 enter through write drivers and prechargers (WrEn, Precharge); sense amps on the bit lines produce Dout 0 to Dout 3. Parallel data I/O lines.]
How could we pipeline this memory?
Q: Which is longer: word line or bit line?
Add muxes to select subset of bits.

20 ALU: Pipelining Unsigned Multiply
[Worked example of binary multiplication (multiplicand times multiplier); digits not recoverable.]
Facts to remember:
m bits x n bits = m+n bit product.
Binary makes it easy: 0 => place 0 (0 x multiplicand); 1 => place a copy (1 x multiplicand).

21 Building Block: Full-Adder Variant
1-bit signals: x, y, z, s, Cin, Cout.
x: one bit of multiplicand.
y: one bit of the "running sum".
z: one bit of multiplier.
If z = 1, {Cout, s} <= x + y + Cin.
If z = 0, {Cout, s} <= y + Cin.

22 Put it together: Array computes P = A x B
[Figure: 4x4 array of full-adder-variant cells. Each row takes multiplicand bits A3..A0 and one multiplier bit (B0, B1, B2, B3); the array produces product bits P7..P0.]
As drawn, combinational (slow!).
To pipeline array: Place registers between adder stages. Use registers to delay selected A and B bits.
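The full-adder variant and the multiplier array can be modeled in a few lines of software. This is a behavioral sketch only, assuming one row of cells per multiplier bit with the carry rippling along the row; the names `cell` and `array_multiply` are illustrative, and the pipeline registers mentioned on the slide are not modeled.

```python
def cell(x, y, z, cin):
    """Full-adder variant: if z = 1, {cout, s} = x + y + cin;
    if z = 0, {cout, s} = y + cin. All signals are 1-bit ints."""
    total = (x if z else 0) + y + cin
    return total >> 1, total & 1  # (cout, s)


def array_multiply(a, b, n_bits=4):
    """Compute P = A x B with one row of cells per multiplier bit.
    `acc` holds the bits of the running sum (the y inputs)."""
    a_bits = [(a >> i) & 1 for i in range(n_bits)]
    b_bits = [(b >> j) & 1 for j in range(n_bits)]
    acc = [0] * (2 * n_bits)
    for j in range(n_bits):              # row j handles multiplier bit B_j
        carry = 0
        for i in range(n_bits):          # carry ripples along the row
            carry, acc[i + j] = cell(a_bits[i], acc[i + j], b_bits[j], carry)
        acc[j + n_bits] = carry          # row carry-out joins the running sum
    return sum(bit << k for k, bit in enumerate(acc))


print(array_multiply(13, 11))
```

In the combinational array every row must settle before the next row's result is valid, which is why the slide calls it slow; registers between rows would let a new multiplication start each clock.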

23 Virtex-5: DSP slice multiplier pipelining
[Figure: DSP48E slice block diagram (UG193). The A and B inputs feed a multiplier whose Partial Product 1 and Partial Product 2 pass through the optional MREG; the X, Y, and Z multiplexers (selected by OPMODE) feed an ALU (ALUMODE) that drives the 48-bit P output, CARRYOUT, PATTERNDETECT, and PATTERNBDETECT. Signals marked * in the figure: ACIN, ACOUT, BCIN, BCOUT, PCIN, PCOUT, CARRYCASCIN, CARRYCASCOUT, MULTSIGNIN, MULTSIGNOUT.]
*These signals are dedicated routing paths internal to the DSP48E column. They are not accessible via fabric routing resources.

24 Virtex-5: DSP slice multiplier pipelining
[Figure: the same DSP48E slice block diagram.]

25 Pre-Virtex-5: Pipelining 18x18 multipliers

26 Add pipeline stages, reduce clock period
$\dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
[Figure: ARM XScale, 8 stages.]
Q. Could adding pipeline stages hurt the CPI for an application?
A. Yes, due to these problems:
CPI Problem | Possible Solution
Taken branches cause longer stalls | Branch prediction, loop unrolling
Cache misses take more clock cycles | Larger caches, add prefetch opcodes to ISA
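The trade-off on this slide can be made concrete with the performance equation. The sketch below is illustrative only: every number in it (branch frequency, taken rate, stall penalties, clock periods) is a made-up assumption, not a measurement of any real 5-stage or 8-stage design, and `execution_time_ns` is a hypothetical helper.

```python
def execution_time_ns(instructions, branch_fraction, taken_fraction,
                      taken_branch_penalty_cycles, clock_period_ns,
                      base_cpi=1.0):
    """Seconds/Program = Instructions/Program x Cycles/Instruction x Seconds/Cycle.
    CPI is the base CPI plus the average stall cycles from taken branches."""
    cpi = base_cpi + branch_fraction * taken_fraction * taken_branch_penalty_cycles
    return instructions * cpi * clock_period_ns, cpi


# Hypothetical workload: 1M instructions, 20% branches, 60% of them taken.
N, BRANCHES, TAKEN = 1_000_000, 0.20, 0.60

# Hypothetical pipelines: deeper pipe has a shorter clock but a longer branch stall.
time_5, cpi_5 = execution_time_ns(N, BRANCHES, TAKEN, 1, clock_period_ns=10.0)
time_8, cpi_8 = execution_time_ns(N, BRANCHES, TAKEN, 4, clock_period_ns=6.5)

print(f"5-stage: CPI {cpi_5:.2f}, {time_5 / 1e6:.2f} ms")
print(f"8-stage: CPI {cpi_8:.2f}, {time_8 / 1e6:.2f} ms")
```

With these assumed numbers the deeper pipeline has the worse CPI but still finishes sooner, because the clock period shrinks by more than the CPI grows; a larger branch penalty or a smaller clock gain would reverse the result.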

27 Recall: Control hazards...
[Figure: 5-stage datapath: IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with the PC register, +0x4 adder, and I-Cache (Instr Mem).]
We avoid stalling by (1) adding a branch delay slot, and (2) adding comparator to ID stage. If we add more early stages, we must stall.
Sample Program (ISA w/o branch delay slot):
I1: BEQ R4,R3,25
I2: AND R6,R5,R4
I3: SUB R1,R9,R8
Time:  t1  t2  t3  t4   t5
I1:    IF  ID  EX  MEM  WB
I2:        IF  ID
I3:            IF
EX stage computes if branch is taken. If branch is taken, these instructions (I2, I3) MUST NOT complete!

28 Solution: Branch prediction...
[Figure: 5-stage datapath: IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with a Branch Predictor next to the PC and I-Cache.]
We update the PC based on the outputs of the branch predictor. If it is perfect, pipe stays full!
Dynamic Predictors: a cache of branch history.
Branch Predictor input: What PC? Predictions: A control instr? Taken or Not Taken? If taken, where to?
Time:  t1  t2  t3  t4   t5
I1:    IF  ID  EX  MEM  WB
I2:        IF  ID
I3:            IF
EX stage computes if branch is taken. If we predicted incorrectly, these instructions MUST NOT complete!

29 Branch predictors cache branch history
[Figure: the address of the BNEZ instruction (0b0110[...]; the instruction is BNEZ R1 Loop) is looked up in the Branch Target Buffer (BTB) and the Branch History Table (BHT). The BTB compares a 28-bit address tag (0b0110[...]0100) to signal Hit and supplies the target address (PC of Loop); the BHT supplies Taken or Not Taken. Other labels: 2 bits, 28 bits.]
Update BHT/BTB for next time, once true behavior known.
Must check prediction, kill instruction if needed.
% accurate

30 Simple ("2-bit") Branch History Table Entry
[Figure: two flip-flops (D, Q), one per bit.]
Prediction bit: prediction for next branch (1 = take, 0 = not take). Initialize to 0.
"Last predict correct" bit: was last prediction correct? (1 = yes, 0 = no). Initialize to 1.
After we check prediction...
Prediction bit: flip bit if prediction is not correct and "last predict correct" bit is 0.
"Last predict correct" bit: set to 1 if prediction bit was correct. Set to 0 if prediction bit was incorrect. Set to 1 if prediction bit flips.
We do not change the prediction the first time it is incorrect. Why?
ADDI R4,R0,11
loop: SUBI R4,R4,1
BNE R4,R0,loop
This branch taken 10 times, then not taken once (end of loop). The next time we enter the loop, we would like to predict "take" the first time through.
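The update rules for this entry can be simulated directly. The sketch below follows the rules as stated on the slide; the class name `TwoBitEntry` and the helper `run_loop` are invented for the illustration, and the loop branch is modeled as ten taken outcomes followed by one not-taken outcome.

```python
class TwoBitEntry:
    """One BHT entry: a prediction bit plus a 'last predict correct' bit."""

    def __init__(self):
        self.predict_taken = 0  # 1 = take, 0 = not take; initialized to 0
        self.last_correct = 1   # 1 = yes, 0 = no; initialized to 1

    def update(self, taken):
        """Check the prediction against the true outcome, then update state.
        Returns True if the prediction was correct."""
        correct = bool(self.predict_taken) == taken
        if correct:
            self.last_correct = 1
        elif self.last_correct == 0:
            self.predict_taken ^= 1  # second miss in a row: flip the prediction
            self.last_correct = 1    # set to 1 because the prediction bit flipped
        else:
            self.last_correct = 0    # first miss: keep the prediction
        return correct


def run_loop(entry, iterations=10):
    """Branch is taken `iterations` times, then not taken once (loop exit).
    Returns the number of mispredictions for this pass through the loop."""
    outcomes = [True] * iterations + [False]
    return sum(not entry.update(taken) for taken in outcomes)


entry = TwoBitEntry()
first_pass = run_loop(entry)
second_pass = run_loop(entry)
print(first_pass, second_pass)
```

Under these rules the single not-taken outcome at loop exit does not flip the prediction, so the entry still predicts "take" when the loop is entered again; only the exit branch is mispredicted on later passes.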

31 Spatial enhancements: many BHTs...
[Figure: the address of BNEZ R1 Loop (0b0110[...]) indexes four Branch History Tables (BHT00, BHT01, BHT10, BHT11); an adaptive function of history and state produces Taken or Not Taken.]
Detects patterns in: if (x < 12) [...] if (x < 6) [...] code.
BHT00/01/10/11 code the last four branches in the instruction stream (Yeh and Patt).
95% accurate.

32 Hardware limits to superpipelining?
[Plot: FO4 delays per clock period for CPU clock periods over time; processor legend not recoverable. Annotated points: MIPS, Pentium Pro (10 stages), Pentium 4 (20 stages). Historical limit: about 12.]
FO4: How many fanout-of-4 inverter delays in the clock period.
Power wall: Intel Core Duo has 14 stages.
Thanks to Francois Labonte, Stanford.

33 Superscalar
Basic Idea: Improve CPI by issuing several instructions per cycle.

34 Recall VLIW: Super-sized Instructions
Example: All instructions are 64-bit. Each instruction consists of two 32-bit MIPS instructions, that execute in parallel.
A 64-bit VLIW instruction:
[opcode | rs | rt | rd | shamt | funct] Syntax: ADD $8 $9 $10. Semantics: $8 = $9 + $10.
[opcode | rs | rt | rd | shamt | funct] Syntax: ADD $7 $8 $9. Semantics: $7 = $8 + $9.
But what if we can't change ISA execution semantics?

35 Superscalar R machine
[Figure: dual-issue datapath with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB. A 64-bit instruction memory (addressed with 32 bits) feeds the Instruction Issue Logic; the register file has four read ports (rs1 to rs4, rd1 to rd4) and two write ports (ws1/wd1/WE1, ws2/wd2/WE2); two ALUs (inputs A and B, op, output Y) feed result registers R. PC and Sequencer.]

36 Sustaining Dual Instr Issues (no forwarding)
Instruction pairs, as listed on the slide:
ADD R8,R0,R0 | ADD R11,R0,R0
ADD R27,R26,R25 | ADD R30,R29,R28
ADD R21,R20,R19 | ADD R24,R23,R22
ADD R15,R14,R13 | ADD R18,R17,R16
ADD R9,R8,R7 | ADD R12,R11,R10
[Figure: superscalar datapath with both pipes full. ID: ADD R9,R8,R7 and ADD R12,R11,R10. EX: ADD R15,R14,R13 and ADD R18,R17,R16. MEM: ADD R21,R20,R19 and ADD R24,R23,R22. WB: ADD R27 and ADD R30.]
It's rarely this good...

37 Worst-Case Instruction Issue
We add 12 forwarding buses (not shown): 6 to each ID from stages of both pipes.
Instruction stream:
ADD R8,R0,R0
ADD R9,R8,R0
ADD R10,R9,R0
ADD R11,R10,R0
[Figure: superscalar datapath. One pipe holds ADD R11,R10,R0 (ID), ADD R10,R9,R0 (EX), ADD R9,R8,R0 (MEM), and ADD R8,R0,R0 (WB); the other pipe holds a NOP in every stage.]
Dependencies force "serialization".

38 Superscalar: A simple example...
Example: Superscalar MIPS. Fetches 2 instructions at a time. If first integer and second floating point, issue in same cycle.
Integer instruction | FP instruction
LD F0,0(R1) |
LD F6,-8(R1) |
LD F10,-16(R1) | ADDD F4,F0,F2
LD F14,-24(R1) | ADDD F8,F6,F2
LD F18,-32(R1) | ADDD F12,F10,F2
SD 0(R1),F4 | ADDD F16,F14,F2
SD -8(R1),F8 | ADDD F20,F18,F2
SD -16(R1),F12 |
SD -24(R1),F16 |
Rows with both columns filled: two issues per cycle. Rows with one column filled: one issue per cycle.
Why is the control for this CPU not so hard to do?
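The issue rule on this slide is simple enough to count cycles by hand or in code. The sketch below is a simplification made for illustration: it looks only at instruction types, ignores data hazards and fetch alignment, and the name `count_issue_cycles` is hypothetical.

```python
def count_issue_cycles(kinds):
    """Count issue cycles for an in-order stream of 'int' / 'fp' instructions.
    Rule: if the next instruction is integer and the one after it is floating
    point, both issue in the same cycle; otherwise one instruction issues."""
    cycles = 0
    i = 0
    while i < len(kinds):
        if kinds[i] == "int" and i + 1 < len(kinds) and kinds[i + 1] == "fp":
            i += 2  # dual issue
        else:
            i += 1  # single issue
        cycles += 1
    return cycles


# Instruction types of the schedule in the table above, read row by row
# (loads and stores use the integer pipe).
slide_schedule = ["int", "int",
                  "int", "fp", "int", "fp", "int", "fp", "int", "fp", "int", "fp",
                  "int", "int"]
cycles = count_issue_cycles(slide_schedule)
print(cycles, "cycles for", len(slide_schedule), "instructions")

# A perfectly alternating 50/50 mix reaches the ideal CPI of 0.5.
ideal = ["int", "fp"] * 4
print(count_issue_cycles(ideal) / len(ideal))
```

Because the decision depends only on the types of two adjacent instructions, the control logic stays small, which is one way to read the slide's closing question.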

39 Superscalar: Visualizing the pipeline
Type               Pipe Stages
Int. instruction   IF ID EX MEM WB
FP instruction     IF ID EX MEM WB
Int. instruction      IF ID EX MEM WB
FP instruction        IF ID EX MEM WB
Int. instruction         IF ID EX MEM WB
FP instruction           IF ID EX MEM WB
Three instructions are potentially affected by a single cycle of load delay, as FP register loads are done in the integer pipeline.

40 Limitations of lockstep superscalar
Gets 0.5 CPI only for a 50/50 float/int mix with no hazards. For games/media, this may be OK.
Extending scheme to speed up general apps (Microsoft Office, ...) is complicated.
If one accepts building a complicated machine, there are better ways to do it.
Next Monday: Dynamic Scheduling.
[Figure: IBM Power 5 pipeline, as on slide 13.]

41 Virtual Memory

42 The Limits of Physical Addressing
Where we are...
[Figure: CPU connected to Memory by address lines A0-A31 and data lines D0-D31. The addresses are "physical addresses" of memory locations.]
All programs share one address space: the physical address space.
Machine language programs must be aware of the machine organization.
No way to prevent a program from accessing any machine resource.

43 Apple II: A physically-addressed machine
Apple ][ (1977). CPU: 1000 ns. DRAM: 400 ns.
[Photo: Steve Jobs and Steve Wozniak.]

44 Apple II: A physically addressed machine
Apple ][ (1977).

45 The Limits of Physical Addressing
Programming the Apple ][...
[Figure: CPU connected to Memory by address lines A0-A31 and data lines D0-D31. The addresses are "physical addresses" of memory locations.]
All programs share one address space: the physical address space.
Machine language programs must be aware of the machine organization.
No way to prevent a program from accessing any machine resource.

46 Solution: Add a Layer of Indirection
[Figure: the CPU issues "virtual addresses" (A0-A31) to Address Translation, which sends "physical addresses" (A0-A31) to Memory; data lines D0-D31 connect CPU and Memory.]
User programs run in a standardized virtual address space.
Address Translation hardware, managed by the operating system (OS), maps virtual address to physical memory.
Hardware supports "modern" OS features: Protection, Translation, Sharing.

47 MIPS R4000: Address Space Model
ASID = Address Space Identifier.
[Figure: two address-space maps, for Process A and Process B, with ASID = 12 and ASID = 13. Each runs from 0 up to 2 GB; the region above 2 GB is labeled "Address Error" (may only be accessed by kernel/supervisor).]
Process A and B have independent address spaces. All address spaces use a standard memory map.
When Process A writes its address 9, it writes to a different physical memory location than Process B's address 9.
To let Process A and B share memory, OS maps parts of ASID 12 and ASID 13 to the same physical memory locations.
Still works (slowly!) if a process accesses more virtual memory than the machine has physical memory.

48 MIPS R4000: Who's Running on the CPU?
System Control Registers (register number in parentheses):
Used with memory management system: Index (0), Random (1), EntryLo0 (2), EntryLo1 (3), PageMask (5), Wired (6), EntryHi (10), PRId (15), Config (16), LLAddr (17), TagLo (28), TagHi (29).
Used with exception processing: Context (4), BadVAddr (8), Count (9), Compare (11), Status (12), Cause (13), EPC (14), WatchLo (18), WatchHi (19), XContext (20), ECC (26), CacheErr (27), ErrorEPC (30). See Chapter 5 for details.
[Figure also shows the TLB, entries 0 to 47, including the "safe" entries (see Random register, contents of TLB Wired).]
Status (12): Indicates user, supervisor, or kernel mode. User cannot write supervisor/kernel bits. Supervisor cannot write kernel bit. User cannot change address translation configuration.
EntryHi (10): 8-bit ASID field codes virtual address space ID.

49 MIPS Address Translation: How it works
[Figure: the CPU sends "virtual addresses" (A0-A31) to the Translation Look-Aside Buffer (TLB), which sends "physical addresses" (A0-A31) to Memory; data lines D0-D31 connect CPU and Memory.]
Translation Look-Aside Buffer (TLB): A small fully-associative cache of mappings from virtual to physical addresses.
TLB also contains ASID and kernel/supervisor bits for virtual address.
Fast common case: Virtual address is in TLB, process has permission to read/write it.
What is the table of mappings that it caches?

50 Page tables encode virtual address spaces
A virtual address space is divided into blocks of memory called pages.
A machine usually supports pages of a few sizes. MIPS R4000 page sizes: 4 Kbytes, 16 Kbytes, 64 Kbytes, 256 Kbytes, 1 Mbyte, 4 Mbytes, 16 Mbytes.
OS manages the page table for each ASID (one page table per ASID).
A page table is indexed by a virtual address.
A valid page table entry codes physical memory "frame" address for the page.
[Figure: a virtual address indexes the page table, whose entries point to frames in the physical memory space.]

51 The TLB caches page table entries
[Figure: a virtual address (page number, offset) is looked up in the TLB, which maps page to frame and yields a physical address (frame number, offset). The Page Table for the ASID is indexed by the virtual page number; each entry holds a valid bit (V), Access Rights, and a physical address (PA). Remaining labels not recoverable.]
In this example, physical and virtual pages must be the same size!
MIPS handles TLB misses in software (random replacement). Other machines use hardware.
V=0 pages either reside on disk or have not yet been allocated. OS handles V=0: "Page fault".
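The lookup path described on the last few slides (TLB hit, software refill from the page table on a miss, page fault when V=0) can be sketched as a small software model. Everything named here is an assumption for illustration: the class `TLB`, the `PageFault` exception, the dictionary-based page tables, the frame numbers, and the fixed 4-Kbyte page size. It is not the R4000 refill handler.

```python
import random

PAGE_BITS = 12  # 4-Kbyte pages: the low 12 bits are the page offset


class PageFault(Exception):
    """Raised when the page table has no valid (V=1) entry for a page."""


class TLB:
    """Small fully-associative cache of (ASID, virtual page) -> frame mappings."""

    def __init__(self, page_tables, num_entries=48, seed=0):
        self.page_tables = page_tables  # {asid: {vpn: pfn}}, one table per ASID
        self.num_entries = num_entries
        self.entries = {}               # {(asid, vpn): pfn}
        self.misses = 0
        self.rng = random.Random(seed)

    def translate(self, asid, vaddr):
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        if (asid, vpn) not in self.entries:  # TLB miss: refill "in software"
            self.misses += 1
            self._refill(asid, vpn)
        # The offset passes through unchanged; only the page number is mapped.
        return (self.entries[(asid, vpn)] << PAGE_BITS) | offset

    def _refill(self, asid, vpn):
        pfn = self.page_tables.get(asid, {}).get(vpn)
        if pfn is None:
            raise PageFault(f"ASID {asid}, virtual page {vpn:#x}")
        if len(self.entries) >= self.num_entries:
            victim = self.rng.choice(list(self.entries))  # random replacement
            del self.entries[victim]
        self.entries[(asid, vpn)] = pfn


# Two processes: page 0 is private to each, page 1 is shared (same frame).
tables = {12: {0: 0x100, 1: 0x2A0}, 13: {0: 0x200, 1: 0x2A0}}
tlb = TLB(tables)
print(hex(tlb.translate(12, 9)), hex(tlb.translate(13, 9)))
print(hex(tlb.translate(12, 0x1004)), hex(tlb.translate(13, 0x1004)))
```

Keying the TLB on the ASID as well as the page number is what lets address 9 in one process and address 9 in another coexist in the TLB while mapping to different frames.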

52 MIPS R4000 TLB: A closer look...
[Figure: the CPU sends "virtual addresses" (A0-A31) to the Translation Look-Aside Buffer (TLB), which sends "physical addresses" to the Memory System; data lines D0-D31.]
Virtual Address with 1M (2^20) 4-Kbyte pages: an 8-bit ASID (checked against CP0 ASID), a 20-bit VPN (20 bits = 1M pages), and the Offset.
Bits 31, 30 and 29 of the virtual address select user, supervisor, or kernel address spaces.
Virtual-to-physical translation in TLB.
36-bit Physical Address (bits 35 to 0): PFN and Offset. Offset passed unchanged to physical memory.
Physical space larger than virtual space!

53 Can TLB and caching be overlapped?
[Figure: the Virtual Page Number goes to the Translation Look-Aside Buffer (TLB) while the Page Offset supplies the cache Index and Byte Select. The physical output of the TLB is compared (=) with the Cache Tag to produce Hit; the cache holds Cache Tags, Valid bits, and Cache Data (Cache Block); Data out.]
This works, but...
Q. What is the downside?
A. Inflexibility. VPN size locked to cache tag size.

54 Can we cache virtual addresses?
[Figure: the CPU sends "virtual addresses" (A0-A31) to a Virtual Cache; the Translation Look-Aside Buffer (TLB) sits between the cache and Main Memory and produces "physical addresses"; data lines D0-D31.]
Only use TLB on a cache miss!
Downside: a subtle, difficult problem. What is it?
A. Synonym problem. If two address spaces share a physical frame, data may be in cache twice. Maintaining consistency is a challenge.

55 Virtualization

56 Parallels: Running Windows on a Mac
Like software emulating a PC, but different.
Use an Intel-based Mac, runs on top of OS X.
Uses hardware support to create a fast virtual PC that boots Windows.
+++ Reasonable performance.
Virtual CPU runs 33% slower than running on physical CPU.
2 GB physical memory for a 512 MB virtual PC to run w/o disk swaps.

57 "Hardware assist"? What do we mean?
In an emulator, we run Windows code by simulating the CPU in software.
In a virtual machine, we let "safe" instructions (ex: ADD R3 R2 R1) run on the actual hardware.
We use hardware features to deny direct execution of instructions that could "break out" of the virtual machine.
We have seen an example of this sort of hardware feature earlier today...

58 Recall: A LW that misses TLB launches a program!
TLB caches page table entries.
[Figure: the same TLB and page table diagram as slide 51.]
In this example, physical and virtual pages must be the same size!
MIPS handles TLB misses in software (random replacement). Other machines use hardware.
V=0 pages either reside on disk or have not yet been allocated. OS handles V=0: "Page fault".

59 General Mechanism: Instruction Trap
Conceptually, we set CPU up to rewrite "unsafe" instructions with a function call.
Sample Program | What CPU Does
AND R2,R2,R1 | AND R2,R2,R1
ADD R4,R3,R2 | ADD R4,R3,R2
UNSAFE | JAL UNSAFE_STUB
 | NOP
ADD R5,R2,R2 | ADD R5,R2,R2
CPU rewrites instructions??? We have already done this for pipelining...
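The conceptual rewrite in the table above can be expressed as a short function. This is a toy model of the idea only: instructions are plain strings, the set `UNSAFE_OPCODES` and the function `what_cpu_does` are invented for the illustration, and real hardware detects the unsafe opcode during decode instead of editing a program listing.

```python
UNSAFE_OPCODES = {"UNSAFE"}  # placeholder opcode name taken from the slide


def what_cpu_does(program):
    """Return the instruction stream the CPU effectively executes: each
    unsafe instruction is replaced by a call to a stub plus a NOP
    (the NOP fills the delay slot after the JAL)."""
    executed = []
    for instruction in program:
        opcode = instruction.split()[0]
        if opcode in UNSAFE_OPCODES:
            executed.extend(["JAL UNSAFE_STUB", "NOP"])
        else:
            executed.append(instruction)
    return executed


sample_program = ["AND R2,R2,R1", "ADD R4,R3,R2", "UNSAFE", "ADD R5,R2,R2"]
for line in what_cpu_does(sample_program):
    print(line)
```

Safe instructions pass through untouched, which is why a virtual machine built this way can run most code at close to native speed.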

60 Recall: Muxing in NOPs to do stalls
Sample program:
ADD R4,R3,R2
OR R5,R4,R2
[Figure: datapath with Stage #1 (Instr Fetch), Stage #2 (Decode & Reg Fetch), and Stage #3; OR R5,R4,R2 is held in stage 2 while ADD R4,R3,R2 is in stage 3.]
Keep executing OR instruction until R4 is ready. Until then, send NOPs to 2/3.
Let ADD proceed to WB stage, so that R4 is written to regfile.
New datapath hardware:
(1) Mux into 2/3 to feed in NOP.
(2) Write enable on PC and 1/2. Freeze PC and 1/2 until stall is over.

61 Conclusion: Superpipelining, Superscalar
The 5 stage pipeline: a starting point for performance enhancements, a building block for multiprocessing.
Superpipelining: Reduce critical path by adding more pipeline stages. Has the potential to hurt the CPI.
Superscalar: Multiple instructions at once. Programs must fit the issue rules. Adds complexity.

62 Next Monday:
This Friday: Project Meeting, 10 AM, 125 Cory.
Final Presentation: Fri, Dec 5.


More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information
