CS Digital Systems Project Laboratory. Lecture 9: Advanced Processors I


1 CS Digital Systems Project Laboratory
Lecture 9: Advanced Processors I
John Lazzaro
TA: Greg Gibeling
www-inst.eecs.berkeley.edu/~cs194-6/

2 Today: Beyond the 5-stage pipeline
Amdahl's Law.
Taxonomy of advanced processing.
Superpipelining: Increasing the number of pipeline stages.
Superscalar: Issuing several instructions in a single cycle.
Hardware support for Virtual Memory and Virtual Machines.

3 Invented the "one ISA, many implementations" business model.

4 Amdahl's Law (of Diminishing Returns)
[Pie chart: where the program spends its time. Multiply: 52%. The other 48% is split into slices of 8%, 16%, 16%, and 8%; of their labels only Load and Branch survive.]
If enhancement E makes multiply infinitely fast, but other instructions are unchanged, what is the maximum speedup S?
S = 1 / ((post-enhancement %) / 100%) = 1 / (48% / 100%) = 2.08
Attributed to Gene Amdahl: "Amdahl's Law".
What is the lesson of Amdahl's Law? Must enhance computers in a balanced way!
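The speedup calculation on this slide generalizes to any enhancement factor. The following is a minimal Python sketch; the function name `amdahl_speedup` and its parameter names are illustrative assumptions, not part of the lecture.

```python
import math

def amdahl_speedup(enhanced_fraction, enhancement_factor):
    """Overall speedup when `enhanced_fraction` of run time gets `enhancement_factor` times faster."""
    remaining = (1.0 - enhanced_fraction) + enhanced_fraction / enhancement_factor
    return 1.0 / remaining

# Multiply takes 52% of the time and becomes infinitely fast (the slide's case).
print(round(amdahl_speedup(0.52, math.inf), 2))  # 2.08
# A multiplier that is only 10x faster already gets most of that benefit.
print(round(amdahl_speedup(0.52, 10.0), 2))      # 1.88
```

The second call shows why the returns diminish: the untouched 48% of the run time dominates once multiply is fast enough.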

5 Amdahl's Law in Action
Program we wish to run on N CPUs: Serial 30%, Parallel 70%.
The program spends 30% of its time running code that cannot be recoded to run in parallel.
S = 1 / ((30% + (70% / N)) / 100%)
As N grows without bound, S approaches 1 / (30% / 100%) = 3.33.
[Table: speedup versus number of CPUs; values lost in extraction.]
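The speedup for a given CPU count can be tabulated directly from the formula. This is a sketch under the slide's assumption that the parallel 70% scales perfectly; the name `parallel_speedup` and the chosen CPU counts are illustrative, and the printed values are computed here rather than copied from the slide's table.

```python
def parallel_speedup(serial_fraction, n_cpus):
    """Amdahl's Law for a program whose parallel part scales perfectly across n_cpus."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)

SERIAL = 0.30  # from the slide: 30% of run time cannot be parallelized

for n in (1, 2, 4, 8, 16):
    print(f"{n:>3} CPUs: speedup {parallel_speedup(SERIAL, n):.2f}")

# As n grows without bound, the speedup approaches 1 / serial fraction.
limit = 1.0 / SERIAL
print(f"limit: {limit:.2f}")
```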

6 Real-world 2006: 2 CPUs vs 4 CPUs
20-inch iMac: Core 2 Duo, 2.16 GHz, $1500.
Mac Pro: 2 dual-core Xeons, 2.66 GHz, $3200 with 20-inch display.

7 Real-world 2006: 2 CPUs vs 4 CPUs
iMac: 2 cores on one die. Mac Pro: 4 cores on two dies.
Simple video task: easier to parallelize.
ZIPing a file: very difficult to parallelize.
Caveat: Mac Pro CPUs are server-class and have architectural advantages (better I/O, ECC DRAM, etc.).
Source: MACWORLD

8 Taxonomy

9 5 Stage Pipeline: A point of departure
Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
[Figure: 5-stage datapath: IM, Reg, ALU, DM, Reg.]
At best, the 5-stage pipeline executes one instruction per clock, with a clock period determined by the slowest stage. This best case assumes:
Perfect caching.
Filling all delay slots (branch, load).
Processor has no multi-cycle instructions (ex: multiply with an accumulate register).
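The performance equation on this slide can be evaluated directly. In the sketch below the instruction count, the clock period, and the CPI of 1.4 are hypothetical numbers chosen only to show how the three terms multiply; `execution_time` is an illustrative name.

```python
def execution_time(instructions, cpi, clock_period_s):
    """Seconds/Program = Instructions/Program * Cycles/Instruction * Seconds/Cycle."""
    return instructions * cpi * clock_period_s

# Hypothetical workload: 1 billion instructions, 2 ns clock period.
ideal = execution_time(1_000_000_000, cpi=1.0, clock_period_s=2e-9)
stalled = execution_time(1_000_000_000, cpi=1.4, clock_period_s=2e-9)
print(f"ideal 5-stage pipeline (CPI 1.0): {ideal:.2f} s")
print(f"with stalls (CPI 1.4):            {stalled:.2f} s")
```

The rest of the lecture can be read as attacks on one term at a time: superpipelining shortens Seconds/Cycle, superscalar issue lowers Cycles/Instruction.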

10 Superpipelining: Add more stages (Today!)
Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle). Also, power!
Goal: Reduce critical path by adding more pipeline stages.
Example: 8-stage ARM XScale: extra IF, ID, data cache stages.
Difficulties: Added penalties for load delays and branch misses.
Ultimate Limiter: As logic delay goes to 0, flip-flop clk-to-q and setup times still bound the clock period.
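The "ultimate limiter" can be seen with a simple model of the clock period. The sketch assumes the logic divides evenly across stages, which real designs rarely achieve, and uses made-up delay values; `clock_period_ns` is an illustrative name.

```python
def clock_period_ns(total_logic_ns, stages, ff_overhead_ns):
    """Clock period if logic splits evenly; every stage also pays flip-flop clk-to-q plus setup."""
    return total_logic_ns / stages + ff_overhead_ns

LOGIC_NS = 10.0       # hypothetical total combinational delay
FF_OVERHEAD_NS = 0.2  # hypothetical clk-to-q + setup time

for stages in (5, 8, 20, 50, 1000):
    period = clock_period_ns(LOGIC_NS, stages, FF_OVERHEAD_NS)
    print(f"{stages:>4} stages: period {period:.3f} ns")
```

However many stages are added, the period in this model never drops below the flip-flop overhead.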

11 Superscalar: Multiple issues per cycle (Today!)
Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
Goal: Improve CPI by issuing several instructions per cycle.
Example: CPU with floating point ALUs: Issue 1 FP + 1 Integer instruction per cycle.
Difficulties: Load and branch delays affect more instructions.
Ultimate Limiter: Programs may be a poor match to issue rules.

12 Out of Order: Going around stalls (Next week)
Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
Goal: Issue instructions out of program order.
Example: MULTD waiting on F4 to load... so let ADDD go first.
[Code listing garbled in extraction: a load into F4, a MULTD that depends on F4, and an independent ADDD.]
Difficulties: Bookkeeping is highly complex. A poor fit for lockstep instruction scheduling.
Ultimate Limiter: The amount of instruction level parallelism present in an application.

13 Dynamic Scheduling: End lockstep (Next week)
Goal: Enable out-of-order by breaking pipeline in two: Fetch and Execution.
Example: IBM Power 5.
[Figure: IBM Power 5 pipeline. Instruction fetch (IF, IC, BP) feeds group formation and instruction decode (D0, D1, D2, D3, Xfer, GD), followed by out-of-order processing in the branch, load/store, fixed-point, and floating-point pipelines (MP, ISS, RF, then EX, EA/DC/Fmt, or F6, then WB, Xfer), ending at CP. Branch redirects and interrupts and flushes feed back to fetch.]
Limiters: Design complexity, instruction level parallelism.

14 Throughput and multiple threads (Next week)
Goal: Use multiple CPUs (real and virtual) to improve (1) throughput of machines that run many programs (2) execution time of multithreaded programs.
Example: Sun Niagara (8 SPARCs on one chip).
Difficulties: Gaining full advantage requires rewriting applications, OS, libraries.
Ultimate limiter: Amdahl's law, memory system performance.

15 Superpipelining

16 5 Stage vs. 8 Stage
5 Stage: IF, ID+RF, EX, MEM, WB (IM, Reg, ALU, DM, Reg).
8 Stage:
IF now takes 2 stages (pipelined I-cache).
ID and RF each get a stage.
ALU split over 3 stages.
MEM takes 2 stages (pipelined D-cache).
Note: Some stages now overlap, some instructions take extra stages.

17 Superpipelining techniques...
Split ALU and decode logic over several pipeline stages.
Pipeline memory: Use more banks of smaller arrays, add pipeline stages between decoders, muxes.
Remove rarely-used forwarding networks that are on critical path.
Pipeline the wires of frequently used forwarding networks. Creates stalls, affects CPI.
Also: Clocking tricks (example: negedge register file in COD3e pipeline).

18 Recall: IBM Power Timing Closure
Pipeline engineering happens here...
From "The circuit and physical design of the POWER4 microprocessor", IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

19 Recall: Pipelining SRAM memories...
Architects specify number of rows and columns.
Word and bit lines slow down as array grows larger!
[Figure: SRAM array of 16 words by 4 bits. An address decoder (A0-A3) drives word lines Word 0 to Word 15; Din 0 to Din 3 pass through write drivers and prechargers (Precharge, WrEn) onto the bit lines; sense amps produce Dout 0 to Dout 3 on parallel data I/O lines.]
Q: Which is longer: word line or bit line?
How could we pipeline this memory? Add muxes to select subset of bits.

20 ALU: Pipelining Unsigned Multiply
[Worked example, digits partly garbled in extraction: multiplicand 1101 (13) times multiplier 1011 (11), formed by summing shifted partial products.]
Facts to remember:
m bits x n bits = m+n bit product.
Binary makes it easy:
0 => place 0 (0 x multiplicand)
1 => place a copy (1 x multiplicand)
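The two "facts to remember" are enough to write the multiply as a loop. The sketch below is a software model of the arithmetic, not of the hardware; `shift_add_multiply` is an illustrative name, and 13 x 11 is used as the example operand pair.

```python
def shift_add_multiply(multiplicand, multiplier, n_bits):
    """Unsigned multiply: add a shifted copy of the multiplicand for each 1 bit of the multiplier."""
    product = 0
    for i in range(n_bits):
        if (multiplier >> i) & 1:          # bit is 1: place a copy
            product += multiplicand << i   # shifted to this bit's weight
        # bit is 0: place 0, so nothing is added
    return product

p = shift_add_multiply(0b1101, 0b1011, n_bits=4)
print(p, bin(p))  # 143 0b10001111
```

A 4-bit by 4-bit multiply yields at most an 8-bit product, matching the m+n rule.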

21 Building Block: Full-Adder Variant
1-bit signals: x, y, z, s, Cin, Cout
x: one bit of multiplicand
y: one bit of the running sum
z: one bit of multiplier
If z = 1, {Cout, s} <= x + y + Cin
If z = 0, {Cout, s} <= y + Cin

22 Put it together: Array computes P = A x B
[Figure: 4 x 4 array of full-adder-variant cells. Each row takes multiplicand bits A3..A0 on x and one multiplier bit (B0 to B3) on z; sums and carries (Cout) ripple through to produce P7..P0.]
As drawn, combinational (slow!).
To pipeline array: Place registers between adder stages. Use registers to delay selected A and B bits.
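The array can be modeled bit by bit with the full-adder variant as the only building block. This is a behavioral sketch of one plausible wiring (ripple carry within each row, running sum shifted right between rows); the figure's exact wiring may differ, and the names `cell` and `array_multiply` are assumptions.

```python
def cell(x, y, z, cin):
    """Full-adder variant: adds x only when z is 1. Returns (cout, s)."""
    total = (x & z) + y + cin
    return total >> 1, total & 1

def array_multiply(a, b, n=4):
    """Compute P = A x B with n rows of n cells, one row per multiplier bit."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    running = [0] * n            # upper bits of the running sum (the y inputs)
    product_bits = []
    for j in range(n):           # row j uses multiplier bit B_j as z
        z = (b >> j) & 1
        carry = 0
        row = []
        for i in range(n):       # ripple carry across the row
            carry, s = cell(a_bits[i], running[i], z, carry)
            row.append(s)
        product_bits.append(row[0])   # lowest sum bit is final product bit P_j
        running = row[1:] + [carry]   # shift right by one for the next row
    product_bits += running           # remaining bits are P_n .. P_(2n-1)
    return sum(bit << k for k, bit in enumerate(product_bits))

print(array_multiply(13, 11))  # 143
```

In this model, pipelining would correspond to latching `running` and the not-yet-used A and B bits at the end of each row.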

23 Virtex-5: DSP slice multiplier pipelining
[Figure: DSP48E slice block diagram (UG193). Inputs A and B feed the multiplier, which produces Partial Product 1 and Partial Product 2 into the optional MREG pipeline register. X, Y, and Z multiplexers (selected by OPMODE) feed the ALU (ALUMODE), which produces the 48-bit P output, CARRYOUT, and PATTERNDETECT.]
Signals marked * in the figure (ACIN/ACOUT, BCIN/BCOUT, PCIN/PCOUT, CARRYCASCIN/CARRYCASCOUT, MULTSIGNIN/MULTSIGNOUT) are dedicated routing paths internal to the DSP48E column. They are not accessible via fabric routing resources.


25 Pre Virtex-5: Pipelining 18x18 multipliers

26 Add pipeline stages, reduce clock period
Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
Example: ARM XScale, 8 stages.
Q. Could adding pipeline stages hurt the CPI for an application?
A. Yes, due to these problems:
CPI Problem                            Possible Solution
Taken branches cause longer stalls     Branch prediction, loop unrolling
Cache misses take more clock cycles    Larger caches, add prefetch opcodes to ISA
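The trade-off between a shorter clock and a worse CPI can be put into numbers. All values below (branch fraction, misprediction rates, penalties, clock periods) are hypothetical, and `effective_cpi` is a simplified model that charges stalls only for mispredicted branches.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, branch_penalty_cycles):
    """Base CPI plus average stall cycles per instruction from mispredicted branches."""
    return base_cpi + branch_fraction * mispredict_rate * branch_penalty_cycles

def run_time_ns(instructions, cpi, period_ns):
    """Seconds/Program expressed in nanoseconds."""
    return instructions * cpi * period_ns

N = 1_000_000
# Hypothetical: 20% branches; the deeper pipe has a shorter clock but a longer penalty.
shallow = run_time_ns(N, effective_cpi(1.0, 0.20, 0.5, 1), period_ns=2.0)
deep_no_pred = run_time_ns(N, effective_cpi(1.0, 0.20, 0.5, 4), period_ns=1.4)
deep_pred = run_time_ns(N, effective_cpi(1.0, 0.20, 0.1, 4), period_ns=1.4)

for name, t in (("5-stage", shallow), ("deep, poor prediction", deep_no_pred),
                ("deep, good prediction", deep_pred)):
    print(f"{name:<22} {t / 1e6:.3f} ms")
```

With these made-up numbers the deeper pipeline still wins overall, but better branch prediction recovers a large share of the CPI it gave up.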

27 Recall: Control hazards...
[Figure: 5-stage datapath: IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with the PC register, +0x4 adder, and I-Cache instruction memory.]
We avoided stalling by (1) adding a branch delay slot, and (2) adding comparator to ID stage.
If we add more early stages, we must stall.
Sample program (ISA w/o branch delay slot):
I1: BEQ R4,R3,25
I2: AND R6,R5,R4
I3: SUB R1,R9,R8
Time: t1  t2  t3  t4  t5  t6  t7  t8
I1:   IF  ID  EX  MEM WB
I2:       IF  ID
I3:           IF
I4:
I5:
I6:
EX stage computes if branch is taken.
If branch is taken, these instructions (I2 and I3) MUST NOT complete!

28 Solution: Branch prediction...
[Figure: 5-stage datapath (IF, ID, EX, MEM, WB) with a Branch Predictor beside the PC. The predictor takes the PC as input and outputs predictions.]
Branch predictor questions: A control instr? Taken or Not Taken? If taken, where to? What PC?
We update the PC based on the outputs of the branch predictor. If it is perfect, pipe stays full!
Dynamic Predictors: a cache of branch history.
Time: t1  t2  t3  t4  t5  t6  t7  t8
I1:   IF  ID  EX  MEM WB
I2:       IF  ID
I3:           IF
I4:
I5:
I6:
EX stage computes if branch is taken.
If we predicted incorrectly, these instructions MUST NOT complete!

29 Branch predictors cache branch history
Address of BNEZ instruction: 0b0110[...]01001000 (BNEZ R1 Loop)
[Figure: the instruction address is split into a 28-bit address tag and a 2-bit field. The Branch Target Buffer (BTB) compares its stored 28-bit address tag with 0b0110[...]0100; a match (=) signals Hit and supplies the target address (the PC of Loop). The Branch History Table (BHT) supplies Taken or Not Taken.]
Update BHT/BTB for next time, once true behavior known.
Must check prediction, kill instruction if needed.
[Accuracy percentage lost in extraction.]

30 Simple ("2-bit") Branch History Table Entry
Prediction bit: Prediction for next branch (1 = take, 0 = not take). Initialize to 0.
Flip bit if prediction is not correct and "last predict correct" bit is 0.
"Last predict correct" bit: Was last prediction correct? (1 = yes, 0 = no). Initialize to 1.
After we check prediction...
Set to 1 if prediction bit was correct.
Set to 0 if prediction bit was incorrect.
Set to 1 if prediction bit flips.
We do not change the prediction the first time it is incorrect. Why?
      ADDI R4,R0,11
loop: SUBI R4,R4,1
      BNE R4,R0,loop
This branch is taken 10 times, then not taken once (end of loop). The next time we enter the loop, we would like to predict "take" the first time through.
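The update rules for the two bits can be checked by simulating the loop branch. The sketch below follows the rules as stated on this slide; the class name `BHTEntry` and the choice to enter the loop three times are illustrative assumptions.

```python
class BHTEntry:
    """One branch history table entry: a prediction bit and a 'last prediction correct' bit."""

    def __init__(self):
        self.predict_taken = 0   # 1 = take, 0 = not take; initialized to 0
        self.last_correct = 1    # 1 = yes, 0 = no; initialized to 1

    def update(self, taken):
        """Check the prediction against the true outcome; return True if it was correct."""
        correct = self.predict_taken == taken
        if correct:
            self.last_correct = 1
        elif self.last_correct == 0:
            self.predict_taken ^= 1   # second miss in a row: flip the prediction
            self.last_correct = 1     # set to 1 because the prediction bit flipped
        else:
            self.last_correct = 0     # first miss: keep the prediction
        return correct

entry = BHTEntry()
loop_outcomes = [1] * 10 + [0]   # taken 10 times, then not taken at loop exit
misses_per_run = []
for _ in range(3):               # enter the loop three times
    misses = sum(not entry.update(t) for t in loop_outcomes)
    misses_per_run.append(misses)
print(misses_per_run)  # [3, 1, 1]
```

After the first pass warms up the entry, only the loop exit is mispredicted. A single prediction bit would also miss on every re-entry, which is the answer to the slide's "Why?".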

31 Spatial enhancements: many BHTs...
95% accurate.
[Figure: the address of BNEZ R1 Loop (0b0110[...]) indexes four Branch History Tables (BHT00, BHT01, BHT10, BHT11); an adaptive function of history and state produces Taken or Not Taken.]
BHT00/01/10/11 code the last four branches in the instruction stream.
Detects patterns in: if (x < 12) [...] if (x < 6) [...] code.
Yeh and Patt, 1992.

32 Hardware limits to superpipelining?
[Plot: CPU clock periods in FO4 delays (0 to 100) versus year (85 to 05) for Intel, Alpha, and other processor families. Labeled points: MIPS 2000, 5 stages; Pentium Pro, 10 stages; Pentium 4, 20 stages. Thanks to Francois Labonte, Stanford.]
FO4: How many fanout-of-4 inverter delays in the clock period.
Historical limit: about 12 FO4 delays.
Power wall: Intel Core Duo has 14 stages.

33 Superscalar
Basic Idea: Improve CPI by issuing several instructions per cycle.

34 Recall VLIW: Super-sized Instructions
Example: All instructions are 64-bit. Each instruction consists of two 32-bit MIPS instructions that execute in parallel.
A 64-bit VLIW instruction:
One half:   opcode rs rt rd shamt funct   Syntax: ADD $8 $9 $10   Semantics: $8 = $9 + $10
Other half: opcode rs rt rd shamt funct   Syntax: ADD $7 $8 $9    Semantics: $7 = $8 + $9
But what if we can't change ISA execution semantics?

35 Superscalar R machine
[Figure: dual-issue datapath with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB. Instruction memory (32-bit address, 64-bit data) feeds the Instruction Issue Logic. The register file has four read ports (rs1 to rs4, rd1 to rd4) and two write ports (ws1/wd1/WE1, ws2/wd2/WE2) serving two ALUs. PC and Sequencer control fetch.]

36 Sustaining Dual Instr Issues (no forwarding)
[Figure: snapshot of the two-pipe datapath with every stage full.]
Stage         Pipe 1              Pipe 2
ID (Decode)   ADD R9,R8,R7        ADD R12,R11,R10
EX (ALU)      ADD R15,R14,R13     ADD R18,R17,R16
MEM           ADD R21,R20,R19     ADD R24,R23,R22
WB            ADD R27,R26,R25     ADD R30,R29,R28
Also in the instruction stream: ADD R8,R0,R0 and ADD R11,R0,R0.
It's rarely this good...

37 Worst-Case Instruction Issue
We add 12 forwarding buses (not shown): 6 to each ID from stages of both pipes.
Instruction stream:
ADD R8,R0,R0
ADD R9,R8,R0
ADD R10,R9,R0
ADD R11,R10,R0
[Figure: snapshot of the two-pipe datapath. The ADDs occupy successive stages of one pipe, one per stage, while the other pipe carries NOPs in ID, EX, MEM, and WB.]
Dependencies force serialization.

38 Superscalar: A simple example...
Example: Superscalar MIPS. Fetches 2 instructions at a time. If first integer and second floating point, issue in same cycle.
Why is the control for this CPU not so hard to do?
Integer instruction    FP instruction
LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)         ADDD F4,F0,F2
LD F14,-24(R1)         ADDD F8,F6,F2
LD F18,-32(R1)         ADDD F12,F10,F2
SD 0(R1),F4            ADDD F16,F14,F2
SD -8(R1),F8           ADDD F20,F18,F2
SD -16(R1),F12
SD -24(R1),F16
Rows with both columns filled: two issues per cycle. Rows with one column filled: one issue per cycle.
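The issue rule is simple enough to count cycles by hand or in a few lines of code. The sketch below applies only the pairing rule (integer first, floating point second) and ignores hazards, load delays, and fetch alignment; `issue_cycles` and the `'int'`/`'fp'` tags are illustrative.

```python
def issue_cycles(kinds):
    """Count issue cycles for an in-order stream; kinds holds 'int' or 'fp' per instruction."""
    cycles = 0
    i = 0
    while i < len(kinds):
        # Dual issue only when an integer instruction is followed by an FP instruction.
        pair = i + 1 < len(kinds) and kinds[i] == "int" and kinds[i + 1] == "fp"
        i += 2 if pair else 1
        cycles += 1
    return cycles

# The slide's code in program order: loads/stores are 'int', ADDDs are 'fp'.
stream = ["int", "int", "int", "fp", "int", "fp", "int", "fp",
          "int", "fp", "int", "fp", "int", "int"]
cycles = issue_cycles(stream)
print(cycles, round(cycles / len(stream), 2))  # 9 0.64
```

The 14 instructions issue in 9 cycles, the same count as the rows of the table. A perfectly alternating int/fp stream would reach a CPI of 0.5.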

39 Superscalar: Visualizing the pipeline
Type               Pipe Stages
Int. instruction   IF ID EX MEM WB
FP instruction     IF ID EX MEM WB
Int. instruction      IF ID EX MEM WB
FP instruction        IF ID EX MEM WB
Int. instruction         IF ID EX MEM WB
FP instruction           IF ID EX MEM WB
Three instructions potentially affected by a single cycle of load delay (as FP register loads are done in the integer pipeline).

40 Limitations of lockstep superscalar
Gets 0.5 CPI only for a 50/50 float/int mix with no hazards. For games/media, this may be OK.
Extending scheme to speed up general apps (Microsoft Office, ...) is complicated.
If one accepts building a complicated machine, there are better ways to do it.
Next Monday: Dynamic Scheduling
[Figure: IBM Power 5 pipeline, with instruction fetch, group formation and instruction decode, and out-of-order branch, load/store, fixed-point, and floating-point pipelines.]

41 Virtual Memory

42 The Limits of Physical Addressing
Where we are...
[Figure: CPU connected directly to memory by address lines A0-A31 (physical addresses of memory locations) and data lines D0-D31.]
All programs share one address space: The physical address space.
Machine language programs must be aware of the machine organization.
No way to prevent a program from accessing any machine resource.

43 Apple II: A physically-addressed machine
Apple ][ (1977)
CPU: 1000 ns. DRAM: 400 ns.
[Photo: Steve Jobs and Steve Wozniak.]

44 Apple II: A physically addressed machine
Apple ][ (1977)

45 The Limits of Physical Addressing
Programming the Apple ][...
[Figure: CPU connected directly to memory by address lines A0-A31 (physical addresses of memory locations) and data lines D0-D31.]
All programs share one address space: The physical address space.
Machine language programs must be aware of the machine organization.
No way to prevent a program from accessing any machine resource.

46 Solution: Add a Layer of Indirection
[Figure: CPU issues virtual addresses (A0-A31) to Address Translation, which issues physical addresses (A0-A31) to memory; data lines D0-D31 connect CPU and memory.]
User programs run in a standardized virtual address space.
Address Translation hardware, managed by the operating system (OS), maps virtual address to physical memory.
Hardware supports "modern" OS features: Protection, Translation, Sharing.

47 MIPS R4000: Address Space Model
ASID = Address Space Identifier.
[Figure: memory maps for Process A and Process B under two ASIDs (ASID = 12 and ASID = 13). Each map runs from 0 to 2 GB (2^31) for the process; the region above that is marked Address Error and may only be accessed by kernel/supervisor.]
Process A and B have independent address spaces.
All address spaces use a standard memory map.
When Process A writes its address 9, it writes to a different physical memory location than Process B's address 9.
To let Process A and B share memory, OS maps parts of ASID 12 and ASID 13 to the same physical memory locations.
Still works (slowly!) if a process accesses more virtual memory than the machine has physical memory.

48 MIPS R4000: Who's Running on the CPU?
System Control Registers (register number in parentheses):
Used with memory management system: Index (0), Random (1), EntryLo0 (2), EntryLo1 (3), PageMask (5), Wired (6), EntryHi (10), PRId (15), Config (16), LLAddr (17), TagLo (28), TagHi (29).
Used with exception processing: Context (4), BadVAddr (8), Count (9), Compare (11), Status (12), Cause (13), EPC (14), WatchLo (18), WatchHi (19), XContext (20), ECC (26), CacheErr (27), ErrorEPC (30).
[Figure: the memory management registers sit beside the TLB (entries 47 to 0, including the "safe" entries; see Random register, contents of TLB Wired).]
Status (12): Indicates user, supervisor, or kernel mode.
User cannot write supervisor/kernel bits. Supervisor cannot write kernel bit.
User cannot change address translation configuration.
EntryHi (10): 8-bit ASID field codes virtual address space ID.

49 MIPS Address Translation: How it works
[Figure: CPU sends virtual addresses to the Translation Look-Aside Buffer (TLB), which sends physical addresses to memory.]
Translation Look-Aside Buffer (TLB): A small fully-associative cache of mappings from virtual to physical addresses.
TLB also contains ASID and kernel/supervisor bits for virtual address.
Fast common case: Virtual address is in TLB, process has permission to read/write it.
What is the table of mappings that it caches?

50 Page tables encode virtual address spaces
A virtual address space is divided into blocks of memory called pages.
A machine usually supports pages of a few sizes (MIPS R4000): 4 Kbytes, 16 Kbytes, 64 Kbytes, 256 Kbytes, 1 Mbyte, 4 Mbytes, 16 Mbytes.
OS manages the page table for each ASID (one page table per ASID).
A page table is indexed by a virtual address.
A valid page table entry codes physical memory "frame" address for the page.
[Figure: a virtual address indexes the page table, whose entries point to frames in the physical memory space.]

51 The TLB caches page table entries
[Figure: a virtual address (page no., offset) indexes the page table for the current ASID. Each entry holds a valid bit V, access rights, and a physical address (PA) of a frame. The frame's page no. joined with the unchanged offset forms the physical address. The TLB holds recently used page-to-frame mappings.]
In this example, physical and virtual pages must be the same size!
MIPS handles TLB misses in software (random replacement). Other machines use hardware.
V=0 pages either reside on disk or have not yet been allocated. OS handles V=0: page fault.
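The relationship between the TLB and the page table can be sketched in software. The model below assumes 4-Kbyte pages, a single address space (no ASID or access-rights check), an unbounded TLB with no replacement, and a made-up page table; `translate`, `page_table`, and `PageFault` are illustrative names, not R4000 mechanisms.

```python
PAGE_BITS = 12                      # assumes 4-Kbyte pages
PAGE_SIZE = 1 << PAGE_BITS

class PageFault(Exception):
    """Raised for V=0 entries; the OS would handle this."""

# Hypothetical page table for one ASID: virtual page no. -> (valid, frame no.)
page_table = {0: (1, 7), 1: (1, 3), 2: (0, None)}
tlb = {}                            # cache of page table entries

def translate(vaddr):
    """Return (physical address, 'hit' or 'miss') for a virtual address."""
    vpn, offset = vaddr >> PAGE_BITS, vaddr & (PAGE_SIZE - 1)
    if vpn in tlb:
        return (tlb[vpn] << PAGE_BITS) | offset, "hit"
    valid, frame = page_table.get(vpn, (0, None))
    if not valid:
        raise PageFault(hex(vaddr))
    tlb[vpn] = frame                # refill the TLB (done by software on MIPS)
    return (frame << PAGE_BITS) | offset, "miss"

for addr in (0x1ABC, 0x1010):
    paddr, outcome = translate(addr)
    print(hex(addr), "->", hex(paddr), outcome)
```

The first access to page 1 misses and walks the page table; the second access to the same page hits in the TLB. The offset passes through unchanged in both cases.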

52 MIPS R4000 TLB: A closer look...
[Figure: CPU sends virtual addresses to the Translation Look-Aside Buffer (TLB), which sends physical addresses to the memory system.]
Virtual address with 1M (2^20) 4-Kbyte pages: VPN (20 bits = 1M pages) and Offset (12 bits).
ASID (8 bits): Checked against CP0 ASID.
Bits 31, 30 and 29 of the virtual address select user, supervisor, or kernel address spaces.
Virtual-to-physical translation in TLB. Offset passed unchanged to physical memory.
36-bit physical address (bits 35 to 0): PFN and Offset.
Physical space larger than virtual space!

53 Can TLB and caching be overlapped?
[Figure: the virtual page number goes to the TLB while the page offset (index and byte select) reads the cache tags, valid bits, and cache data in parallel. The physical address from the TLB is compared (=) with the cache tag to produce Hit, and the selected cache block drives Data out.]
This works, but...
Q. What is the downside?
A. Inflexibility. VPN size locked to cache tag size.

54 Can we cache virtual addresses?
[Figure: CPU sends virtual addresses to a Virtual Cache; the TLB sits between the cache and main memory and translates to physical addresses.]
Only use TLB on a cache miss!
Downside: a subtle, difficult problem. What is it?
A. Synonym problem. If two address spaces share a physical frame, data may be in cache twice. Maintaining consistency is a challenge.

55 Virtualization

56 Parallels: Running Windows on a Mac
Like software emulating a PC, but different.
Use an Intel-based Mac, runs on top of OS X.
Uses hardware support to create a fast virtual PC that boots Windows.
+++ Reasonable performance.
Virtual CPU runs 33% slower than running on physical CPU.
2 GB physical memory for a 512 MB virtual PC to run w/o disk swaps.

57 Hardware assist? What do we mean?
In an emulator, we run Windows code by simulating the CPU in software.
In a virtual machine, we let "safe" instructions (ex: ADD R3 R2 R1) run on the actual hardware.
We use hardware features to deny direct execution of instructions that could "break out" of the virtual machine.
We have seen an example of this sort of hardware feature earlier today...

58 Recall: A LW that misses TLB launches a program!
[Figure: page table and TLB diagram repeated from "The TLB caches page table entries".]
MIPS handles TLB misses in software (random replacement). Other machines use hardware.
V=0 pages either reside on disk or have not yet been allocated. OS handles V=0: page fault.

59 General Mechanism: Instruction Trap
Conceptually, we set CPU up to rewrite unsafe instructions with a function call.
Sample Program      What CPU Does
AND R2,R2,R1        AND R2,R2,R1
ADD R4,R3,R2        ADD R4,R3,R2
UNSAFE              JAL UNSAFE_STUB
                    NOP
ADD R5,R2,R2        ADD R5,R2,R2
CPU rewrites instructions??? We have already done this for pipelining...
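The conceptual rewrite can be mimicked on a list of instruction strings. This is a toy model of the idea only: real hardware traps on privileged opcodes rather than editing the instruction stream, and `UNSAFE_OPCODES` and `what_cpu_does` are illustrative names built around the slide's placeholder opcode.

```python
UNSAFE_OPCODES = {"UNSAFE"}   # placeholder opcode name taken from the slide

def what_cpu_does(program):
    """Replace each unsafe instruction with a call to a stub, followed by a NOP as on the slide."""
    executed = []
    for instr in program:
        opcode = instr.split()[0]
        if opcode in UNSAFE_OPCODES:
            executed += ["JAL UNSAFE_STUB", "NOP"]
        else:
            executed.append(instr)   # safe: runs directly on the hardware
    return executed

sample = ["AND R2,R2,R1", "ADD R4,R3,R2", "UNSAFE", "ADD R5,R2,R2"]
for line in what_cpu_does(sample):
    print(line)
```

The stub is where the virtual machine monitor would emulate the unsafe instruction's effect before returning to the instruction after it.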

60 Recall: Muxing in NOPs to do stalls
Sample program:
ADD R4,R3,R2
OR R5,R4,R2
[Figure: 3-stage datapath (Stage #1 Instr Fetch, Stage #2 Decode & Reg Fetch, Stage #3) with OR R5,R4,R2 held in stage 2 and ADD R4,R3,R2 in stage 3.]
Keep executing OR instruction until R4 is ready. Until then, send NOPs to 2/3.
Let ADD proceed to WB stage, so that R4 is written to regfile.
New datapath hardware:
(1) Mux into 2/3 to feed in NOP.
(2) Write enable on PC and 1/2. Freeze PC and 1/2 until stall is over.

61 Conclusion: Superpipelining, Superscalar
The 5 stage pipeline: a starting point for performance enhancements, a building block for multiprocessing.
Superpipelining: Reduce critical path by adding more pipeline stages. Has the potential to hurt the CPI.
Superscalar: Multiple instructions at once. Programs must fit the issue rules. Adds complexity.

62 Next Monday:
This Friday: Project Meeting, 10 AM, 125 Cory
Final Presentation: Fri, Dec 5
