CS 194-6 Digital Systems Project Laboratory. Lecture 9: Advanced Processors I
1 CS 194-6 Digital Systems Project Laboratory. Lecture 9: Advanced Processors I. John Lazzaro. TA: Greg Gibeling. www-inst.eecs.berkeley.edu/~cs194-6/
2 Today: Beyond the 5-stage pipeline. Amdahl's Law. Taxonomy of advanced processing. Superpipelining: increasing the number of pipeline stages. Superscalar: issuing several instructions in a single cycle. Hardware support for virtual memory and virtual machines.
3 Invented the "one ISA, many implementations" business model.
4 Amdahl's Law (of Diminishing Returns). Where the program spends its time (pie chart): Multiply 52%; the remaining 48% is divided among the other instruction classes, including Load and Branch (slices of 8%, 16%, 16%, and 8%). If enhancement E makes multiply infinitely fast, but other instructions are unchanged, what is the maximum speedup S? $S = \frac{1}{(\text{post-enhancement \%}) / 100\%} = \frac{1}{48\% / 100\%} = 2.08$. Attributed to Gene Amdahl: "Amdahl's Law". What is the lesson of Amdahl's Law? Computers must be enhanced in a balanced way!
5 Amdahl's Law in Action. The program we wish to run on N CPUs: Serial 30%, Parallel 70%. The program spends 30% of its time running code that cannot be recoded to run in parallel. $S = \frac{1}{(30\% + 70\%/N) / 100\%}$. As N grows without bound, S approaches $1/0.30 \approx 3.33$. [Table: # CPUs versus speedup.]
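The two Amdahl's Law calculations above (an infinitely fast multiply, and a 70% parallel program on N CPUs) can be checked with a small sketch. The function name `amdahl_speedup` and its parameters are illustrative choices, not part of the lecture.

```python
def amdahl_speedup(enhanced_fraction: float, enhancement_factor: float) -> float:
    """Overall speedup when `enhanced_fraction` of the original run time
    is made `enhancement_factor` times faster and the rest is unchanged."""
    if not 0.0 <= enhanced_fraction <= 1.0:
        raise ValueError("enhanced_fraction must be between 0 and 1")
    if enhancement_factor <= 0:
        raise ValueError("enhancement_factor must be positive")
    # Fraction of the original run time that remains after the enhancement.
    remaining = (1.0 - enhanced_fraction) + enhanced_fraction / enhancement_factor
    return 1.0 / remaining


if __name__ == "__main__":
    # Slide 4: multiply (52% of run time) becomes infinitely fast.
    print(f"multiply -> infinitely fast: {amdahl_speedup(0.52, float('inf')):.2f}")
    # Slide 5: 70% of the program parallelizes across N CPUs.
    for n_cpus in (1, 2, 4, 8, 16):
        print(f"{n_cpus:2d} CPUs: {amdahl_speedup(0.70, n_cpus):.2f}")
```

Even with a very large number of CPUs, the 30% serial portion caps the speedup near 3.33.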
6 Real-world 2006: 2 CPUs vs 4 CPUs. 20-inch iMac: Core 2 Duo, 2.16 GHz, $1500. Mac Pro: 2 dual-core Xeons, 2.66 GHz, $3200 with a 20-inch display.
7 Real-world 2006: 2 CPUs vs 4 CPUs. iMac: 2 cores on one die. Mac Pro: 4 cores on two dies. Source: MACWORLD. Simple video task: easier to parallelize. ZIPing a file: very difficult to parallelize. Caveat: Mac Pro CPUs are server-class and have architectural advantages (better I/O, ECC DRAM, etc.).
8 Taxonomy
9 5 Stage Pipeline: A point of departure. $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$. Pipeline datapath: IM, Reg, ALU, DM, Reg. At best, the 5-stage pipeline executes one instruction per clock, with a clock period determined by the slowest stage. This best case assumes perfect caching, filling all delay slots (branch, load), and a processor with no multi-cycle instructions (ex: multiply with an accumulate register).
10 Superpipelining: Add more stages (today!). $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$. Also, power! Goal: Reduce critical path by adding more pipeline stages. Example: 8-stage ARM XScale: extra IF, ID, data cache stages. Difficulties: Added penalties for load delays and branch misses. Ultimate Limiter: As logic delay goes to 0, flip-flop clk-to-q and setup times remain and bound the clock period.
11 Superscalar: Multiple issues per cycle (today!). $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$. Goal: Improve CPI by issuing several instructions per cycle. Example: CPU with floating point ALUs: issue 1 FP + 1 integer instruction per cycle. Difficulties: Load and branch delays affect more instructions. Ultimate Limiter: Programs may be a poor match to issue rules.
12 Out of Order: Going around stalls (next week). $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$. Goal: Issue instructions out of program order. Example: MULTD is waiting on F4 to load, so let ADDD go first! Difficulties: Bookkeeping is highly complex. A poor fit for lockstep instruction scheduling. Ultimate Limiter: The amount of instruction level parallelism present in an application.
13 Dynamic Scheduling: End lockstep (next week). Goal: Enable out-of-order by breaking pipeline in two: Fetch and Execution. Example: IBM Power 5. [Figure: IBM Power 5 pipeline. Instruction fetch (IF, IC, BP); group formation and instruction decode (D0, D1, D2, D3, Xfer, GD); out-of-order processing in the branch, load/store, fixed-point, and floating-point pipelines; branch redirects; interrupts and flushes.] Limiters: Design complexity, instruction level parallelism.
14 Throughput and multiple threads (next week). Goal: Use multiple CPUs (real and virtual) to improve (1) throughput of machines that run many programs, (2) execution time of multithreaded programs. Example: Sun Niagara (8 SPARCs on one chip). Difficulties: Gaining full advantage requires rewriting applications, OS, libraries. Ultimate limiter: Amdahl's law, memory system performance.
15 Superpipelining
16 5 Stage vs 8 Stage. The 5-stage pipeline: IF, ID+RF, EX, MEM, WB (datapath: IM, Reg, ALU, DM, Reg). In the 8-stage pipeline: IF now takes 2 stages (pipelined I-cache). ID and RF each get a stage. ALU split over 3 stages. MEM takes 2 stages (pipelined D-cache). Note: Some stages now overlap, some instructions take extra stages.
17 Superpipelining techniques... Split ALU and decode logic over several pipeline stages. Pipeline memory: use more banks of smaller arrays, add pipeline stages between decoders and muxes. Remove rarely-used forwarding networks that are on the critical path. Pipeline the wires of frequently used forwarding networks. Creates stalls, affects CPI. Also: clocking tricks (example: negedge register file in COD3e pipeline).
18 Recall: IBM Power Timing Closure. Pipeline engineering happens here... From "The circuit and physical design of the POWER4 microprocessor", IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
19 Recall: Pipelining SRAM memories... Architects specify number of rows and columns. Word and bit lines slow down as array grows larger! [Figure: SRAM array of 16 words by 4 bits. An address decoder (A0-A3) drives the word lines (Word 0 to Word 15); Din 0-3 enter through write drivers and prechargers (WrEn, Precharge); columns of SRAM cells feed sense amps that produce Dout 0-3; parallel data I/O lines.] How could we pipeline this memory? Q: Which is longer: word line or bit line? Add muxes to select subset of bits.
20 ALU: Pipelining Unsigned Multiply. [Figure: worked binary multiplication example.] Facts to remember: m bits x n bits = m+n bit product. Binary makes it easy: 0 => place 0 (0 x multiplicand); 1 => place a copy (1 x multiplicand).
21 Building Block: Full-Adder Variant. 1-bit signals: x, y, z, s, Cin, Cout. x: one bit of multiplicand. y: one bit of the running sum. z: one bit of multiplier. If z = 1, {Cout, s} <= x + y + Cin. If z = 0, {Cout, s} <= y + Cin.
22 Put it together: Array computes P = A x B. [Figure: 4 x 4 array of full-adder variant cells with inputs A3-A0 and B3-B0, carry outputs Cout, and product bits P7-P0.] As drawn, combinational (slow!). To pipeline array: place registers between adder stages, and use registers to delay selected A and B bits.
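As a rough software model of the array, the sketch below chains the full-adder variant cell from slide 21 into one ripple-carry row per multiplier bit. The function names and the exact row wiring are assumptions made for illustration; the array drawn on the slide may route carries differently, though it computes the same product.

```python
def adder_cell(x: int, y: int, z: int, cin: int) -> tuple[int, int]:
    """Full-adder variant from slide 21, on 1-bit signals.
    If z = 1: {cout, s} = x + y + cin.  If z = 0: {cout, s} = y + cin."""
    total = (x & z) + y + cin
    return total >> 1, total & 1  # (cout, s)


def array_multiply(a: int, b: int, m: int = 4, n: int = 4) -> int:
    """Multiply an m-bit A by an n-bit B using one row of cells per bit of B."""
    if not (0 <= a < (1 << m) and 0 <= b < (1 << n)):
        raise ValueError("operands do not fit in the given bit widths")
    a_bits = [(a >> i) & 1 for i in range(m)]  # multiplicand, LSB first
    b_bits = [(b >> j) & 1 for j in range(n)]  # multiplier, LSB first
    p = [0] * (m + n)                          # running sum, LSB first
    for j, z in enumerate(b_bits):             # row j handles multiplier bit B_j
        carry = 0
        for i, x in enumerate(a_bits):
            carry, p[i + j] = adder_cell(x, p[i + j], z, carry)
        p[j + m] = carry                       # carry out of the row
    return sum(bit << k for k, bit in enumerate(p))


if __name__ == "__main__":
    print(array_multiply(13, 11))  # 4-bit x 4-bit gives an 8-bit product
```

In hardware terms, pipelining this model would mean placing a register after each pass of the outer loop, which is why the later A and B bits must also be delayed.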
23 Virtex-5: DSP slice multiplier pipelining. [Figure (UG193_c1_01_): DSP48E slice block diagram. The A and B inputs feed a multiplier (X) that produces Partial Product 1 and Partial Product 2, followed by the optional MREG pipeline register, the X, Y, and Z multiplexers (OPMODE), the ALU (ALUMODE, CARRYIN, CARRYINSEL), and the 48-bit P output with PATTERNDETECT and PATTERNBDETECT.] *Signals marked with an asterisk (ACIN, BCIN, PCIN, ACOUT, BCOUT, PCOUT, CARRYCASCIN, CARRYCASCOUT, MULTSIGNIN, MULTSIGNOUT) are dedicated routing paths internal to the DSP48E column. They are not accessible via fabric routing resources.
24 Virtex-5: DSP slice multiplier pipelining. [Figure: the same DSP48E slice block diagram.]
25 Pre Virtex-5: Pipelining 18x18 multipliers
26 Add pipeline stages, reduce clock period. $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$. Q. Could adding pipeline stages hurt the CPI for an application? A. Yes, due to these problems (ARM XScale, 8 stages):
CPI Problem | Possible Solution
Taken branches cause longer stalls | Branch prediction, loop unrolling
Cache misses take more clock cycles | Larger caches, add prefetch opcodes to ISA
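The trade-off can be made concrete with a back-of-the-envelope model: a deeper pipeline shortens the clock period but raises the branch penalty, and branch prediction recovers part of the loss. Every number below (branch fraction, penalties, clock periods) is a made-up illustration rather than an XScale measurement, and the function names are hypothetical.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  stall_rate: float, penalty_cycles: float) -> float:
    """CPI including branch stalls: a fraction `branch_fraction` of instructions
    are branches, and `stall_rate` of those cost `penalty_cycles` extra cycles."""
    return base_cpi + branch_fraction * stall_rate * penalty_cycles


def time_per_instruction_ns(cpi: float, clock_period_ns: float) -> float:
    """(Cycles / Instruction) x (Seconds / Cycle), in nanoseconds."""
    return cpi * clock_period_ns


if __name__ == "__main__":
    branch_fraction = 0.16  # assumed share of branches in the instruction mix
    configs = [
        # (label, clock period in ns, branch penalty in cycles, stall rate)
        ("5-stage, every branch stalls", 2.00, 1, 1.0),
        ("8-stage, every branch stalls", 1.25, 4, 1.0),
        ("8-stage, 90% of branches predicted", 1.25, 4, 0.1),
    ]
    for label, period_ns, penalty, stall_rate in configs:
        cpi = effective_cpi(1.0, branch_fraction, stall_rate, penalty)
        print(f"{label}: CPI = {cpi:.3f}, "
              f"{time_per_instruction_ns(cpi, period_ns):.2f} ns/instruction")
```

With these assumed numbers the deeper pipeline still wins, but most of its clock-rate advantage is lost to stalls until prediction is added.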
27 Recall: Control hazards... [Figure: 5-stage datapath with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB; PC register with +0x4 incrementer; I-Cache instruction memory (Addr, Data).] We avoid stalling by (1) adding a branch delay slot, and (2) adding a comparator to the ID stage. If we add more early stages, we must stall.
Sample program (ISA w/o branch delay slot):
I1: BEQ R4,R3,25
I2: AND R6,R5,R4
I3: SUB R1,R9,R8
Inst | t1 | t2 | t3 | t4 | t5
I1 | IF | ID | EX | MEM | WB
I2 | | IF | ID | |
I3 | | | IF | |
The EX stage computes if the branch is taken. If the branch is taken, these instructions (I2, I3) MUST NOT complete!
28 Solution: Branch prediction... [Figure: the same 5-stage datapath with a Branch Predictor attached to the fetch stage. It is asked "What PC?" and returns predictions: A control instr? Taken or Not Taken? If taken, where to?] We update the PC based on the outputs of the branch predictor. If it is perfect, pipe stays full! Dynamic predictors: a cache of branch history.
Inst | t1 | t2 | t3 | t4 | t5
I1 | IF | ID | EX | MEM | WB
I2 | | IF | ID | |
I3 | | | IF | |
The EX stage computes if the branch is taken. If we predicted incorrectly, these instructions (I2, I3) MUST NOT complete!
29 Branch predictors cache branch history. [Figure labels: address of BNEZ instruction 0b0110[...] (BNEZ R1 Loop); 2 bits; Branch Target Buffer (BTB) holding a 28-bit address tag (0b0110[...]0100) whose comparison (=) produces Hit, and a 28-bit target address (PC of Loop); Branch History Table (BHT) producing Taken or Not Taken.] Update BHT/BTB for next time, once true behavior known. Must check prediction, kill instruction if needed. % accurate
30 Simple ("2-bit") Branch History Table Entry. Two flip-flops (D, Q) hold the entry. First bit: prediction for next branch (1 = take, 0 = not take). Initialize to 0. Flip this bit if the prediction is not correct and the "last predict correct" bit is 0. Second bit: was last prediction correct? (1 = yes, 0 = no). Initialize to 1. After we check the prediction: set to 1 if prediction bit was correct; set to 0 if prediction bit was incorrect; set to 1 if prediction bit flips. We do not change the prediction the first time it is incorrect. Why?
ADDI R4,R0,11
loop: SUBI R4,R4,1
BNE R4,R0,loop
This branch is taken 10 times, then not taken once (end of loop). The next time we enter the loop, we would like to predict "take" the first time through.
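The update rules for this entry are easy to get wrong, so here is a minimal simulation of a single entry driven by the loop branch's outcomes (taken 10 times, then not taken). The class and function names are illustrative; only the two bits and their update rules come from the slide.

```python
class TwoBitEntry:
    """One branch history table entry: a prediction bit plus a
    'last prediction correct' bit, updated as described on slide 30."""

    def __init__(self) -> None:
        self.predict_taken = 0  # 1 = take, 0 = not take; initialized to 0
        self.last_correct = 1   # 1 = yes, 0 = no; initialized to 1

    def update(self, taken: bool) -> bool:
        """Check the current prediction against the real outcome, update
        both bits, and return True if the prediction was correct."""
        correct = bool(self.predict_taken) == taken
        if correct:
            self.last_correct = 1
        elif self.last_correct == 0:
            # Second wrong prediction in a row: flip the prediction bit.
            self.predict_taken ^= 1
            self.last_correct = 1  # set to 1 because the prediction bit flipped
        else:
            # First wrong prediction: keep the prediction, remember the miss.
            self.last_correct = 0
        return correct


def count_mispredictions(entry: TwoBitEntry, outcomes: list[bool]) -> int:
    """Feed a sequence of branch outcomes through one entry."""
    return sum(0 if entry.update(taken) else 1 for taken in outcomes)


if __name__ == "__main__":
    loop_branch = [True] * 10 + [False]  # taken 10 times, then falls through
    entry = TwoBitEntry()
    print("first pass :", count_mispredictions(entry, loop_branch))
    print("second pass:", count_mispredictions(entry, loop_branch))
```

On the second pass the only misprediction is the final fall-through: the single wrong guess at the end of the first pass did not flip the prediction, so the entry still predicts "take" on re-entry.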
31 Spatial enhancements: many BHTs... 95% accurate. [Figure: the branch address (0b0110[...], BNEZ R1 Loop) indexes four Branch History Tables (BHT00, BHT01, BHT10, BHT11); an adaptive function of history and state produces Taken or Not Taken.] Detects patterns in code such as: if (x < 12) [...] if (x < 6) [...]. Yeh and Patt. BHT00/01/10/11 code the last four branches in the instruction stream.
32 Hardware limits to superpipelining? [Figure: "CPU Clock Periods" plot of FO4 delays per clock period over time for several processor families, with annotations for MIPS, Pentium Pro (10 stages), and Pentium 4 (20 stages). Thanks to Francois Labonte, Stanford.] FO4: How many fanout-of-4 inverter delays in the clock period. Historical limit: about 12. Power wall: Intel Core Duo has 14 stages.
33 Superscalar. Basic Idea: Improve CPI by issuing several instructions per cycle.
34 Recall VLIW: Super-sized Instructions. Example: All instructions are 64-bit. Each instruction consists of two 32-bit MIPS instructions that execute in parallel. A 64-bit VLIW instruction: [opcode rs rt rd shamt funct] [opcode rs rt rd shamt funct]. One slot: Syntax: ADD $8 $9 $10. Semantics: $8 = $9 + $10. Other slot: Syntax: ADD $7 $8 $9. Semantics: $7 = $8 + $9. But what if we can't change ISA execution semantics?
35 Superscalar R machine. [Figure: dual-issue datapath with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB. A 64-bit-wide instruction memory feeds the instruction issue logic; the register file has four read ports (rs1-rs4, rd1-rd4) and two write ports (ws1/wd1/WE1, ws2/wd2/WE2); two ALUs; PC and sequencer.]
36 Sustaining Dual Instr Issues (no forwarding). It's rarely this good...
Program, issued in pairs:
ADD R8,R0,R0 | ADD R11,R0,R0
ADD R27,R26,R25 | ADD R30,R29,R28
ADD R21,R20,R19 | ADD R24,R23,R22
ADD R15,R14,R13 | ADD R18,R17,R16
ADD R9,R8,R7 | ADD R12,R11,R10
Pipeline snapshot:
ID (Decode) | ADD R9,R8,R7 | ADD R12,R11,R10
EX (ALU) | ADD R15,R14,R13 | ADD R18,R17,R16
MEM | ADD R21,R20,R19 | ADD R24,R23,R22
WB | ADD R27 | ADD R30
37 Worst-Case Instruction Issue. We add 12 forwarding buses (not shown): 6 to each ID from stages of both pipes. Program: ADD R8,R0,R0; ADD R9,R8,R0; ADD R10,R9,R0; ADD R11,R10,R0. Dependencies force serialization.
Pipeline snapshot:
ID (Decode) | ADD R11,R10,R0 | NOP
EX (ALU) | ADD R10,R9,R0 | NOP
MEM | ADD R9,R8,R0 | NOP
WB | ADD R8,R0,R0 | NOP
38 Superscalar: A simple example... Example: Superscalar MIPS. Fetches 2 instructions at a time. If first integer and second floating point, issue in same cycle. Why is the control for this CPU not so hard to do?
Integer instruction | FP instruction
LD F0,0(R1) |
LD F6,-8(R1) |
LD F10,-16(R1) | ADDD F4,F0,F2
LD F14,-24(R1) | ADDD F8,F6,F2
LD F18,-32(R1) | ADDD F12,F10,F2
SD 0(R1),F4 | ADDD F16,F14,F2
SD -8(R1),F8 | ADDD F20,F18,F2
SD -16(R1),F12 |
SD -24(R1),F16 |
Rows with both columns filled: two issues per cycle. Rows with one column filled: one issue per cycle.
39 Superscalar: Visualizing the pipeline.
Type | Pipe stages (one column per clock cycle)
Int. instruction | IF | ID | EX | MEM | WB | |
FP instruction | IF | ID | EX | MEM | WB | |
Int. instruction | | IF | ID | EX | MEM | WB |
FP instruction | | IF | ID | EX | MEM | WB |
Int. instruction | | | IF | ID | EX | MEM | WB
FP instruction | | | IF | ID | EX | MEM | WB
Three instructions are potentially affected by a single cycle of load delay (as FP register loads are done in the integer pipeline).
40 Limitations of lockstep superscalar. Gets 0.5 CPI only for a 50/50 float/int mix with no hazards. For games/media, this may be OK. Extending scheme to speed up general apps (Microsoft Office, ...) is complicated. If one accepts building a complicated machine, there are better ways to do it. Next Monday: Dynamic Scheduling. [Figure: IBM Power 5 pipeline, as on slide 13.]
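The 0.5 CPI claim follows directly from the issue rule on slide 38 (an integer instruction followed by a floating point instruction issue together). The sketch below counts issue cycles under that rule only. It ignores hazards, load delays, and fetch alignment, and the function name and the "int"/"fp" labels are illustrative.

```python
def lockstep_issue_cycles(kinds: list[str]) -> int:
    """Count issue cycles for a stream of 'int' / 'fp' instructions when an
    'int' immediately followed by an 'fp' may issue in the same cycle."""
    cycles = 0
    i = 0
    while i < len(kinds):
        if kinds[i] == "int" and i + 1 < len(kinds) and kinds[i + 1] == "fp":
            i += 2  # dual issue: integer pipe and FP pipe both used
        else:
            i += 1  # single issue: the other pipe idles this cycle
        cycles += 1
    return cycles


if __name__ == "__main__":
    perfect_mix = ["int", "fp"] * 5
    integer_only = ["int"] * 10
    for name, stream in (("50/50 mix", perfect_mix), ("integer only", integer_only)):
        cycles = lockstep_issue_cycles(stream)
        print(f"{name}: {cycles} cycles, CPI = {cycles / len(stream):.2f}")
```

An integer-only stream, which is closer to a general application, sees no benefit at all from the second pipe under this rule.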
41 Virtual Memory
42 The Limits of Physical Addressing. Where we are... [Figure: CPU connected directly to memory by address lines A0-A31 (physical addresses of memory locations) and data lines D0-D31.] All programs share one address space: the physical address space. Machine language programs must be aware of the machine organization. No way to prevent a program from accessing any machine resource.
43 Apple II: A physically-addressed machine. Apple ][ (1977). CPU: 1000 ns. DRAM: 400 ns. Steve Jobs, Steve Wozniak.
44 Apple II: A physically addressed machine. Apple ][ (1977).
45 The Limits of Physical Addressing. Programming the Apple ][... [Figure: CPU connected directly to memory by address lines A0-A31 (physical addresses of memory locations) and data lines D0-D31.] All programs share one address space: the physical address space. Machine language programs must be aware of the machine organization. No way to prevent a program from accessing any machine resource.
46 Solution: Add a Layer of Indirection. [Figure: the CPU issues virtual addresses (A0-A31); address translation maps them to physical addresses (A0-A31) for memory; data lines D0-D31.] User programs run in a standardized virtual address space. Address translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory. Hardware supports "modern" OS features: Protection, Translation, Sharing.
47 MIPS R4000: Address Space Model. ASID = Address Space Identifier. Two processes, A and B (ASID = 12 and ASID = 13), have independent address spaces. All address spaces use a standard memory map. [Figure: each process's address space runs from 0 to 2 GB; the region above 2 GB is marked Address Error and may only be accessed by kernel/supervisor.] When Process A writes its address 9, it writes to a different physical memory location than Process B's address 9. To let Process A and B share memory, the OS maps parts of ASID 12 and ASID 13 to the same physical memory locations. Still works (slowly!) if a process accesses more virtual memory than the machine has physical memory.
48 MIPS R4000: Who's Running on the CPU? [Figure: System Control Registers, with register numbers in parentheses, and the TLB (entries 47 to 0, including "safe" entries; see Random register, contents of TLB Wired). Used with memory management system: Index (0), Random (1), EntryLo0 (2), EntryLo1 (3), PageMask (5), Wired (6), EntryHi (10), PRId (15), Config (16), LLAddr (17), TagLo (28), TagHi (29). Used with exception processing: Context (4), BadVAddr (8), Count (9), Compare (11), Status (12), Cause (13), EPC (14), WatchLo (18), WatchHi (19), XContext (20), ECC (26), CacheErr (27), ErrorEPC (30).] Status (12): Indicates user, supervisor, or kernel mode. User cannot write supervisor/kernel bits. Supervisor cannot write kernel bit. User cannot change address translation configuration. EntryHi (10): 8-bit ASID field codes virtual address space ID.
49 MIPS Address Translation: How it works. [Figure: the CPU issues virtual addresses (A0-A31) to the Translation Look-Aside Buffer (TLB), which supplies physical addresses (A0-A31) to memory; data lines D0-D31.] Translation Look-Aside Buffer (TLB): a small fully-associative cache of mappings from virtual to physical addresses. TLB also contains ASID and kernel/supervisor bits for virtual address. Fast common case: virtual address is in TLB, process has permission to read/write it. What is the table of mappings that it caches?
50 Page tables encode virtual address spaces. A virtual address space is divided into blocks of memory called pages. A machine usually supports pages of a few sizes (MIPS R4000): 4 Kbytes, 16 Kbytes, 64 Kbytes, 256 Kbytes, 1 Mbyte, 4 Mbytes, 16 Mbytes. The OS manages the page table for each ASID (one page table per ASID). A page table is indexed by a virtual address. A valid page table entry codes the physical memory "frame" address for the page. [Figure: a virtual address indexes the page table, whose entries point to frames in the physical memory space.]
51 The TLB caches page table entries. [Figure: the virtual address (page no., offset) indexes the page table for the ASID, which is located in physical memory; each entry holds a valid bit V, access rights, and a physical address PA; the physical address is formed from the physical page no. and the unchanged offset. The TLB caches page-to-frame mappings.] In this example, physical and virtual pages must be the same size! MIPS handles TLB misses in software (random replacement). Other machines use hardware. V=0 pages either reside on disk or have not yet been allocated. OS handles V=0: "Page fault".
52 MIPS R4000 TLB: A closer look... [Figure: CPU, TLB, and memory system, as on slide 49.] Virtual address with 1M ($2^{20}$) 4-Kbyte pages: a 20-bit VPN (20 bits = 1M pages) and a 12-bit offset. The TLB entry's 8-bit ASID is checked against the CP0 ASID. Bits 31, 30 and 29 of the virtual address select user, supervisor, or kernel address spaces. Virtual-to-physical translation in TLB: the VPN is replaced by the PFN, and the offset is passed unchanged to physical memory. 36-bit physical address (bits 35 to 0): PFN, Offset. Physical space larger than virtual space!
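The translation path described on slides 50 to 52 can be summarized in a few lines: split the virtual address into VPN and offset, look up (ASID, VPN) in the TLB, refill from the per-ASID page table on a miss, and raise a page fault when no valid mapping exists. This is a simplified sketch assuming 4-Kbyte pages only. The dictionaries, the `PageFault` exception, and the frame numbers are illustrative; access rights, the kernel/supervisor bits, and the TLB's limited size with random replacement are left out.

```python
PAGE_OFFSET_BITS = 12  # 4-Kbyte pages: the low 12 bits are the page offset


class PageFault(Exception):
    """Raised when the page table holds no valid mapping (V = 0)."""


def translate(asid: int, vaddr: int, tlb: dict, page_tables: dict) -> int:
    """Translate a virtual address to a physical address for one ASID.

    tlb maps (asid, vpn) -> pfn; page_tables maps asid -> {vpn: pfn}."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    key = (asid, vpn)
    if key not in tlb:
        # TLB miss: consult the page table for this address space.
        pfn = page_tables.get(asid, {}).get(vpn)
        if pfn is None:
            raise PageFault(f"ASID {asid}, virtual page {vpn:#x}")
        tlb[key] = pfn  # refill, so the next access takes the fast path
    # The offset passes through unchanged; only the page number is replaced.
    return (tlb[key] << PAGE_OFFSET_BITS) | offset


if __name__ == "__main__":
    page_tables = {12: {0x0: 0x5}, 13: {0x0: 0x9}}  # hypothetical frame numbers
    tlb = {}
    # The same virtual address 9 lands in different frames for ASID 12 and 13.
    print(hex(translate(12, 9, tlb, page_tables)))
    print(hex(translate(13, 9, tlb, page_tables)))
```

Keying the TLB on the ASID as well as the VPN is what lets entries from different processes coexist without being flushed on a context switch.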
53 Can TLB and caching be overlapped? [Figure: the virtual page number goes to the TLB while the page offset (index, byte select) addresses the cache tags, valid bits, and cache data in parallel; the physical address from the TLB is compared (=) with the cache tag to produce Hit, and the selected cache block is the data out.] This works, but... Q. What is the downside? A. Inflexibility. VPN size locked to cache tag size.
54 Can we cache virtual addresses? [Figure: the CPU's virtual addresses go to a virtual cache; the TLB sits between the cache and main memory.] Only use TLB on a cache miss! Downside: a subtle, difficult problem. What is it? A. Synonym problem. If two address spaces share a physical frame, data may be in cache twice. Maintaining consistency is a challenge.
55 Virtualization
56 Parallels: Running Windows on a Mac. Like software emulating a PC, but different. Use an Intel-based Mac, runs on top of OS X. Uses hardware support to create a fast virtual PC that boots Windows. Reasonable performance: virtual CPU runs 33% slower than running on physical CPU; 2 GB physical memory for a 512 MB virtual PC to run w/o disk swaps.
57 Hardware assist? What do we mean? In an emulator, we run Windows code by simulating the CPU in software. In a virtual machine, we let "safe" instructions (ex: ADD R3 R2 R1) run on the actual hardware. We use hardware features to deny direct execution of instructions that could "break out" of the virtual machine. We have seen an example of this sort of hardware feature earlier today...
58 Recall: A LW that misses TLB launches a program! [Figure: the TLB and page table diagram from slide 51.] In this example, physical and virtual pages must be the same size! MIPS handles TLB misses in software (random replacement). Other machines use hardware. V=0 pages either reside on disk or have not yet been allocated. OS handles V=0: "Page fault".
59 General Mechanism: Instruction Trap. Conceptually, we set CPU up to "rewrite" unsafe instructions with a function call.
Sample Program | What CPU Does
AND R2,R2,R1 | AND R2,R2,R1
ADD R4,R3,R2 | ADD R4,R3,R2
UNSAFE | JAL UNSAFE_STUB
 | NOP
ADD R5,R2,R2 | ADD R5,R2,R2
CPU "rewrites" instructions??? We have already done this for pipelining...
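The control flow of an instruction trap can be modeled in software: safe instructions execute normally, and the unsafe one is diverted to a stub, then execution resumes with the next instruction. This is only a toy interpreter meant to show the dispatch. In a real virtual machine the safe instructions run directly on the hardware and the trap is raised by the CPU. The tuple encoding of instructions and the `unsafe_stub` callback are assumptions made for this sketch.

```python
def run(program: list, regs: dict, unsafe_stub) -> dict:
    """Execute (opcode, operands...) tuples against a register dictionary.
    An UNSAFE instruction is not executed; control goes to `unsafe_stub`,
    much like the JAL UNSAFE_STUB shown in the 'What CPU Does' column."""
    for opcode, *operands in program:
        if opcode == "AND":
            rd, rs, rt = operands
            regs[rd] = regs[rs] & regs[rt]
        elif opcode == "ADD":
            rd, rs, rt = operands
            regs[rd] = regs[rs] + regs[rt]
        elif opcode == "UNSAFE":
            unsafe_stub(regs)  # the virtual machine monitor decides what happens
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return regs


if __name__ == "__main__":
    sample_program = [
        ("AND", "R2", "R2", "R1"),
        ("ADD", "R4", "R3", "R2"),
        ("UNSAFE",),
        ("ADD", "R5", "R2", "R2"),
    ]
    registers = {"R1": 0b1100, "R2": 0b1010, "R3": 1, "R4": 0, "R5": 0}
    run(sample_program, registers, lambda regs: print("trapped with", regs))
    print(registers)
```

The program itself never sees the substitution, which is the property the slide's "rewrite" wording is getting at.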
60 Recall: Muxing in NOPs to do stalls. [Figure: pipeline datapath with Stage #1 (Instr Fetch: PC, +0x4, instruction memory), Stage #2 (Decode & Reg Fetch: register file), and Stage #3.] Sample program: ADD R4,R3,R2; OR R5,R4,R2. Keep executing the OR instruction (in stage #2) until R4 is ready. Until then, send NOPs to 2/3. Let ADD proceed to WB stage, so that R4 is written to regfile. New datapath hardware: (1) Mux into 2/3 to feed in NOP. (2) Write enable on PC and 1/2. Freeze PC and 1/2 until stall is over.
61 Conclusion: Superpipelining, Superscalar. The 5 stage pipeline: a starting point for performance enhancements, a building block for multiprocessing. Superpipelining: Reduce critical path by adding more pipeline stages. Has the potential to hurt the CPI. Superscalar: Multiple instructions at once. Programs must fit the issue rules. Adds complexity.
62 Next Monday: This Friday: Project Meeting, 10 AM, 125 Cory. Final Presentation: Fri, Dec 5.
More informationChapter 4 The Processor 1. Chapter 4A. The Processor
Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationECE331: Hardware Organization and Design
ECE331: Hardware Organization and Design Lecture 35: Final Exam Review Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Material from Earlier in the Semester Throughput and latency
More informationComputer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining
Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Single-Cycle Design Problems Assuming fixed-period clock every instruction datapath uses one
More informationHY425 Lecture 05: Branch Prediction
HY425 Lecture 05: Branch Prediction Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS October 19, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 05: Branch Prediction 1 / 45 Exploiting ILP in hardware
More informationCS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.
CS 2410 Mid term (fall 2015) Name: Question 1 (10 points) Indicate which of the following statements is true and which is false. (1) SMT architectures reduces the thread context switch time by saving in
More informationInstruction Pipelining Review
Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number
More informationCS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25
CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem
More informationInstruction Level Parallelism. Appendix C and Chapter 3, HP5e
Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation
More informationAdvanced processor designs
Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The
More information3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?
CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:
More informationCS 61C: Great Ideas in Computer Architecture Pipelining and Hazards
CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards Instructors: Vladimir Stojanovic and Nicholas Weaver http://inst.eecs.berkeley.edu/~cs61c/sp16 1 Pipelined Execution Representation Time
More informationThe Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture
The Processor Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut CSE3666: Introduction to Computer Architecture Introduction CPU performance factors Instruction count
More informationMultiple Instruction Issue. Superscalars
Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths
More informationComputer Architecture
Lecture 3: Pipelining Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture Measurements and metrics : Performance, Cost, Dependability, Power Guidelines and principles in
More informationCPU Pipelining Issues
CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! L17 Pipeline Issues & Memory 1 Pipelining Improve performance by increasing instruction throughput
More informationCS146 Computer Architecture. Fall Midterm Exam
CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state
More informationLecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1
Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number
More informationAs the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.
Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationPage # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,
More informationPage 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson
More informationCS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST
CS 110 Computer Architecture Pipelining Guest Lecture: Shu Yin http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University Slides based on UC Berkley's CS61C
More information4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16
4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order
More informationCO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19
CO2-3224 Computer Architecture and Programming Languages CAPL Lecture 8 & 9 Dr. Kinga Lipskoch Fall 27 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be
More informationL19 Pipelined CPU I 1. Where are the registers? Study Chapter 6 of Text. Pipelined CPUs. Comp 411 Fall /07/07
Pipelined CPUs Where are the registers? Study Chapter 6 of Text L19 Pipelined CPU I 1 Review of CPU Performance MIPS = Millions of Instructions/Second MIPS = Freq CPI Freq = Clock Frequency, MHz CPI =
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware 4.1 Introduction We will examine two MIPS implementations
More informationPipelined CPUs. Study Chapter 4 of Text. Where are the registers?
Pipelined CPUs Where are the registers? Study Chapter 4 of Text Second Quiz on Friday. Covers lectures 8-14. Open book, open note, no computers or calculators. L17 Pipelined CPU I 1 Review of CPU Performance
More informationImproving Performance: Pipelining
Improving Performance: Pipelining Memory General registers Memory ID EXE MEM WB Instruction Fetch (includes PC increment) ID Instruction Decode + fetching values from general purpose registers EXE EXEcute
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 4 Testing Processors 2005-1-27 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Ted Hong and David Marquardt www-inst.eecs.berkeley.edu/~cs152/ Last
More informationCS/CoE 1541 Mid Term Exam (Fall 2018).
CS/CoE 1541 Mid Term Exam (Fall 2018). Name: Question 1: (6+3+3+4+4=20 points) For this question, refer to the following pipeline architecture. a) Consider the execution of the following code (5 instructions)
More informationCMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1
CMCS 611-101 Advanced Computer Architecture Lecture 9 Pipeline Implementation Challenges October 5, 2009 www.csee.umbc.edu/~younis/cmsc611/cmsc611.htm Mohamed Younis CMCS 611, Advanced Computer Architecture
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationc. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?
Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined
More informationLecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S
Lecture 6 MIPS R4000 and Instruction Level Parallelism Computer Architectures 521480S Case Study: MIPS R4000 (200 MHz, 64-bit instructions, MIPS-3 instruction set) 8 Stage Pipeline: first half of fetching
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 10 -- Cache I 2014-2-20 John Lazzaro (not a prof - John is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/ Play: CS 152 L10: Cache I UC
More informationMinimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline
Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding
More informationCISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles
CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationOrange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining
Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture Pipelining Recall Pipelining is parallelizing execution Key to speedups in processors Split instruction
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationRecall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 14 - Cache Design and Coherence 2014-3-6 John Lazzaro (not a prof - John is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/ Play: 1 Today:
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More information14:332:331 Pipelined Datapath
14:332:331 Pipelined Datapath I n s t r. O r d e r Inst 0 Inst 1 Inst 2 Inst 3 Inst 4 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be timed to accommodate
More informationChapter 4. The Processor
Chapter 4 The Processor 1 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationEECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 13 EE141
EECS 151/251A Fall 2017 Digital Design and Integrated Circuits Instructor: John Wawrzynek and Nicholas Weaver Lecture 13 Project Introduction You will design and optimize a RISC-V processor Phase 1: Design
More informationCS 152 Computer Architecture and Engineering
CS 52 Computer Architecture and Engineering Lecture 6 -- Midterm I Review Session 204-3-3 John Lazzaro (not a prof - John is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs52/ Play: CS 52 L6: Midterm
More informationOutline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches
Session xploiting ILP with SW Approaches lectrical and Computer ngineering University of Alabama in Huntsville Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar,
More informationBranch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines
6.823, L15--1 Branch Prediction & Speculative Execution Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 6.823, L15--2 Branch Penalties in Modern Pipelines UltraSPARC-III
More informationINSTRUCTION LEVEL PARALLELISM
INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,
More informationInstr. execution impl. view
Pipelining Sangyeun Cho Computer Science Department Instr. execution impl. view Single (long) cycle implementation Multi-cycle implementation Pipelined implementation Processing an instruction Fetch instruction
More informationModern Computer Architecture
Modern Computer Architecture Lecture2 Pipelining: Basic and Intermediate Concepts Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each
More informationCISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles. Interrupts and Exceptions. Device Interrupt (Say, arrival of network message)
CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles Michela Taufer Interrupts and Exceptions http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy
More informationPage 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationCMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions
CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3 Long Instructions & MIPS Case Study Complications With Long Instructions So far, all MIPS instructions take 5 cycles But haven't talked
More informationInstruction word R0 R1 R2 R3 R4 R5 R6 R8 R12 R31
4.16 Exercises 419 Exercise 4.11 In this exercise we examine in detail how an instruction is executed in a single-cycle datapath. Problems in this exercise refer to a clock cycle in which the processor
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 15 Cache II 2005-3-8 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Ted Hong and David Marquardt www-inst.eecs.berkeley.edu/~cs152/ Last Time: Locality
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More information