Calcolatori Elettronici II: Computer Architecture


1 Calcolatori Elettronici II (A.A. 9/): Computer Architecture

Processor: architecture (control unit and datapath), pipeline issues and high-performance solutions.
Memory hierarchy: cache (L1, L2, L3, ...), central memory, mass memory, backup storage; why and how.
I/O: polling, interrupt, DMA.

CPU, Von Neumann: control unit (CTRL) and datapath (DP) share a single memory holding both instructions and data.
CPU, Harvard: CTRL and DP access separate instruction and data memories.

2 CPU pipeline issues

CTRL unit: instruction fetch, instruction decode, branch prediction, hazard management.
Datapath: register file; functional units (adder, multiplier, divider, logic functions, ...).
Pipeline: static or dynamic; ILP (instruction-level parallelism); speculative execution.

Hazards (cause pipeline stalls):
Data hazard (RaR, RaW, WaR, WaW): data forwarding, register renaming.
Control hazard: branch prediction, conditional execution.
Structural hazard.

Conditional execution. Pseudocode: R0 = R1 - R2; if (R0 < 0) R0 = 0.
Using branches:
SUB R0, R1, R2 ; R0 <- R1-R2
BPL LABEL ; result is negative?
MOV R0, #0 ; R0 <- 0
LABEL: ...
Using conditional execution:
SUB R0, R1, R2 ; R0 <- R1-R2
MOVMI R0, #0 ; R0 <- 0 if result is negative

Register renaming. Static renaming is done by the compiler, which must take branches and subroutines into account. Example (pipeline IF ID OF EX ME WB):
DIVF F6, F2, F8
SUBF F8, F2, F4 ; WaR on F8 with DIVF
ADDF F6, F2, F8 ; WaW on F6 with DIVF, RaW on F8 (data forwarding)
After renaming F8 and F6 to spare registers S and T:
DIVF S, F2, F8
SUBF T, F2, F4
ADDF F6, F2, T
Dynamic register renaming: reservation stations, Tomasulo's algorithm (IBM 360/91 FP unit).
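The renaming idea above can be sketched in a few lines: give every destination a fresh physical name and redirect later reads through a rename map, so WaR and WaW hazards disappear while true RaW dependences survive. This is a minimal illustrative model (instruction tuples and the `p0, p1, ...` names are assumptions, not the slides' notation):

```python
# Minimal sketch of register renaming: every instruction's destination gets a
# fresh physical name, and later reads are redirected through a rename map.
# This removes WaR and WaW hazards; true RaW dependences are preserved.

def rename(instrs, arch_regs):
    rename_map = {r: r for r in arch_regs}    # architectural -> current name
    fresh = iter(f"p{i}" for i in range(1000))
    out = []
    for op, dst, *srcs in instrs:
        srcs = [rename_map[s] for s in srcs]  # read through the current map
        new = next(fresh)                     # fresh destination kills WaR/WaW
        rename_map[dst] = new
        out.append((op, new, *srcs))
    return out

prog = [("DIVF", "F6", "F2", "F8"),
        ("SUBF", "F8", "F2", "F4"),   # WaR on F8, WaW on F6 below
        ("ADDF", "F6", "F2", "F8")]   # RaW: reads SUBF's renamed result
renamed = rename(prog, ["F2", "F4", "F6", "F8"])
for i in renamed:
    print(i)
```

The ADDF now reads the SUBF's new name instead of the architectural F8, so the DIVF can finish in any order.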

3 Branch prediction

Static:
delay slots to fill (programmer/compiler) [MIPS]
a prediction bit in the instruction
always predict not taken (do not insert in the BTB if correctly predicted)
always predict taken (insert in the BTB)
predict taken if DEST < PC (backward branches, i.e. loops)

Dynamic: history-based prediction. A 2-bit predictor has four states (T,T), (N,T), (T,N), (N,N); the prediction P flips only after two consecutive misses, so a single anomalous outcome does not change it.

Branch Target Buffer (BTB): each entry holds TAG, STAT (prediction state), DEST (target address). The branch address is split into TAG and IDX; IDX selects the entry and TAG is compared. The BTB can be organized in n ways (set associative).
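The 2-bit scheme above can be simulated directly; this is a sketch where the counter plays the role of a BTB entry's STAT field (the loop pattern and initial state are illustrative assumptions):

```python
# Sketch of a 2-bit saturating-counter predictor: states 0,1 predict not
# taken, states 2,3 predict taken; the prediction flips only after two
# consecutive mispredictions.

def simulate(outcomes, state=0):
    hits = 0
    for taken in outcomes:
        predict_taken = state >= 2
        hits += (predict_taken == taken)
        # saturate the counter toward the observed outcome
        state = min(3, state + 1) if taken else max(0, state - 1)
    return hits

# A loop branch: taken nine times, then falls through once.
pattern = [True] * 9 + [False]
print(simulate(pattern, state=3), "of", len(pattern))  # 9 of 10
```

Starting from the strongly-taken state, only the final fall-through mispredicts; a 1-bit predictor would mispredict twice per loop execution (last and first iteration).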

4 Conditional execution, speculative execution, superscalar

Conditional execution: the instruction is fetched but executed only if a condition is true (ARM: ADDEQ R0, R1, R2).
Speculative execution: both branch paths are executed; wrong results are discarded.

Superscalar: multiple pipelines (replicated IF ID OF EX ME WB, or a shared front end IF ID+OF feeding EX1 ... EXn) complete more than one instruction per cycle.

HW multithreading: the pipelines are fed with instructions from different threads (Pipe 1: thread A, thread C, ...; Pipe 2: thread B, thread D, ...), so a stall in one thread does not idle the machine.

5 In-order execution

In-order start, in-order end: slow instructions cause stalls even with no hazards (e.g. addf takes 5 cycles, mov takes 1 cycle).
In-order start, out-of-order end, in-order write-back: reorder buffer.
In-order start, out-of-order end and write-back: history buffer.

Reservation Shift Register (RSR). Each row holds FU (functional unit used), Rd (destination register), V (valid), PC (program counter). An instruction that requires k cycles is inserted in row k; all positions before k are marked as used, so nothing issued later can complete earlier. At each cycle, data in the RSR shift up by one row.

Example: 0: mul (multi-cycle); 4: mov (1 cycle); 8: addf (multi-cycle). The mul reserves the rows below it, so the mov must wait for a free slot: in-order start, in-order end.

6 RSR with a ReOrder Buffer (ROB)

The RSR row gains a ROBptr field (pointer to the ROB entry); each ROB entry holds Rd, C (completed), RES (result).

An instruction that requires k cycles is inserted in row k of the RSR; only that row must be free, so short instructions can overtake long ones. At the same time an ROB entry is allocated (not entirely filled); the ROB is a circular buffer. At each cycle, data in the RSR shift up one row. When an instruction exits the RSR, its result is written into its ROB entry (C set, RES filled). When an instruction exits from the head of the ROB, its result is written to the destination.

Example: 0: mul (multi-cycle); 4: mov (1 cycle); 8: addf (multi-cycle).

In-order execution: in-order start, out-of-order end, in-order write-back.
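The RSR's shifting behavior can be sketched with a toy model. Assumptions not in the slides: one issue attempt per cycle, write-back in the cycle an instruction leaves row 1, and only the target row is checked at issue (the out-of-order-end variant used with the ROB; marking all lower rows as used would instead force in-order end):

```python
# Toy model of a Reservation Shift Register: an instruction needing k cycles
# is placed in row k (stalling if that row is busy); each cycle every row
# shifts up one position, and whatever leaves row 1 writes back that cycle.

def run(program):
    # program: list of (name, latency_cycles), issued in order, one per cycle
    rsr = {}                  # row -> instruction name
    completions = []
    pending = list(program)
    cycle = 0
    while pending or rsr:
        cycle += 1
        if 1 in rsr:                          # row 1 completes this cycle
            completions.append((cycle, rsr.pop(1)))
        rsr = {row - 1: name for row, name in rsr.items()}   # shift up
        if pending:
            name, k = pending[0]
            if k not in rsr:                  # row k free: issue
                rsr[k] = name
                pending.pop(0)
    return completions

print(run([("mul", 3), ("mov", 1), ("addf", 5)]))
# [(3, 'mov'), (4, 'mul'), (8, 'addf')]
```

The 1-cycle mov, issued after the 3-cycle mul, completes before it: out-of-order end, which is why the ROB is needed to restore program order at write-back.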

7 ReOrder Buffer: worked example

Instructions: 0: addf (multi-cycle); 4: mov (1 cycle); 8: mul (multi-cycle). Four snapshots show the RSR rows shifting up while ROB entries fill between the head and tail pointers: the addf entry is allocated first, then the mov entry, then the mul entry; the mov completes (C set, RES written) while the addf is still executing.

8 ReOrder Buffer: worked example (continued)

The mov and mul complete out of order, but entries leave the ROB only from the head. The instruction in ROB(0) can exit as soon as it is completed: first the addf writes its result to its F register, then the mov writes to its R register, then the mul. Write-back therefore happens in program order even though execution ended out of order.
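The retire discipline described above, complete in any order, retire only from the head, can be sketched as follows (the entry fields and the dict-based register file are illustrative assumptions):

```python
from collections import deque

# Sketch of a reorder buffer: entries are allocated in program order; results
# arrive (complete) in any order; entries retire (write the register file)
# strictly from the head, so write-back stays in program order.

class ROB:
    def __init__(self):
        self.buf = deque()            # program order; head at the left
    def allocate(self, dest):
        entry = {"dest": dest, "done": False, "res": None}
        self.buf.append(entry)
        return entry
    def complete(self, entry, result):
        entry["done"], entry["res"] = True, result
    def retire(self, regfile):
        # drain completed entries from the head only
        while self.buf and self.buf[0]["done"]:
            e = self.buf.popleft()
            regfile[e["dest"]] = e["res"]

rob, regs = ROB(), {}
e1 = rob.allocate("F1")               # slow addf
e2 = rob.allocate("R1")               # fast mov
rob.complete(e2, 7)                   # mov finishes first (out of order)...
rob.retire(regs)
print(regs)                           # ...but cannot retire past the addf
rob.complete(e1, 2.5)
rob.retire(regs)
print(regs)
```

After the first retire the register file is untouched; once the older addf completes, both results drain in program order.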

9 Very Long Instruction Word (VLIW); History Buffer; memory hierarchy

VLIW: one instruction carries one operation per functional unit (op1 Rd1 Rsa1 Rsb1 | op2 Rd2 Rsa2 Rsb2 | ... | opn Rdn Rsan Rsbn, feeding FU1 ... FUn). Parallelism is explicit in the instructions; control is simplified; the compiler is complex; high CPU/memory bandwidth is required.

History Buffer (HB). RSR rows hold FU, V, HBptr; each HB entry holds Rd and OLD (the old destination value). An instruction that requires k cycles is inserted in row k of the RSR; an HB entry is filled, saving the current value of the destination. The HB is a circular buffer. At each cycle, data in the RSR shift up one row; when an instruction exits the RSR, its result is written directly to the destination. Until the instruction is in the HB, the old value can be restored if needed (interrupt, exception, mispredicted branch). The RSR+HB allows faster write-back: in-order start, out-of-order end and write-back.

Memory hierarchy. CPU/memory speed mismatch: fast memory has high cost (area/energy/$), so it is affordable only in small sizes, while large memories take many more cycles per access. Programs make many accesses to small areas.

Program characteristics: predictability, structure, linear data structures, sequential flow.
Principle of locality (locality of reference):
Temporal locality: an accessed memory location is likely to be accessed again in the near future.
Spatial locality: if the program accesses memory location X, it is probable that it will access locations X±1, X±2, ..., X±n (n small).

10 Memory hierarchy levels; memory structure

Small, expensive, fast at the top: CPU registers, caches (one or more levels, on- and off-chip). Then RAM, mass storage (HDD, flash), backups (tape): big, cheap, slow.

A register is built from edge-triggered D flip-flops (D in, CLK, LOAD enable, Q out); the LOAD signal, gated with the clock or through a feedback MUX, keeps the stored value when inactive.

Memory structure: the address enters a row decoder that drives one row of the memory array; bitlines are precharged. On a read, sense amplifiers restore the levels and a column MUX selects the data out; on a write, data in is driven onto the bitlines.

11 SRAM and DRAM cells; memory timing

SRAM cell: six transistors between VDD and GND; two cross-coupled inverters (P1/N1, P2/N2) store the bit, and two access transistors (N3, N4) connect it to the bitlines BL and BL_b under control of the wordline WL.
1. Precharge the bitlines to VDD/2. On a read, the cell develops a small differential (VDD/2+delta vs VDD/2-delta) detected by the sense amplifier; on a write, the bitlines are kept driven to the values to store.
2. Address the wordlines.

DRAM cell: one transistor plus one capacitor (BL, WL). Small; destructive read, so data must be restored after each read; needs periodic refresh.

Memories: a ROM is characterized by its access time. An SRAM has a read access time and, for writes: address setup time (address stable before -WR), data setup time (data stable before -WR), address hold time (address stable after -WR).

12 DRAM timing; hierarchy lookup

DRAM: multiplexed address, sent in two phases (ROW, then COL).
RAS time (row address setup time): ROW stable before the -RAS signal.
Row address hold time: ROW stable after -RAS.
CAS time (column address setup time) and column address hold time: the same for COL and -CAS.
RAS access time: time between the -RAS signal and data ready (or CAS access time, measured from -CAS).
RAS/CAS precharge time: time between two accesses.

Memory hierarchy lookup: the CPU emits ADDRESS = i (0 <= i <= N-1); the request travels through REGS, L1, L2, L3, RAM, HDD, TAPE. On a miss at one level, the same address is presented to the next level.

13 Hit, miss, and access cost

On a hit the level returns the data (i found at that level). On a miss the request goes to the next level, paying the miss penalty MP (time/energy):
<Access> = Access_cache + MR x MP (MR = miss rate), counted as time or as energy.

Read hit: return the data. Read miss: read from the next levels; a whole line is read (exploits spatial locality).
Write hit: write-through or write-back. Write miss: write-allocate or write-no-allocate.

Associative memory: addressed by content; implemented with CAM (content addressable memories) or with standard memories plus control logic.
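The access formula above applies recursively down the hierarchy: the miss penalty of one level is the average access time of the next. A small sketch (all latencies and miss rates here are made-up numbers, not from the slides):

```python
# Sketch of <Access> = Access_cache + MR * MP applied level by level:
# the miss penalty of a level is the average access time of the level below.

def amat(levels, memory_time):
    # levels: list of (hit_time, miss_rate), outermost (L1) first
    t = memory_time
    for hit_time, miss_rate in reversed(levels):
        t = hit_time + miss_rate * t
    return t

# Hypothetical L1 (1 cycle, 5% misses) and L2 (8 cycles, 20% local miss
# rate) in front of a 100-cycle main memory.
print(amat([(1, 0.05), (8, 0.20)], memory_time=100))
```

With these numbers L2 costs 8 + 0.2·100 = 28 cycles on average, so L1 averages 1 + 0.05·28 = 2.4 cycles: a small, fast level filters most of the traffic away from the slow one.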

14 Cache organizations

Fully associative: each line holds V (valid), TAG, and data; the address is split into TAG and DSP (displacement in the line); every stored TAG is compared in parallel (COMP -> HIT/MISS). Lsize = 2^#DSP (line or block size).

Direct-mapped: the address is split into TAG, IDX, DSP; IDX selects one line, a single comparator checks the TAG, a MUX selects the word. Lines = 2^#IDX (number of lines or blocks); Size = Lines x Lsize; actual size = Size + (TAG+V) x Lines.

Set-associative: n direct-mapped ways accessed in parallel; #TAG = #ADDRESS - #IDX - #DSP; HIT = H1 + H2 + ... + Hn (OR of the per-way hits); Lines = 2^#IDX per way; Size = nways x Lines x Lsize; the actual size adds (TAG+V) bits per line. nways = 1 gives a direct-mapped cache; Lines = 1 gives a fully associative cache.
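The TAG | IDX | DSP split above is just bit slicing; a sketch (the cache geometry and addresses are illustrative):

```python
# Sketch of splitting an address into TAG | IDX | DSP for a direct-mapped
# cache: DSP selects the byte in the line, IDX selects the line, and the
# remaining high bits form the TAG stored beside it.

def split(addr, lines, lsize):
    dsp_bits = lsize.bit_length() - 1     # lsize = 2**#DSP
    idx_bits = lines.bit_length() - 1     # lines = 2**#IDX
    dsp = addr & (lsize - 1)
    idx = (addr >> dsp_bits) & (lines - 1)
    tag = addr >> (dsp_bits + idx_bits)
    return tag, idx, dsp

# 64-line cache with 16-byte lines: two addresses exactly one cache size
# (1 KiB) apart share IDX and DSP and differ only in TAG, so in a
# direct-mapped cache they conflict for the same line.
print(split(0x1234, lines=64, lsize=16))             # (4, 35, 4)
print(split(0x1234 + 64 * 16, lines=64, lsize=16))   # (5, 35, 4)
```

This is exactly the conflict-miss mechanism: same IDX, different TAG.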

15 Replacement policies; the 3-C's miss model

Replacement: LRU (counters or shift registers, nways x Lines of them), pseudo-LRU, FIFO, random.
LRU with a stack of registers per line (reg0 ... reg3): insert the last accessed way on top, shifting the other values down.
LRU with counters per line (way0 ... way3): reset the counter of the last accessed way and increment the counters that were below the modified one.

Pseudo-LRU: 4 ways, 3 bits (B0, B1, B2) per line, arranged as a tree. B0 selects the pair holding the victim, B1 or B2 the way within the pair:
(B0,B1,B2) = 00x -> replace way 0; = 01x -> replace way 1; = 1x0 -> replace way 2; = 1x1 -> replace way 3.
On each access the traversed bits are complemented (B = not B) so that they point away from the way just used.

Misses (the 3-C's model):
Compulsory: cold-start miss.
Capacity: a miss that would also occur in a fully associative cache of the same size.
Conflict (collision): a miss that would not occur in a fully associative cache. Fully associative caches have no conflict misses; too many conflicts cause thrashing. Note that conflict misses can avoid capacity misses.
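The tree pseudo-LRU scheme can be sketched in a few lines. The update convention below (set the bits on the access path to point at the *other* side) is one common interpretation of the "B = not B" rule and is assumed here:

```python
# Sketch of 4-way tree pseudo-LRU with three bits (B0, B1, B2) per set:
# B0 selects the pair containing the victim, B1/B2 the way within a pair;
# each access updates the bits on its path to point away from the way used.

def victim(bits):
    b0, b1, b2 = bits
    if b0 == 0:
        return 0 if b1 == 0 else 1    # (B0,B1) = 00 -> way 0, 01 -> way 1
    return 2 if b2 == 0 else 3        # (B0,B2) = 10 -> way 2, 11 -> way 3

def touch(bits, way):
    b0, b1, b2 = bits
    if way in (0, 1):
        b0, b1 = 1, 1 - way           # point at the other pair / other way
    else:
        b0, b2 = 0, 3 - way
    return [b0, b1, b2]

bits = [0, 0, 0]
for way in (0, 2, 1):                 # access ways 0, 2, 1
    bits = touch(bits, way)
print(victim(bits))                   # 3: the only way never accessed
```

Three bits approximate the LRU order that exact LRU would need more state to track; the approximation can occasionally pick a non-LRU victim, which is the price of the cheap encoding.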

16 Conflict misses can avoid capacity misses: an example

Repeated, sequential accesses from 0 to B (B+1 bytes); cache size B bytes, LS = 4.

1. Associative cache (LRU), size B, LS = 4: access to 0: miss (compulsory), insert the whole line (addresses 0,1,2,3); accesses to 1,2,3: hit; access to 4: miss (compulsory), insert line (4,5,6,7); ... access to B: miss (capacity), replace the line (0,1,2,3). On the next pass, access to 0: miss (capacity), replace line (4,5,6,7); accesses to 1,2,3: hit; access to 4: miss (capacity); ... Every line reference misses on every pass: MR ~ 0.25 (MR = (B/4 + 1)/(B+1)).

2. Direct-mapped cache, size B, LS = 4: the first pass behaves the same (compulsory misses), and the access to B misses, replacing the line (0,1,2,3) that shares its index. On the next pass, access to 0: miss (conflict), replace the line containing B; accesses to 1,2,3: hit; access to 4: hit; ... Only the two lines sharing index 0 evict each other, 2 misses per pass: after N further passes MR = (B/4 + 1 + 2N) / ((N+1)(B+1)), which tends to 2/(B+1).

Rules of thumb: MR(direct-mapped, size N) ~ MR(2-way, size N/2); miss rate falls roughly as the square root of the size. Enlarging Lsize decreases MR but increases MP. [SPEC92]

Stack distance: given the program memory references addr1, addr2, ..., addrn, push each reference on a stack (removing it from the stack if already present). The stack distance of reference R is its position in the stack (if present) or infinite (if not present).
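The sweep example can be checked by simulation. This sketch models both organizations on the repeated 0..B byte sweep (the concrete B = 64 is an assumption chosen to keep the run small):

```python
# Sketch reproducing the sweep example: repeated sequential byte accesses
# 0..B over a cache of size B bytes with 4-byte lines. The fully associative
# LRU cache thrashes (every line reference misses), while the direct-mapped
# cache only keeps conflicting on the two lines that share index 0.

def misses(cache_size, lsize, passes, assoc):
    nlines = cache_size // lsize
    fa_lru, dm, miss = [], {}, 0          # fa_lru: LRU order, MRU at the end
    for _ in range(passes):
        for addr in range(cache_size + 1):        # bytes 0..B inclusive
            line = addr // lsize
            if assoc == "full":
                if line in fa_lru:
                    fa_lru.remove(line)           # hit: move to MRU
                else:
                    miss += 1
                    if len(fa_lru) == nlines:
                        fa_lru.pop(0)             # evict the LRU line
                fa_lru.append(line)
            else:                                 # direct-mapped
                idx = line % nlines
                if dm.get(idx) != line:
                    miss += 1
                    dm[idx] = line
    return miss

B = 64
print(misses(B, 4, passes=10, assoc="full"))   # (B/4 + 1) misses per pass
print(misses(B, 4, passes=10, assoc="dm"))     # first pass + 2 per pass
```

With B = 64 the associative cache misses 17 times on every pass (170 total), while the direct-mapped cache misses 17 times on the first pass and only twice per pass afterwards (35 total): its conflict evictions happen to protect the lines an LRU policy would have thrown away.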

17 Stack distance and hit probability

Example: for the reference stream 0, 4, 8, 0, 4, 8 the second reference to 8 finds two distinct references (0 and 4) above it in the stack, so SD(8) = 2.

Hit probability (hypothesis: uniform distribution of cache-line accesses over the sets). With D = stack distance, L = Lines, W = nways, a reference hits if fewer than W of the D intervening distinct lines map to its set:

P_HIT(D) = sum for a = 0 .. W-1 of C(D,a) (1/L)^a ((L-1)/L)^(D-a)

Direct-mapped cache (W = 1): P_HIT = ((L-1)/L)^D.
D = 0 (two consecutive references): P_HIT = 1.
D = 1 (access sequence: addr, other, addr): miss iff other has replaced addr, so P_MISS = 1/L and P_HIT = 1 - 1/L = (L-1)/L.
General D (addr, other1, ..., otherD, addr): hit iff no other_i has evicted addr, P_HIT = [P_HIT(D=1)]^D = [(L-1)/L]^D.

Multi-level caches. Inclusive: data in L1 are also in L2, in L3, ... Exclusive: data are in L1, or in L2, or ... (only one level). Real hierarchies are mainly inclusive (intermediate). Victim cache.

I/O (next topics): special instructions, memory mapped; polling, interrupt, DMA.
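Both the stack-distance bookkeeping and the hit-probability formula above can be sketched directly (the reference stream matches the example; treating a first reference as infinite distance is encoded as None):

```python
from math import comb

# Sketch of stack distance and the hit-probability model: a reference at
# stack distance D hits in a W-way cache with L sets when fewer than W of
# the D intervening distinct lines fall in its set (lines assumed to map to
# sets uniformly at random).

def stack_distances(refs):
    stack, out = [], []
    for r in refs:
        if r in stack:
            d = len(stack) - 1 - stack.index(r)   # distinct refs above r
            stack.remove(r)
        else:
            d = None                              # first reference: infinite
        stack.append(r)                           # r becomes the stack top
        out.append(d)
    return out

def p_hit(D, L, W):
    return sum(comb(D, a) * (1 / L) ** a * (1 - 1 / L) ** (D - a)
               for a in range(W))

print(stack_distances([0, 4, 8, 0, 4, 8]))   # SD = 2 on each re-reference
print(p_hit(3, L=16, W=1))                   # direct-mapped: ((L-1)/L)**D
```

Note that with W = 1 the sum collapses to the direct-mapped formula ((L-1)/L)^D, and with D < W the probability is 1: too few intervening lines to fill the set.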

18 I/O

Devices are used by reading and writing their internal registers. Do not cache data coming from HW.

Special instructions (in, out): a separate I/O address space; an I/O enable line distinguishes the access from a memory one.
Memory mapped: device registers appear at memory addresses (e.g. two consecutive addresses map to registers R0 and R1 of a device).

Polling: the device exposes status signals to check (e.g. READY, DEVREADY). With the status register mapped in memory, bits (1:0) = (READY, DEVREADY), and the data register at the next address:

MOV R1, #STATUS_ADDR ; status register (memory mapped)
L: LDR R0, [R1] ; read status (hw register)
TST R0, #1 ; data is ready?
BEQ L ; no: read again
MOV R1, #DATA_ADDR ; data register
LDR R0, [R1] ; read data (hw register)

Polling is very simple, but CPU time is wasted.

Interrupt: program the device for the data transfer, execute something else, get the data when the device sends a signal (interrupt). Interrupts have a priority.
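The busy-wait loop above can be mimicked with a fake device model; everything here (the Device class, the ready_after threshold) is an illustrative assumption standing in for real memory-mapped hardware:

```python
# Sketch of the polling loop against a simulated memory-mapped device:
# bit 0 of the status register is READY; the CPU spins on it, then reads
# the data register. The device model is illustrative, not real hardware.

class Device:
    def __init__(self, data, ready_after):
        self._data, self._polls = data, 0
        self._ready_after = ready_after       # status reads before READY
    def read_status(self):
        self._polls += 1
        return 1 if self._polls > self._ready_after else 0
    def read_data(self):
        return self._data

def poll(dev):
    while (dev.read_status() & 1) == 0:       # TST + BEQ: spin until READY
        pass                                  # CPU time wasted here
    return dev.read_data()

dev = Device(data=0x5A, ready_after=3)
print(hex(poll(dev)), "after", dev._polls, "status reads")
```

Every iteration of the spin loop is a CPU cycle doing no useful work, which is the cost that interrupts and DMA eliminate.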

19 Interrupt lines

Wiring options: one interrupt line per device into the CPU; many devices through an interrupt controller (HW1..HW4 direct, HW5..HW7 via INT CTRL); or a single shared line with daisy-chained acknowledge (ack propagates HW1 -> HW2 -> HW3). In all cases the device signals an interrupt and uses its internal registers to show that it is waiting to be served; the CPU reads the HW registers to find the devices to handle.

Maskable Interrupt (IRQ): the CPU can ignore it; there are instructions to mask/unmask interrupts.
Non Maskable Interrupt (NMI): always received; reserved for critical events (parity errors, power off).

Level-triggered interrupt: the line is kept high until the interrupt is handled. If the line is shared, all interrupts must be served: scan devices until a requesting one is found, handle its interrupt, then check the interrupt line again.
Edge-triggered interrupt: signaled by a pulse. If the line is shared, check all devices (more pulses can be merged); if masked, an interrupt can be lost, so a latch records the pulses.
Message-signaled interrupts.

20 Interrupt handling

1. Finish the current instruction. 2. Save the flags (not always) and the return address. 3. Signal that the interrupt is being handled. 4. Find the handling routine (it can depend on the interrupt line). 5. Jump to the routine: mask interrupts, access the device, unmask interrupts, handle the data transfer.

Precise interrupt: the PC is saved in a known position; all instructions up to the current one are executed; the current instruction is in a known state; all instructions after the current one are not executed, or their results are discarded. Otherwise the interrupt is imprecise.

Example: interrupts in the PC/AT. The Intel 8259A programmable interrupt controller holds IMR (Interrupt Mask Register), IRR (Interrupt Request Register), ISR (Interrupt Service Register); two 8259As are cascaded. Sequence: 1. 8259A: INTR = 1. 2. CPU: INTA pulse. 3. CPU: second INTA pulse. 4. 8259A: interrupt vector on the data bus (8 bits). 5. CPU: jump to the service routine (depends on the vector received).

Example: interrupt assignments in the PC/AT.
Master 8259: IRQ0 system timer; IRQ1 keyboard controller; IRQ2 cascade to slave 8259; IRQ3 serial ports (COM2 and COM4); IRQ4 serial ports (COM1 and COM3); IRQ5 parallel port LPT2; IRQ6 floppy disk controller; IRQ7 parallel port LPT1.
Slave 8259: IRQ8 real-time clock (RTC); IRQ12 mouse controller; IRQ13 math coprocessor; IRQ14 hard disk controller; IRQ15 hard disk controller.

21 DMA

The processor writes the device registers to set up the transfer: memory pointer, data size, transfer type. The device then reads/writes data in memory with its own rate and latency, and sends an interrupt when done.

Comparison. Polling: simple, but computationally expensive. Interrupt: the CPU transfers the data from device to memory, one interrupt for each data word. DMA: one interrupt for each data block; the I/O device must be able to act as bus master.


More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141 EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

Super Scalar. Kalyan Basu March 21,

Super Scalar. Kalyan Basu March 21, Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build

More information

Processor: Superscalars Dynamic Scheduling

Processor: Superscalars Dynamic Scheduling Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units 6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd

More information

Summary of Computer Architecture

Summary of Computer Architecture Summary of Computer Architecture Summary CHAP 1: INTRODUCTION Structure Top Level Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350):

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): Motivation for The Memory Hierarchy: { CPU/Memory Performance Gap The Principle Of Locality Cache $$$$$ Cache Basics:

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem

More information

CS152 Computer Architecture and Engineering. Complex Pipelines

CS152 Computer Architecture and Engineering. Complex Pipelines CS152 Computer Architecture and Engineering Complex Pipelines Assigned March 6 Problem Set #3 Due March 20 http://inst.eecs.berkeley.edu/~cs152/sp12 The problem sets are intended to help you learn the

More information

Current Microprocessors. Efficient Utilization of Hardware Blocks. Efficient Utilization of Hardware Blocks. Pipeline

Current Microprocessors. Efficient Utilization of Hardware Blocks. Efficient Utilization of Hardware Blocks. Pipeline Current Microprocessors Pipeline Efficient Utilization of Hardware Blocks Execution steps for an instruction:.send instruction address ().Instruction Fetch ().Store instruction ().Decode Instruction, fetch

More information

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

EECC551 Exam Review 4 questions out of 6 questions

EECC551 Exam Review 4 questions out of 6 questions EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving

More information

COSC 6385 Computer Architecture - Pipelining

COSC 6385 Computer Architecture - Pipelining COSC 6385 Computer Architecture - Pipelining Fall 2006 Some of the slides are based on a lecture by David Culler, Instruction Set Architecture Relevant features for distinguishing ISA s Internal storage

More information

CPU Pipelining Issues

CPU Pipelining Issues CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! L17 Pipeline Issues & Memory 1 Pipelining Improve performance by increasing instruction throughput

More information

Execution/Effective address

Execution/Effective address Pipelined RC 69 Pipelined RC Instruction Fetch IR mem[pc] NPC PC+4 Instruction Decode/Operands fetch A Regs[rs]; B regs[rt]; Imm sign extended immediate field Execution/Effective address Memory Ref ALUOutput

More information

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation CAD for VLSI 2 Pro ject - Superscalar Processor Implementation 1 Superscalar Processor Ob jective: The main objective is to implement a superscalar pipelined processor using Verilog HDL. This project may

More information

Lecture 9 Pipeline and Cache

Lecture 9 Pipeline and Cache Lecture 9 Pipeline and Cache Peng Liu liupeng@zju.edu.cn 1 What makes it easy Pipelining Review all instructions are the same length just a few instruction formats memory operands appear only in loads

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue

More information

PC Interrupt Structure and 8259 DMA Controllers

PC Interrupt Structure and 8259 DMA Controllers ELEC 379 : DESIGN OF DIGITAL AND MICROCOMPUTER SYSTEMS 1998/99 WINTER SESSION, TERM 2 PC Interrupt Structure and 8259 DMA Controllers This lecture covers the use of interrupts and the vectored interrupt

More information

Topics. Computer Organization CS Improving Performance. Opportunity for (Easy) Points. Three Generic Data Hazards

Topics. Computer Organization CS Improving Performance. Opportunity for (Easy) Points. Three Generic Data Hazards Computer Organization CS 231-01 Improving Performance Dr. William H. Robinson November 8, 2004 Topics Money's only important when you don't have any. Sting Cache Scoreboarding http://eecs.vanderbilt.edu/courses/cs231/

More information

CS 2410 Mid term (fall 2018)

CS 2410 Mid term (fall 2018) CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Final Review Shuai Wang Department of Computer Science and Technology Nanjing University Computer Architecture Computer architecture, like other architecture, is the art

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter IT 3123 Hardware and Software Concepts Notice: This session is being recorded. CPU and Memory June 11 Copyright 2005 by Bob Brown Latches Can store one bit of data Can be ganged together to store more

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1 Memory Hierarchy Maurizio Palesi Maurizio Palesi 1 References John L. Hennessy and David A. Patterson, Computer Architecture a Quantitative Approach, second edition, Morgan Kaufmann Chapter 5 Maurizio

More information

Mo Money, No Problems: Caches #2...

Mo Money, No Problems: Caches #2... Mo Money, No Problems: Caches #2... 1 Reminder: Cache Terms... Cache: A small and fast memory used to increase the performance of accessing a big and slow memory Uses temporal locality: The tendency to

More information

Floating Point/Multicycle Pipelining in DLX

Floating Point/Multicycle Pipelining in DLX Floating Point/Multicycle Pipelining in DLX Completion of DLX EX stage floating point arithmetic operations in one or two cycles is impractical since it requires: A much longer CPU clock cycle, and/or

More information

Memory Organization MEMORY ORGANIZATION. Memory Hierarchy. Main Memory. Auxiliary Memory. Associative Memory. Cache Memory.

Memory Organization MEMORY ORGANIZATION. Memory Hierarchy. Main Memory. Auxiliary Memory. Associative Memory. Cache Memory. MEMORY ORGANIZATION Memory Hierarchy Main Memory Auxiliary Memory Associative Memory Cache Memory Virtual Memory MEMORY HIERARCHY Memory Hierarchy Memory Hierarchy is to obtain the highest possible access

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard Consider: a = b + c; d = e - f; Assume loads have a latency of one clock cycle:

More information

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions Lecture 7: Pipelining Contd. Kunle Olukotun Gates 302 kunle@ogun.stanford.edu http://www-leland.stanford.edu/class/ee282h/ 1 More pipelining complications: Interrupts and Exceptions Hard to handle in pipelined

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case

More information

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle? CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined

More information

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University Lecture 4: Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee282 Lecture 4-1 Announcements HW1 is out (handout and online) Due on 10/15

More information

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model. Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution

More information

CS 152 Computer Architecture and Engineering. Lecture 6 - Memory

CS 152 Computer Architecture and Engineering. Lecture 6 - Memory CS 152 Computer Architecture and Engineering Lecture 6 - Memory Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste! http://inst.eecs.berkeley.edu/~cs152!

More information

Memory Hierarchy and Caches

Memory Hierarchy and Caches Memory Hierarchy and Caches COE 301 / ICS 233 Computer Organization Dr. Muhamed Mudawar College of Computer Sciences and Engineering King Fahd University of Petroleum and Minerals Presentation Outline

More information