COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Determined by ISA and compiler. Determined by CPU hardware

Size: px

Start display at page:

Download "COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Determined by ISA and compiler. Determined by CPU hardware"

Erick Derek Atkins
6 years ago
Views:

1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface ARM Editio Chapter 4 The Processor Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler CPI ad Cycle time Determied by CPU hardware We will examie two LEGv8 implemetatios A simplified versio A more realistic pipelied versio Simple subset, shows most aspects Memory referece: LDUR, STUR Arithmetic/logical: add, sub, ad, or, slt Cotrol trasfer: beq, j 4.1 Itroductio Chapter 4 The Processor 2

2 Istructio Executio PC istructio memory, fetch istructio Register umbers register file, read registers Depedig o istructio class Use ALU to calculate Arithmetic result Memory address for load/store Brach target address Access data memory for load/store PC target address or PC + 4 Chapter 4 The Processor 3 CPU Overview Chapter 4 The Processor 4

3 Multiplexers Ca t just joi wires together Use multiplexers Chapter 4 The Processor 5 Cotrol Chapter 4 The Processor 6

4 Logic Desig Basics Iformatio ecoded i biary Low voltage = 0, High voltage = 1 Oe wire per bit Multi-bit data ecoded o multi-wire buses Combiatioal elemet Operate o data Output is a fuctio of iput State (sequetial) elemets Store iformatio 4.2 Logic Desig Covetios Chapter 4 The Processor 7 Combiatioal Elemets AND-gate Y = A & B Adder Y = A + B A B + Y A B I0 I1 M u x S Y Multiplexer Y = S? I1 : I0 Y Arithmetic/Logic Uit Y = F(A, B) A ALU Y B F Chapter 4 The Processor 8

5 Sequetial Elemets Register: stores data i a circuit Uses a clock sigal to determie whe to update the stored value Edge-triggered: update whe Clk chages from 0 to 1 D Clk Q Clk D Q Chapter 4 The Processor 9 Sequetial Elemets Register with write cotrol Oly updates o clock edge whe write cotrol iput is 1 Used whe stored value is required later Clk D Write Clk Q Write D Q Chapter 4 The Processor 10

Datapath Datapath Elemets that process data ad addresses i the CPU Registers, ALUs, mux s, memories, We

6 Clockig Methodology Combiatioal logic trasforms data durig clock cycles Betwee clock edges Iput from state elemets, output to state elemet Logest delay determies clock period Chapter 4 The Processor 11 Buildig a Datapath Datapath Elemets that process data ad addresses i the CPU Registers, ALUs, mux s, memories, We will build a LEGv8 datapath icremetally Refiig the overview desig 4.3 Buildig a Datapath Chapter 4 The Processor 12

Istructios Read two register operads Perform

7 Istructio Fetch 32-bit register Icremet by 4 for ext istructio Chapter 4 The Processor 13 R-Format Istructios Read two register operads Perform arithmetic/logical operatio Write register result Chapter 4 The Processor 14

register operads Compare operads Use ALU, subtract ad check Zero output Calculate target address Sig-exted

8 Load/Store Istructios Read register operads Calculate address usig 16-bit offset Use ALU, but sig-exted offset Load: Read memory ad update register Store: Write register value to memory Chapter 4 The Processor 15 Brach Istructios Read register operads Compare operads Use ALU, subtract ad check Zero output Calculate target address Sig-exted displacemet Shift left 2 places (word displacemet) Add to PC + 4 Already calculated by istructio fetch Chapter 4 The Processor 16

9 Brach Istructios Just re-routes wires Sig-bit wire replicated Chapter 4 The Processor 17 Composig the Elemets First-cut data path does a istructio i oe clock cycle Each datapath elemet ca oly do oe fuctio at a time Hece, we eed separate istructio ad data memories Use multiplexers where alterate data sources are used for differet istructios Chapter 4 The Processor 18

10 R-Type/Load/Store Datapath Chapter 4 The Processor 19 Full Datapath Chapter 4 The Processor 20

11 ALU Cotrol ALU used for Load/Store: F = add Brach: F = subtract R-type: F depeds o opcode ALU cotrol Fuctio 0000 AND 0001 OR 0010 add 0110 subtract 0111 pass iput b 1100 NOR 4.4 A Simple Implemetatio Scheme Chapter 4 The Processor 21 ALU Cotrol Assume 2-bit ALUOp derived from opcode Combiatioal logic derives ALU cotrol opcode ALUOp Operatio Opcode field ALU fuctio ALU cotrol LDUR 00 load register XXXXXXXXXXX add 0010 STUR 00 store register XXXXXXXXXXX add 0010 CBZ 01 compare ad brach o zero XXXXXXXXXXX pass iput b 0111 R-type 10 add add 0010 subtract subtract 0110 AND AND 0000 ORR OR 0001 Chapter 4 The Processor 22

12 The Mai Cotrol Uit Cotrol sigals derived from istructio Chapter 4 The Processor 23 Datapath With Cotrol Chapter 4 The Processor 24

13 R-Type Istructio Chapter 4 The Processor 25 Load Istructio Chapter 4 The Processor 26

14 CBZ Istructio Chapter 4 The Processor 27 Implemetig Ucd l Brach Jump 2 address Jump uses word address Update PC with cocateatio of Top 4 bits of old PC 26-bit jump address 00 31:26 25:0 Need a extra cotrol sigal decoded from opcode Chapter 4 The Processor 28

register file Not feasible to vary period for differet istructios Violates desig priciple

15 Datapath With B Added Chapter 4 The Processor 29 Performace Issues Logest delay determies clock period Critical path: load istructio Istructio memory register file ALU data memory register file Not feasible to vary period for differet istructios Violates desig priciple Makig the commo case fast We will improve performace by pipeliig Chapter 4 The Processor 30

Pipeliig Aalogy Pipelied laudry: overlappig executio Parallelism improves performace Four loads: Speedup = 8/3.5 = 2.3 No-stop: Speedup = 2/0.5 + 1.5 4 = umber of stages 4.

16 Pipeliig Aalogy Pipelied laudry: overlappig executio Parallelism improves performace Four loads: Speedup = 8/3.5 = 2.3 No-stop: Speedup = 2/ = umber of stages 4.5 A Overview of Pipeliig Chapter 4 The Processor 31 LEGv8 Pipelie Five stages, oe step per stage 1. IF: Istructio fetch from memory 2. ID: Istructio decode & register read 3. EX: Execute operatio or calculate address 4. MEM: Access memory operad 5. WB: Write result back to register Chapter 4 The Processor 32

Pipelie Performace Assume time for stages is 100ps for register read or write 200ps for other stages Compare pipelied datapath with sigle-cycle datapath Istr Istr fetch Register read ALU op Memory

17 Pipelie Performace Assume time for stages is 100ps for register read or write 200ps for other stages Compare pipelied datapath with sigle-cycle datapath Istr Istr fetch Register read ALU op Memory access Register write Total time LDUR 200ps 100 ps 200ps 200ps 100 ps 800ps STUR 200ps 100 ps 200ps 200ps 700ps R-format 200ps 100 ps 200ps 100 ps 600ps CBZ 200ps 100 ps 200ps 500ps Chapter 4 The Processor 33 Pipelie Performace Sigle-cycle (T c = 800ps) Pipelied (T c = 200ps) Chapter 4 The Processor 34

18 Pipelie Speedup If all stages are balaced i.e., all take the same time Time betwee istructios pipelied = Time betwee istructios opipelied Number of stages If ot balaced, speedup is less Speedup due to icreased throughput Latecy (time for each istructio) does ot decrease Chapter 4 The Processor 35 Pipeliig ad ISA Desig LEGv8 ISA desiged for pipeliig All istructios are 32-bits Easier to fetch ad decode i oe cycle c.f. x86: 1- to 17-byte istructios Few ad regular istructio formats Ca decode ad read registers i oe step Load/store addressig Ca calculate address i 3 rd stage, access memory i 4 th stage Aligmet of memory operads Memory access takes oly oe cycle Chapter 4 The Processor 36

19 Hazards Situatios that prevet startig the ext istructio i the ext cycle Structure hazards A required resource is busy Data hazard Need to wait for previous istructio to complete its data read/write Cotrol hazard Decidig o cotrol actio depeds o previous istructio Chapter 4 The Processor 37 Structure Hazards Coflict for use of a resource I LEGv8 pipelie with a sigle memory Load/store requires data access Istructio fetch would have to stall for that cycle Would cause a pipelie bubble Hece, pipelied datapaths require separate istructio/data memories Or separate istructio/data caches Chapter 4 The Processor 38

20 Data Hazards A istructio depeds o completio of data access by a previous istructio ADD X19, X0, X1 SUB X2, X19, X3 Chapter 4 The Processor 39 Forwardig (aka Bypassig) Use result whe it is computed Do t wait for it to be stored i a register Requires extra coectios i the datapath Chapter 4 The Processor 40

21 Load-Use Data Hazard Ca t always avoid stalls by forwardig If value ot computed whe eeded Ca t forward backward i time! Chapter 4 The Processor 41 Code Schedulig to Avoid Stalls Reorder code to avoid use of load result i the ext istructio C code for A = B + E; C = B + F; LDUR X1, [X0,#0] LDUR X1, [X0,#0] LDUR X2, [X0,#8] LDUR X2, [X0,#8] stall ADD STUR X3, X1, X2 X3, [X0,#24] LDUR ADD X4, [X0,#16] X3, X1, X2 LDUR X4, [X0,#16] STUR X3, [X0,#24] stall ADD STUR X5, X1, X4 X5, [X0,#32] ADD STUR X5, X1, X4 X5, [X0,#32] 13 cycles 11 cycles Chapter 4 The Processor 42

22 Cotrol Hazards Brach determies flow of cotrol Fetchig ext istructio depeds o brach outcome Pipelie ca t always fetch correct istructio Still workig o ID stage of brach I LEGv8 pipelie Need to compare registers ad compute target early i the pipelie Add hardware to do it i ID stage Chapter 4 The Processor 43 Stall o Brach Wait util brach outcome determied before fetchig ext istructio Chapter 4 The Processor 44

23 Brach Predictio Loger pipelies ca t readily determie brach outcome early Stall pealty becomes uacceptable Predict outcome of brach Oly stall if predictio is wrog I LEGv8 pipelie Ca predict braches ot take Fetch istructio after brach, with o delay Chapter 4 The Processor 45 More-Realistic Brach Predictio Static brach predictio Based o typical brach behavior Example: loop ad if-statemet braches Predict backward braches take Predict forward braches ot take Dyamic brach predictio Hardware measures actual brach behavior e.g., record recet history of each brach Assume future behavior will cotiue the tred Whe wrog, stall while re-fetchig, ad update history Chapter 4 The Processor 46

24 Pipelie Summary The BIG Picture Pipeliig improves performace by icreasig istructio throughput Executes multiple istructios i parallel Each istructio has the same latecy Subject to hazards Structure, data, cotrol Istructio set desig affects complexity of pipelie implemetatio Chapter 4 The Processor 47 LEGv8 Pipelied Datapath 4.6 Pipelied Datapath ad Cotrol MEM Right-to-left flow leads to hazards WB Chapter 4 The Processor 48

Sigle-clock-cycle pipelie diagram Shows pipelie usage i a sigle cycle Highlight resources used c.f.

25 Pipelie registers Need registers betwee stages To hold iformatio produced i previous cycle Chapter 4 The Processor 49 Pipelie Operatio Cycle-by-cycle flow of istructios through the pipelied datapath Sigle-clock-cycle pipelie diagram Shows pipelie usage i a sigle cycle Highlight resources used c.f. multi-clock-cycle diagram Graph of operatio over time We ll look at sigle-clock-cycle diagrams for load & store Chapter 4 The Processor 50

26 IF for Load, Store, Chapter 4 The Processor 51 ID for Load, Store, Chapter 4 The Processor 52

27 EX for Load Chapter 4 The Processor 53 MEM for Load Chapter 4 The Processor 54

28 WB for Load Wrog register umber Chapter 4 The Processor 55 Corrected Datapath for Load Chapter 4 The Processor 56

29 EX for Store Chapter 4 The Processor 57 MEM for Store Chapter 4 The Processor 58

30 WB for Store Chapter 4 The Processor 59 Multi-Cycle Pipelie Diagram Form showig resource usage Chapter 4 The Processor 60

31 Multi-Cycle Pipelie Diagram Traditioal form Chapter 4 The Processor 61 Sigle-Cycle Pipelie Diagram State of pipelie i a give cycle Chapter 4 The Processor 62

32 Pipelied Cotrol (Simplified) Chapter 4 The Processor 63 Pipelied Cotrol Cotrol sigals derived from istructio As i sigle-cycle implemetatio Chapter 4 The Processor 64

STUR X15,[X2,#100] We ca resolve hazards with forwardig How do we detect whe

33 Pipelied Cotrol Chapter 4 The Processor 65 Data Hazards i ALU Istructios Cosider this sequece: SUB X2, X1,X3 AND X12,X2,X5 OR X13,X6,X2 ADD X14,X2,X2 STUR X15,[X2,#100] We ca resolve hazards with forwardig How do we detect whe to forward? 4.7 Data Hazards: Forwardig vs. Stallig Chapter 4 The Processor 66

34 Depedecies & Forwardig Chapter 4 The Processor 67 Detectig the Need to Forward Pass register umbers alog pipelie e.g., ID/EX.RegisterRs = register umber for Rs sittig i ID/EX pipelie register ALU operad register umbers i EX stage are give by ID/EX.RegisterR1, ID/EX.RegisterRm2 Data hazards whe 1a. EX/MEM.RegisterRd = ID/EX.RegisterR1 1b. EX/MEM.RegisterRd = ID/EX.RegisterRm2 2a. MEM/WB.RegisterRd = ID/EX.RegisterR1 2b. MEM/WB.RegisterRd = ID/EX.RegisterRm2 Fwd from EX/MEM pipelie reg Fwd from MEM/WB pipelie reg Chapter 4 The Processor 68

35 Detectig the Need to Forward But oly if forwardig istructio will write to a register! EX/MEM.RegWrite, MEM/WB.RegWrite Ad oly if Rd for that istructio is ot XZR EX/MEM.RegisterRd 31, MEM/WB.RegisterRd 31 Chapter 4 The Processor 69 Forwardig Paths Chapter 4 The Processor 70

36 Forwardig Coditios Mux cotrol Source Explaatio ForwardA = 00 ID/EX The first ALU operad comes from the register file. ForwardA = 10 EX/MEM The first ALU operad is forwarded from the prior ALU result. ForwardA = 01 MEM/WB The first ALU operad is forwarded from data memory or a earlier ALU result. ForwardB = 00 ID/EX The secod ALU operad comes from the register file. ForwardB = 10 EX/MEM The secod ALU operad is forwarded from the prior ALU result. ForwardB = 01 MEM/WB The secod ALU operad is forwarded from data memory or a earlier ALU result. Chapter 4 The Processor 71 Double Data Hazard Cosider the sequece: add X1,X1,X2 add X1,X1,X3 add X1,X1,X4 Both hazards occur Wat to use the most recet Revise MEM hazard coditio Oly fwd if EX hazard coditio is t true Chapter 4 The Processor 72

37 Revised Forwardig Coditio MEM hazard if (MEM/WB.RegWrite ad (MEM/WB.RegisterRd 31) ad ot(ex/mem.regwrite ad (EX/MEM.RegisterRd 31) ad (EX/MEM.RegisterRd ID/EX.RegisterR1)) ad (MEM/WB.RegisterRd = ID/EX.RegisterR1)) ForwardA = 01 if (MEM/WB.RegWrite ad (MEM/WB.RegisterRd 31) ad ot(ex/mem.regwrite ad (EX/MEM.RegisterRd 31) ad (EX/MEM.RegisterRd ID/EX.RegisterRm2)) ad (MEM/WB.RegisterRd = ID/EX.RegisterRm2)) ForwardB = 01 Chapter 4 The Processor 73 Datapath with Forwardig Chapter 4 The Processor 74

38 Load-Use Hazard Detectio Check whe usig istructio is decoded i ID stage ALU operad register umbers i ID stage are give by IF/ID.RegisterR1, IF/ID.RegisterRm2 Load-use hazard whe ID/EX.MemRead ad ((ID/EX.RegisterRd = IF/ID.RegisterR1) or (ID/EX.RegisterRd = IF/ID.RegisterRm1)) If detected, stall ad isert bubble Chapter 4 The Processor 75 How to Stall the Pipelie Force cotrol values i ID/EX register to 0 EX, MEM ad WB do op (o-operatio) Prevet update of PC ad IF/ID register Usig istructio is decoded agai Followig istructio is fetched agai 1-cycle stall allows MEM to read data for LDUI Ca subsequetly forward to EX stage Chapter 4 The Processor 76

39 Load-Use Data Hazard Stall iserted here Chapter 4 The Processor 77 Datapath with Hazard Detectio Chapter 4 The Processor 78

40 Stalls ad Performace The BIG Picture Stalls reduce performace But are required to get correct results Compiler ca arrage code to avoid hazards ad stalls Requires kowledge of the pipelie structure Chapter 4 The Processor 79 Brach Hazards If brach outcome determied i MEM 4.8 Cotrol Hazards Flush these istructios (Set cotrol values to 0) PC Chapter 4 The Processor 80

41 Reducig Brach Delay Move hardware to determie outcome to ID stage Target address adder Register comparator Example: brach take 36: SUB X10, X4, X8 40: CBZ X1, X3, 8 44: AND X12, X2, X5 48: ORR X13, X2, X6 52: ADD X14, X4, X2 56: SUB X15, X6, X : LDUR X4, [X7,#50] Chapter 4 The Processor 81 Example: Brach Take Chapter 4 The Processor 82

brach istructio addresses Stores outcome (take/ot take) To execute a brach Check table, expect the same

42 Example: Brach Take Chapter 4 The Processor 83 Dyamic Brach Predictio I deeper ad superscalar pipelies, brach pealty is more sigificat Use dyamic predictio Brach predictio buffer (aka brach history table) Idexed by recet brach istructio addresses Stores outcome (take/ot take) To execute a brach Check table, expect the same outcome Start fetchig from fall-through or target If wrog, flush pipelie ad flip predictio Chapter 4 The Processor 84

The mispredict as ot take o first iteratio of ier loop ext time aroud Chapter 4 The

43 1-Bit Predictor: Shortcomig Ier loop braches mispredicted twice! outer: ier: CBZ,, ier CBZ,, outer Mispredict as take o last iteratio of ier loop The mispredict as ot take o first iteratio of ier loop ext time aroud Chapter 4 The Processor 85 2-Bit Predictor Oly chage predictio o two successive mispredictios Chapter 4 The Processor 86

44 Calculatig the Brach Target Eve with predictor, still eed to calculate the target address 1-cycle pealty for a take brach Brach target buffer Cache of target addresses Idexed by PC whe istructio fetched If hit ad istructio is brach predicted take, ca fetch target immediately Chapter 4 The Processor 87 Exceptios ad Iterrupts Uexpected evets requirig chage i flow of cotrol Differet ISAs use the terms differetly Exceptio Arises withi the CPU Iterrupt e.g., udefied opcode, overflow, syscall, From a exteral I/O cotroller Dealig with them without sacrificig performace is hard 4.9 Exceptios Chapter 4 The Processor 88

45 Hadlig Exceptios Save PC of offedig (or iterrupted) istructio I LEGv8: Exceptio Lik Register (ELR) Save idicatio of the problem I LEGv8: Exceptio Sydrome Rregister (ESR) We ll assume 1-bit 0 for udefied opcode, 1 for overflow Chapter 4 The Processor 89 A Alterate Mechaism Vectored Iterrupts Hadler address determied by the cause Exceptio vector address to be added to a vector table base register: Ukow Reaso: Overflow: : Istructios either Deal with the iterrupt, or Jump to real hadler two two two Chapter 4 The Processor 90

46 Hadler Actios Read cause, ad trasfer to relevat hadler Determie actio required If restartable Take corrective actio use EPC to retur to program Otherwise Termiate program Report error usig EPC, cause, Chapter 4 The Processor 91 Exceptios i a Pipelie Aother form of cotrol hazard Cosider overflow o add i EX stage ADD X1, X2, X1 Prevet X1 from beig clobbered Complete previous istructios Flush add ad subsequet istructios Set ESR ad ELR register values Trasfer cotrol to hadler Similar to mispredicted brach Use much of the same hardware Chapter 4 The Processor 92

to the istructio Refetched ad executed from scratch PC saved i ELR register

47 Pipelie with Exceptios Chapter 4 The Processor 93 Exceptio Properties Restartable exceptios Pipelie ca flush the istructio Hadler executes, the returs to the istructio Refetched ad executed from scratch PC saved i ELR register Idetifies causig istructio Actually PC + 4 is saved Hadler must adjust Chapter 4 The Processor 94

48 Exceptio Example Exceptio o ADD i 40 SUB X11, X2, X4 44 AND X12, X2, X5 48 ORR X13, X2, X6 4C ADD X1, X2, X1 50 SUB X15, X6, X7 54 LDUR X16, [X7,#100] Hadler STUR X26, [X0,#1000] STUR X27, [X0,#1008] Chapter 4 The Processor 95 Exceptio Example Chapter 4 The Processor 96

Exceptio Example Chapter 4 The Processor 97 Multiple Exceptios Pipeliig overlaps multiple istructios Could have multiple exceptios at oce Simple approach: deal with exceptio from earliest

49 Exceptio Example Chapter 4 The Processor 97 Multiple Exceptios Pipeliig overlaps multiple istructios Could have multiple exceptios at oce Simple approach: deal with exceptio from earliest istructio Flush subsequet istructios Precise exceptios I complex pipelies Multiple istructios issued per cycle Out-of-order completio Maitaiig precise exceptios is difficult! Chapter 4 The Processor 98

50 Imprecise Exceptios Just stop pipelie ad save state Icludig exceptio cause(s) Let the hadler work out Which istructio(s) had exceptios Which to complete or flush May require maual completio Simplifies hardware, but more complex hadler software Not feasible for complex multiple-issue out-of-order pipelies Chapter 4 The Processor 99 Istructio-Level Parallelism (ILP) Pipeliig: executig multiple istructios i parallel To icrease ILP Deeper pipelie Less work per stage Þ shorter clock cycle Multiple issue Replicate pipelie stages Þ multiple pipelies Start multiple istructios per clock cycle CPI < 1, so use Istructios Per Cycle (IPC) E.g., 4GHz 4-way multiple-issue 16 BIPS, peak CPI = 0.25, peak IPC = 4 But depedecies reduce this i practice 4.10 Parallelism via Istructios Chapter 4 The Processor 100

51 Multiple Issue Static multiple issue Compiler groups istructios to be issued together Packages them ito issue slots Compiler detects ad avoids hazards Dyamic multiple issue CPU examies istructio stream ad chooses istructios to issue each cycle Compiler ca help by reorderig istructios CPU resolves hazards usig advaced techiques at rutime Chapter 4 The Processor 101 Speculatio Guess what to do with a istructio Start operatio as soo as possible Check whether guess was right If so, complete the operatio If ot, roll-back ad do the right thig Commo to static ad dyamic multiple issue Examples Speculate o brach outcome Roll back if path take is differet Speculate o load Roll back if locatio is updated Chapter 4 The Processor 102

52 Compiler/Hardware Speculatio Compiler ca reorder istructios e.g., move load before brach Ca iclude fix-up istructios to recover from icorrect guess Hardware ca look ahead for istructios to execute Buffer results util it determies they are actually eeded Flush buffers o icorrect speculatio Chapter 4 The Processor 103 Speculatio ad Exceptios What if exceptio occurs o a speculatively executed istructio? e.g., speculative load before ull-poiter check Static speculatio Ca add ISA support for deferrig exceptios Dyamic speculatio Ca buffer exceptios util istructio completio (which may ot occur) Chapter 4 The Processor 104

53 Static Multiple Issue Compiler groups istructios ito issue packets Group of istructios that ca be issued o a sigle cycle Determied by pipelie resources required Thik of a issue packet as a very log istructio Specifies multiple cocurret operatios Þ Very Log Istructio Word (VLIW) Chapter 4 The Processor 105 Schedulig Static Multiple Issue Compiler must remove some/all hazards Reorder istructios ito issue packets No depedecies with a packet Possibly some depedecies betwee packets Varies betwee ISAs; compiler must kow! Pad with op if ecessary Chapter 4 The Processor 106

WB + 4 Load/store IF ID EX MEM WB + 8 ALU/brach IF ID EX MEM WB + 12 Load/store IF ID EX MEM WB + 16 ALU/brach IF ID

54 LEGv8 with Static Dual Issue Two-issue packets Oe ALU/brach istructio Oe load/store istructio 64-bit aliged ALU/brach, the load/store Pad a uused istructio with op Address Istructio type Pipelie Stages ALU/brach IF ID EX MEM WB + 4 Load/store IF ID EX MEM WB + 8 ALU/brach IF ID EX MEM WB + 12 Load/store IF ID EX MEM WB + 16 ALU/brach IF ID EX MEM WB + 20 Load/store IF ID EX MEM WB Chapter 4 The Processor 107 LEGv8 with Static Dual Issue Chapter 4 The Processor 108

55 Hazards i the Dual-Issue LEGv8 More istructios executig i parallel EX data hazard Forwardig avoided stalls with sigle-issue Now ca t use ALU result i load/store i same packet ADD X0, X0, X1 LDUR X2, [X0,#0] Split ito two packets, effectively a stall Load-use hazard Still oe cycle use latecy, but ow two istructios More aggressive schedulig required Chapter 4 The Processor 109 Schedulig Example Schedule this for dual-issue LEGv8 Loop: LDUR X0, [X20,#0] ADD X0, X0,X21 STUR X0, [X20,#0] SUBI X20, X20,#4 CMP X20, X22 BGT Loop // X0=array elemet // add scalar i X21 // store result // decremet poiter // brach $s1!=0 ALU/brach Load/store cycle Loop: op LDUR X0, [X20,#0] 1 SUBI X20, X20,#4 op 2 ADD X0, X0,X21 op 3 CMP X20, X22 sw $t0, 4($s1) 4 BGT Loop STUR X0, [X20,#0] 5 IPC = 7/6 = 1.17 (c.f. peak IPC = 2) Chapter 4 The Processor 110

56 Loop Urollig Replicate loop body to expose more parallelism Reduces loop-cotrol overhead Use differet registers per replicatio Called register reamig Avoid loop-carried ati-depedecies Store followed by a load of the same register Aka ame depedece Reuse of a register ame Chapter 4 The Processor 111 Loop Urollig Example ALU/brach Load/store cycle Loop: SUBI X20, X20,#32 LDUR X0, [X20,#0] 1 op LDUR X1, [X20,#24] 2 ADD X0, X0, X21 LDUR X2, [X20,#16] 3 ADD X1, X1, X21 LDUR X3, [X20,#8] 4 ADD X2, X2, X21 STUR X0, [X20,#32] 5 ADD X3, X3, X21 sw X1, [X20,#24] 6 CMP X20,X22 sw X2, [X20,#16] 7 BGT Loop sw X3, [X20,#8] 8 IPC = 15/8 = Closer to 2, but at cost of registers ad code size Chapter 4 The Processor 112

57 Dyamic Multiple Issue Superscalar processors CPU decides whether to issue 0, 1, 2, each cycle Avoidig structural ad data hazards Avoids the eed for compiler schedulig Though it may still help Code sematics esured by the CPU Chapter 4 The Processor 113 Dyamic Pipelie Schedulig Allow the CPU to execute istructios out of order to avoid stalls But commit result to registers i order Example LDUR X0, [X21,#20] ADD X1, X0, X2 SUB X23,X23,X3 ANDI X5, X23,#20 Ca start sub while ADD is waitig for LDUI Chapter 4 The Processor 114

Dyamically Scheduled CPU Preserves depedecies Hold pedig operads Results also set to ay waitig reservatio statios Reorders buffer for register writes Ca

istructio issue to reservatio statio If operad is available i register file or reorder buffer Copied to reservatio statio No loger required i the register; ca

58 Dyamically Scheduled CPU Preserves depedecies Hold pedig operads Results also set to ay waitig reservatio statios Reorders buffer for register writes Ca supply operads for issued istructios Chapter 4 The Processor 115 Register Reamig Reservatio statios ad reorder buffer effectively provide register reamig O istructio issue to reservatio statio If operad is available i register file or reorder buffer Copied to reservatio statio No loger required i the register; ca be overwritte If operad is ot yet available It will be provided to the reservatio statio by a fuctio uit Register update may ot be required Chapter 4 The Processor 116

59 Speculatio Predict brach ad cotiue issuig Do t commit util brach outcome determied Load speculatio Avoid load ad cache miss delay Predict the effective address Predict loaded value Load before completig outstadig stores Bypass stored values to load uit Do t commit load util speculatio cleared Chapter 4 The Processor 117 Why Do Dyamic Schedulig? Why ot just let the compiler schedule code? Not all stalls are predicable e.g., cache misses Ca t always schedule aroud braches Brach outcome is dyamically determied Differet implemetatios of a ISA have differet latecies ad hazards Chapter 4 The Processor 118

60 Does Multiple Issue Work? The BIG Picture Yes, but ot as much as we d like Programs have real depedecies that limit ILP Some depedecies are hard to elimiate e.g., poiter aliasig Some parallelism is hard to expose Limited widow size durig istructio issue Memory delays ad limited badwidth Hard to keep pipelies full Speculatio ca help if doe well Chapter 4 The Processor 119 Power Efficiecy Complexity of dyamic schedulig ad speculatios requires power Multiple simpler cores may be better Microprocessor Year Clock Rate Pipelie Stages Issue width Out-of-order/ Speculatio Cores i MHz 5 1 No 1 5W Power Petium MHz 5 2 No 1 10W Petium Pro MHz 10 3 Yes 1 29W P4 Willamette MHz 22 3 Yes 1 75W P4 Prescott MHz 31 3 Yes 1 103W Core MHz 14 4 Yes 2 75W UltraSparc III MHz 14 4 No 1 90W UltraSparc T MHz 6 1 No 8 70W Chapter 4 The Processor 120

61 Cortex A53 ad Itel i7 Processor ARM A53 Itel Core i7 920 Market Persoal Mobile Device Server, cloud Thermal desig power 100 milliwatts (1 1 GHz) 130 Watts Clock rate 1.5 GHz 2.66 GHz Cores/Chip 4 (cofigurable) 4 Floatig poit? Yes Yes Multiple issue? Dyamic Dyamic Peak istructios/clock cycle 2 4 Pipelie stages 8 14 Pipelie schedule Static i-order Dyamic out-of-order with speculatio Brach predictio Hybrid 2-level 1 st level caches/core KiB I, KiB D 32 KiB I, 32 KiB D 2 d level caches/core KiB 256 KiB (per core) 3 rd level caches (shared) (platform depedet) 2-8 MB Chapter 4 The Processor Real Stuff: The ARM Cortex-A8 ad Itel Core i7 Pipelies ARM Cortex-A53 Pipelie Chapter 4 The Processor 122

62 ARM Cortex-A53 Performace Chapter 4 The Processor 123 Core i7 Pipelie Chapter 4 The Processor 124

63 Core i7 Performace Chapter 4 The Processor 125 Matrix Multiply Urolled C code 1 #iclude <x86itri.h> 2 #defie UNROLL (4) 3 4 void dgemm (it, double* A, double* B, double* C) 5 { 6 for ( it i = 0; i < ; i+=unroll*4 ) 7 for ( it j = 0; j < ; j++ ) { 8 m256d c[4]; 9 for ( it x = 0; x < UNROLL; x++ ) 10 c[x] = _mm256_load_pd(c+i+x*4+j*); for( it k = 0; k < ; k++ ) 13 { 14 m256d b = _mm256_broadcast_sd(b+k+j*); 15 for (it x = 0; x < UNROLL; x++) 16 c[x] = _mm256_add_pd(c[x], 17 _mm256_mul_pd(_mm256_load_pd(a+*k+x*4+i), b)); 18 } for ( it x = 0; x < UNROLL; x++ ) 21 _mm256_store_pd(c+i+x*4+j*, c[x]); 22 } 23 } 4.12 Istructio-Level Parallelism ad Matrix Multiply Chapter 4 The Processor 126

64 Matrix Multiply Assembly code: 1 vmovapd (%r11),%ymm4 # Load 4 elemets of C ito %ymm4 2 mov %rbx,%rax # register %rax = %rbx 3 xor %ecx,%ecx # register %ecx = 0 4 vmovapd 0x20(%r11),%ymm3 # Load 4 elemets of C ito %ymm3 5 vmovapd 0x40(%r11),%ymm2 # Load 4 elemets of C ito %ymm2 6 vmovapd 0x60(%r11),%ymm1 # Load 4 elemets of C ito %ymm1 7 vbroadcastsd (%rcx,%r9,1),%ymm0 # Make 4 copies of B elemet 8 add $0x8,%rcx # register %rcx = %rcx vmulpd (%rax),%ymm0,%ymm5 # Parallel mul %ymm1,4 A elemets 10 vaddpd %ymm5,%ymm4,%ymm4 # Parallel add %ymm5, %ymm4 11 vmulpd 0x20(%rax),%ymm0,%ymm5 # Parallel mul %ymm1,4 A elemets 12 vaddpd %ymm5,%ymm3,%ymm3 # Parallel add %ymm5, %ymm3 13 vmulpd 0x40(%rax),%ymm0,%ymm5 # Parallel mul %ymm1,4 A elemets 14 vmulpd 0x60(%rax),%ymm0,%ymm0 # Parallel mul %ymm1,4 A elemets 15 add %r8,%rax # register %rax = %rax + %r8 16 cmp %r10,%rcx # compare %r8 to %rax 17 vaddpd %ymm5,%ymm2,%ymm2 # Parallel add %ymm5, %ymm2 18 vaddpd %ymm0,%ymm1,%ymm1 # Parallel add %ymm0, %ymm1 19 je 68 <dgemm+0x68> # jump if ot %r8!= %rax 20 add $0x1,%esi # register % esi = % esi vmovapd %ymm4,(%r11) # Store %ymm4 ito 4 C elemets 22 vmovapd %ymm3,0x20(%r11) # Store %ymm3 ito 4 C elemets 23 vmovapd %ymm2,0x40(%r11) # Store %ymm2 ito 4 C elemets 24 vmovapd %ymm1,0x60(%r11) # Store %ymm1 ito 4 C elemets 4.12 Istructio-Level Parallelism ad Matrix Multiply Chapter 4 The Processor 127 Performace Impact Chapter 4 The Processor 128

65 Fallacies Pipeliig is easy (!) The basic idea is easy The devil is i the details e.g., detectig data hazards Pipeliig is idepedet of techology So why have t we always doe pipeliig? More trasistors make more advaced techiques feasible Pipelie-related ISA desig eeds to take accout of techology treds e.g., predicated istructios 4.14 Fallacies ad Pitfalls Chapter 4 The Processor 129 Pitfalls Poor ISA desig ca make pipeliig harder e.g., complex istructio sets (VAX, IA-32) Sigificat overhead to make pipeliig work IA-32 micro-op approach e.g., complex addressig modes Register update side effects, memory idirectio e.g., delayed braches Advaced pipelies have log delay slots Chapter 4 The Processor 130

66 Cocludig Remarks ISA iflueces desig of datapath ad cotrol Datapath ad cotrol ifluece desig of ISA Pipeliig improves istructio throughput usig parallelism More istructios completed per secod Latecy for each istructio ot reduced Hazards: structural, data, cotrol Multiple issue ad dyamic schedulig (ILP) Depedecies limit achievable parallelism Complexity leads to the power wall 4.14 Cocludig Remarks Chapter 4 The Processor 131

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Review Istructio Set Architecture Istructio Set The repertoire of istructios of a computer Differet computers have differet istructio