Non-pipelined Multicycle processors


1 Non-pipelined Multicycle processors
Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology
Code for the lecture is available on the course website under the code tab.
March 22, L12-1

2 Single-Cycle RISC Processor
As an illustrative example, we use a subset of the RISC-V 32-bit ISA.
[Diagram: PC, Inst Memory, Decode, Register File (2 read & 1 write ports), Execute, Data Memory; separate Instruction & Data memories]
The datapath (the arrows in this diagram) is derived automatically from a high-level rule-based description.

3 Single Cycle Implementation

    module mkproc(Empty);
        // instantiate the state
        Reg#(Word)  pc   <- mkReg(0);
        RFile2R1W   rf   <- mkRFile2R1W;
        MagicMemory imem <- mkMagicMemory;
        MagicMemory dmem <- mkMagicMemory;

        rule doproc;
            let inst <- imem.req(MemReq{op: Ld, addr: pc, data: ?});
            // extract the fields
            let dinst = decode(inst);
            // dinst fields: itype, alufunc, brfunc, dst, rs1, rs2, imm
            // read the register file
            let rval1 = rf.rd1(dinst.rs1);
            let rval2 = rf.rd2(dinst.rs2);
            // produces values needed to update the processor state
            let einst = execute(dinst, rval1, rval2, pc);
            // einst fields: itype, rd, data, addr, nextpc
            // actions to update the processor state
            updatestate(einst, pc, rf, dmem);
        endrule
    endmodule

4 Processor Interface
For testing, our processor is connected to a host computer* which can read and write the memory of our processor directly.
- Our processor's memory is preloaded with program and data; it always starts at pc=0.
- When the program terminates, it writes a 0 in a predetermined place and stops the simulation.
- If the program hits an illegal or unsupported instruction, it dumps the processor state and stops the simulation.
Consequently the processor interface has no methods, i.e., its interface is Empty!
*In a simulation environment the host computer is the same computer on which the simulator runs.

5 Understanding generated hardware
[Diagram: dinst, rs1, rs2, RF, rval1, rval2, ALU, constant]
Not all instructions have both rs1 and rs2 fields, but there is no harm or cost in reading unused registers; we never use the results of undefined fields.

    let rval1 = rf.rd1(dinst.rs1);
    let rval2 = rf.rd2(dinst.rs2);

When the same function is called with two different arguments, a mux is generated automatically.
How expensive is a mux? Its area is proportional to the number of bits.

6 Understanding generated hardware - continued

    function ExecInst execute(DecodedInst dinst, Word rval1, Word rval2, Word pc);
        // extract from dinst: itype, alufunc, brfunc, imm
        // initialize einst and its fields: data, nextpc, addr to ?
        case (itype) matches
            OP:     begin data = alu(rval1, rval2, alufunc); nextpc = pc+4; end
            OPIMM:  begin data = alu(rval1, imm, alufunc); nextpc = pc+4; end
            BRANCH: begin nextpc = alubr(rval1, rval2, brfunc) ? pc+imm : pc+4; end
            LUI:    begin data = imm; nextpc = pc+4; end
            JAL:    begin data = pc+4; nextpc = pc+imm; end
            JALR:   begin data = pc+4; nextpc = (rval1+imm) & ~1; end
            LOAD:   begin addr = rval1+imm; nextpc = pc+4; end
            STORE:  begin addr = rval1+imm; data = rval2; nextpc = pc+4; end
        endcase
        // assign to einst;
    endfunction

Annotation on one of the additions in the code: we could use the alu for this addition. Reuse alu?
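The next-PC selection in the case arms above can be sketched in software. This is a minimal Python model, not the course's BSV: the function name `next_pc`, the string instruction types, and the `taken` flag (standing in for the result of `alubr`) are assumptions made for illustration, and Word is assumed to be 32 bits wide.

```python
MASK32 = 0xFFFFFFFF  # assumption: Word is 32 bits, so results wrap modulo 2**32


def next_pc(itype, pc, rval1=0, imm=0, taken=False):
    """Model the nextpc choice made by each case arm of execute.

    `taken` is a hypothetical stand-in for alubr(rval1, rval2, brfunc).
    """
    if itype == "BRANCH":
        target = pc + imm if taken else pc + 4
    elif itype == "JAL":
        target = pc + imm
    elif itype == "JALR":
        target = (rval1 + imm) & ~1  # clear bit 0 of the computed target
    elif itype in ("OP", "OPIMM", "LUI", "LOAD", "STORE"):
        target = pc + 4
    else:
        raise ValueError(f"unsupported instruction type: {itype}")
    return target & MASK32


if __name__ == "__main__":
    print(hex(next_pc("JALR", pc=0, rval1=0x1001, imm=4)))  # 0x1004
    print(next_pc("BRANCH", pc=16, imm=-8, taken=True))     # 8
```

Only JAL, JALR and taken branches deviate from pc+4, which is why the pc+4 adder can be a much simpler circuit than a general adder.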

7 Reusing combinational logic

    case (itype) matches
        OP:    data = alu(rval1, rval2, alufunc);
        OPIMM: data = alu(rval1, imm, alufunc);
        ...

The two uses of alu are mutually exclusive, and the BSV compiler/backend tools actually share the same alu circuit. (However, one can't be sure of such things.)
There are ways of forcing the alu reuse by turning it into the method call of a module...
Reuse is not necessarily a good idea because it prevents specialization: the circuit for pc+4 has far fewer gates than the circuit for pc+imm.
Generally we won't concern ourselves with the sharing of combinational circuits.

8 Plan
- Structural hazards and Princeton architecture
- Realistic req/resp memory system
- Multicycle processor with a realistic memory
- Multicycle functional units

9 Princeton Architecture
Instructions and data reside in the same memory.
Memory instructions can't be executed in one cycle!
[Diagram: Register File, PC, Decode, Execute, Inst Memory, Data Memory]
Such resource conflicts are known as structural hazards and require a multicycle implementation.
Usually extra registers are required to hold values between cycles.

10 Princeton Architecture: introduce intermediate state
[Diagram: state, Register File, PC, f2d, Decode, Execute, Magic Memory]
- Insert an f2d register to hold the fetched instruction.
- Every instruction takes two cycles: Fetch followed by Execute.
- A one-bit register records the state of the instruction.

11 Princeton Architecture Two-cycle

    typedef enum {Fetch, Execute} State deriving (Bits, Eq);

    module mkprocprincetontwocycle(Empty);
        // instantiate pc, rf, magic mem
        Reg#(Word)  f2d   <- mkRegU;
        Reg#(State) state <- mkReg(Fetch);

        rule dofetch (state == Fetch);
            let inst <- mem.req(MemReq{op: Ld, addr: pc, data: ?});
            f2d <= inst;
            state <= Execute;
        endrule

        rule doexecute (state == Execute);
            // decode, execute, updatestate
            state <= Fetch;
        endrule
    endmodule

If state is Fetch, then fetch the instruction, put it in f2d, and change the state to Execute.
If state is Execute, then execute the instruction in f2d and change the state to Fetch.

12 doexecute rule reexamined

    rule doexecute (state == Execute);
        let inst = f2d;
        let dinst = decode(inst);
        let rval1 = rf.rd1(dinst.rs1);
        let rval2 = rf.rd2(dinst.rs2);
        let einst = execute(dinst, rval1, rval2, pc);
        updatestate(einst, pc, rf, mem);
        state <= Fetch;
    endrule

Execution of all instructions except loads and stores could be completed in the Fetch cycle itself, if we wanted (no structural hazard).


13 Princeton Architecture where only memory instructions take two cycles

    rule dofetch (state == FetchExecute);
        let inst <- mem.req(MemReq{op: Ld, addr: pc, data: ?});
        let dinst = decode(inst);
        let rval1 = rf.rd1(dinst.rs1);
        let rval2 = rf.rd2(dinst.rs2);
        let einst = execute(dinst, rval1, rval2, pc);
        if (einst.itype == LOAD || einst.itype == STORE) begin  // Load or Store
            // save the executed instruction state in the e2m register
            e2m <= einst;
            state <= MemoryAccess;
        end
        else begin  // no memory operation
            if (isValid(einst.rd)) rf.wr(fromMaybe(?, einst.rd), einst.data);
            pc <= einst.nextpc;
            state <= FetchExecute;
        end
    endrule

    rule domemoryaccess (state == MemoryAccess);
        updatestate(e2m, pc, rf, mem);
        state <= FetchExecute;
    endrule
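The two-state control above can be walked through with a small software model. This is a hedged Python sketch, not BSV: the function name `count_cycles` and the use of plain strings for instruction types and states are assumptions for illustration. It only tracks which rule fires in each cycle and does not execute instructions.

```python
def count_cycles(itypes):
    """Count cycles for a sequence of executed instruction types.

    Each instruction spends one cycle in FetchExecute; LOAD and STORE
    spend one further cycle in MemoryAccess before the next fetch.
    """
    state = "FetchExecute"
    cycles = 0
    for itype in itypes:
        assert state == "FetchExecute"
        cycles += 1                      # the FetchExecute rule fires
        if itype in ("LOAD", "STORE"):
            state = "MemoryAccess"       # instruction state saved in e2m
            cycles += 1                  # the MemoryAccess rule fires
            state = "FetchExecute"
    return cycles


if __name__ == "__main__":
    trace = ["OP", "LOAD", "STORE", "BRANCH"]
    print(count_cycles(trace))  # 6: two memory instructions cost one extra cycle each
```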

14 Performance implications
Suppose a fraction f of the N executed instructions are memory access instructions.
The two-cycle Princeton architecture will take 2N cycles.
How many cycles will the variable-cycle Princeton architecture take?
    2*f*N + (1-f)*N = (1+f)*N
However, the cycle count is not the whole story as far as performance is concerned. More to come later.

15 Realistic Memory Interface
Request/Response methods
[Diagram: memory with a req port (op, address, store data; en, rdy) and a resp port (load data; en, rdy)]
No response for Stores; Load responses come back in the requested order.

    typedef enum {Ld, St} MemOp deriving (Bits, Eq);
    typedef struct {MemOp op; Word addr; Word data;} MemReq deriving (Bits, Eq);

    interface Memory;
        method Action req(MemReq req);
        method ActionValue#(Word) resp();
    endinterface

Example uses:

    m.req(MemReq{op: Ld, addr: a, data: ?});
    m.req(MemReq{op: St, addr: a, data: v});
    let data <- m.resp();
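The ordering rules of the request/response interface can be illustrated with a small behavioral model. This is a Python sketch under stated assumptions, not the course's memory module: the class name `MemoryModel`, the `resp_ready` helper, reading 0 from unwritten addresses, and responses becoming available immediately are all choices made for illustration. A real memory would add latency between req and resp.

```python
from collections import deque


class MemoryModel:
    """Behavioral model: stores produce no response, loads respond in order."""

    def __init__(self):
        self._cells = {}            # address -> stored word
        self._responses = deque()   # pending load responses, oldest first

    def req(self, op, addr, data=None):
        if op == "St":
            self._cells[addr] = data                           # no response queued
        elif op == "Ld":
            self._responses.append(self._cells.get(addr, 0))   # assumption: default 0
        else:
            raise ValueError(f"unknown op: {op}")

    def resp_ready(self):
        """Plays the role of the rdy signal on the resp method."""
        return bool(self._responses)

    def resp(self):
        if not self._responses:
            raise RuntimeError("resp called while not ready")
        return self._responses.popleft()


if __name__ == "__main__":
    m = MemoryModel()
    m.req("St", 8, 42)
    m.req("Ld", 8)
    m.req("Ld", 12)
    print(m.resp(), m.resp())  # 42 0
```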

16 Princeton architecture with a realistic memory
[Diagram: state, Register File, PC, Decode, Execute, Memory (req, resp)]
- With a request/response memory, even instruction fetch cannot be completed in one cycle.
- Instruction fetch must be split into two rules: send request and receive response.
- Registers are needed to hold the state of a partially executed instruction.

17 Processor with realistic memory (Lab 5)

    module mkprocprincetonmulticycle(Empty);
        // instantiate registers to hold the state of a partially executed instruction

        rule dofetch (state == Fetch);
            // initiate instruction fetch; go to Execute
        endrule

        rule doexecute (state == Execute);
            // execute all instructions except memory instructions; go to Fetch
            // initiate memory access; go to LoadWait
        endrule

        rule doloadwait (state == LoadWait);
            // wait for the load value, update rf, go to Fetch
        endrule
    endmodule

18 Multicycle ALUs
[Diagram: state, Register File, PC, Decode, Execute, Memory (req, resp), plus a unit for multicycle or floating point ALU operations]
- Multicycle ALUs can be viewed as request/response modules.
- Instructions can be further classified after decoding as simple 1 cycle, multicycle (e.g., multiply) or memory access.

19 Processor with realistic memory and multicycle ALUs (Lab 5)

    module mkprocmulticycle(Empty);
        // instantiate registers to hold the state of a partially executed instruction

        rule dofetch (state == Fetch);
            // initiate instruction fetch; go to Execute
        endrule

        rule doexecute (state == Execute);
            // execute all instructions except memory and multicycle instructions; go to Fetch
            // initiate memory access; go to LoadWait
            // initiate multicycle instruction; go to MCWait
        endrule

        rule doloadwait (state == LoadWait);
            // wait for the load value, update rf, go to Fetch
        endrule

        rule domcwait (state == MCWait);
            // wait for MC value, update rf, go to Fetch
        endrule
    endmodule
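The state transitions named in the skeleton above can be traced per instruction class. This is a Python sketch, not the Lab 5 solution: the function name `state_trace` and the class labels are hypothetical, and it assumes a store returns to Fetch straight from Execute because the memory sends no response for stores. It lists only the four states named in the skeleton; a real design may split the work across rules differently, and waiting for a slow response adds idle cycles that are not modeled here.

```python
def state_trace(kind):
    """Return the states visited by one instruction of the given class.

    kind is one of "simple", "store", "load", "multicycle" (labels are
    illustrative, not taken from the course code).
    """
    trace = ["Fetch", "Execute"]
    if kind == "load":
        trace.append("LoadWait")   # wait for the load value, update rf
    elif kind == "multicycle":
        trace.append("MCWait")     # wait for the multicycle unit, update rf
    elif kind not in ("simple", "store"):
        raise ValueError(f"unknown instruction class: {kind}")
    return trace


if __name__ == "__main__":
    for kind in ("simple", "store", "load", "multicycle"):
        print(kind, "->", " -> ".join(state_trace(kind)), "-> Fetch")
```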

20 Clock speed
[Diagram: state, Register File, PC, Decode, Execute, Memory (req, resp)]
Clock speed depends upon the longest combinational path between two state elements.
Thus, in a single-cycle implementation:
    t_Clock > t_M + t_DEC + t_RF + t_ALU + t_M + t_WB
The clock in a two-cycle implementation may be faster:
    t_Clock > max{t_M, (t_DEC + t_RF + t_ALU + t_M + t_WB)}
However, this may not improve the performance, because now some instructions will take two cycles to execute.
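A small worked example makes the trade-off concrete. This Python sketch combines the two clock-period bounds above with the cycle counts N, 2N and (1+f)N from the earlier performance slide. The delay values are made-up numbers in arbitrary time units, the function names are hypothetical, and the clock period is taken to equal its lower bound, which is an idealization.

```python
def single_cycle_period(t):
    """Lower bound on the clock period of the single-cycle design."""
    return t["M"] + t["DEC"] + t["RF"] + t["ALU"] + t["M"] + t["WB"]


def two_cycle_period(t):
    """Lower bound on the clock period of the two-cycle design."""
    return max(t["M"], t["DEC"] + t["RF"] + t["ALU"] + t["M"] + t["WB"])


def run_times(t, n_total, n_mem):
    """Execution time = cycle count * clock period, for three designs."""
    return {
        "single_cycle": n_total * single_cycle_period(t),
        "two_cycle": 2 * n_total * two_cycle_period(t),
        "variable_cycle": (n_total + n_mem) * two_cycle_period(t),
    }


if __name__ == "__main__":
    delays = {"M": 2, "DEC": 1, "RF": 1, "ALU": 2, "WB": 1}  # assumed values
    print(single_cycle_period(delays))  # 9
    print(two_cycle_period(delays))     # 7
    print(run_times(delays, n_total=1000, n_mem=250))
    # {'single_cycle': 9000, 'two_cycle': 14000, 'variable_cycle': 8750}
```

With these assumed delays the fixed two-cycle design is slower than the single-cycle one despite its shorter clock, while the variable-cycle design is slightly faster; different delay values can change the ranking.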

21 Cycle counts
- Different instructions take different numbers of cycles in our designs.
- Depending upon the type of opcode, an instruction has to go through 1 to 4 processor-rule firings.
- The number of cycles between processor-rule firings is indeterminate and depends on how quickly the memory responds or a multicycle functional unit completes its work for the input data.
Next we will study how memory systems are organized internally.
