Digital Logic. Ch. 4 and Appendix C

Barbara Armstrong

1 Digital Logic Ch. 4 and Appendix C

2 Gates: The most obvious gates are AND and OR. Together with NOT, we can combine them to implement any logic function.

3 Conventions: Zero volts is logic 0 and 5 volts is logic 1, unless we use negative logic. Most computers use smaller voltages now: DDR3 memories use 1.5 volts, in which case 1.5 volts is logic 1. Due to electrical noise, the logic levels are defined by a range of voltages rather than a single value.

4 Other gates: The little circle means NOT. NOR gate (NOT OR). NAND gate (NOT AND). WHAT??? WHAT???

5 Truth Tables It is the opposite of an AND gate It is a NAND gate

6 Example Try to figure out what this does It is a one bit adder with carry in.

7 Simpler Drawing

8 Programmable Logic Arrays: PLA for short. The dots are really fuses inside a chip. Classic fuses can be programmed only once. A PLA can implement any logic function. Modern devices can be reprogrammed many times. PLAs on hormones are called Field Programmable Gate Arrays (FPGAs).

9 PLAs AND gate array OR gate array
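
The two arrays can be modelled in a few lines of Python. This is only a sketch: the function name `pla` and the way the "fuses" are encoded as dictionaries and lists are assumptions made for illustration, not something taken from the slides.

```python
def pla(inputs, and_plane, or_plane):
    """Evaluate a PLA.

    and_plane: one dict per product term, mapping input index -> required bit.
    or_plane:  one list per output, naming the product terms that are ORed.
    """
    products = [
        int(all(inputs[i] == bit for i, bit in term.items()))
        for term in and_plane
    ]
    return [int(any(products[t] for t in terms)) for terms in or_plane]


# Hypothetical programming for the function AB + BC + CA (inputs A, B, C).
AND_PLANE = [{0: 1, 1: 1}, {1: 1, 2: 1}, {2: 1, 0: 1}]
OR_PLANE = [[0, 1, 2]]

print(pla([1, 0, 1], AND_PLANE, OR_PLANE))  # [1]
```

Changing the contents of `AND_PLANE` and `OR_PLANE` plays the role of programming the fuses: the evaluation function itself never changes.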

10 Standard Components Decoders Multiplexers ROM

11 Decoder

12 Multiplexer

13 ROM

14 Boolean Algebra Laws
Identity Law: A+0=A, A*1=A
Zero & One Law: A+1=1, A*0=0
Existence of inverse: A+A' = 1, A*A' = 0
Commutative Law: A+B=B+A, A*B=B*A
Associative Law: A+(B+C)=(A+B)+C, A*(B*C)=(A*B)*C
Distributive Law: A*(B+C)=A*B+A*C, A+(B*C)=(A+B)*(A+C)

15 De Morgan's Law
(A+B)' = A' * B'
(A*B)' = A' + B'
Principle of Duality: AND and OR are symmetric, and so are 0 and 1. Swapping AND with OR and 0 with 1 in a valid identity gives another valid identity.
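
Because each variable takes only the values 0 and 1, any of these laws can be checked by brute force over all input combinations. The Python sketch below does that for De Morgan's law and the less familiar distributive law; the helper names `holds` and `inv` are illustrative and not part of the slides.

```python
from itertools import product


def holds(law, nvars):
    """True if law(...) holds for every assignment of 0/1 to its variables."""
    return all(law(*bits) for bits in product((0, 1), repeat=nvars))


def inv(x):
    """Complement of a single bit (written ' in the slides)."""
    return 1 - x


def de_morgan_or(a, b):
    return inv(a | b) == (inv(a) & inv(b))


def de_morgan_and(a, b):
    return inv(a & b) == (inv(a) | inv(b))


def distributive_or(a, b, c):
    return (a | (b & c)) == ((a | b) & (a | c))


print(holds(de_morgan_or, 2), holds(de_morgan_and, 2), holds(distributive_or, 3))
```

The same `holds` helper can be used to test whether two candidate expressions are equivalent before picking the cheaper one.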

16 Optimization: Two different logic expressions can have exactly the same behavior but different implementation costs. Choosing the cheapest one is optimization. We may also have to satisfy other criteria: propagation delay, no glitches, etc.

17 Optimization
AB + AB' = A(B+B') = A*1 = A
A'B'C + ABC = (A'B' + AB)C
            = ((A'B' + A)(A'B' + B))C
            = (B' + A)(A' + B)C

18 Half Adder
S = A'B + AB'
C = AB
A B | S C
0 0 | 0 0
0 1 | 1 0
1 0 | 1 0
1 1 | 0 1

19 Full Adder
S = A'B'C + A'BC' + AB'C' + ABC
Cout = ABC + A'BC + ABC' + AB'C
Cout = AB + BC + CA (optimized)
(Truth table: inputs A, B, C, where C is the carry in; outputs S, Cout)
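
A quick way to check full adder equations is to compare them with ordinary integer addition for all eight input combinations. The Python sketch below assumes S is the OR of the four minterms that contain an odd number of 1s and uses the optimized carry AB + BC + CA; the function name `full_adder` is illustrative.

```python
from itertools import product


def full_adder(a, b, c):
    """Sum and carry-out from the sum-of-products equations."""
    na, nb, nc = 1 - a, 1 - b, 1 - c
    # S is 1 exactly when an odd number of inputs are 1.
    s = (na & nb & c) | (na & b & nc) | (a & nb & nc) | (a & b & c)
    # Optimized carry: at least two inputs are 1.
    cout = (a & b) | (b & c) | (c & a)
    return s, cout


for a, b, c in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, c)
    assert 2 * cout + s == a + b + c  # the two output bits encode a + b + c
print("full adder equations agree with integer addition")
```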

20 Verilog A hardware description language Can be used to design, optimize and simulate hardware Started in the mid 80's as a hardware simulation system Hardware synthesis was added later Its main competitor is VHDL

21 What can Verilog do? Describe a circuit for simulation purposes. Many Verilog constructs are also synthesizable. It allows the designer to specify behavior and/or structure.

22 Structure of a Verilog Module Contains initial constructs Parallel blocks called always constructs Continuous assignments to specify combinational circuits (gates w/o memory) Instances of other modules

23 Elements of Verilog Wire: mathematical abstraction of a real wire Can have 4 possible values!! True or 1 False or 0 X: unknown (not yet defined, unconnected etc) Z: high impedance Electrically disconnected. A smart trick electronics engineers have invented.
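
The reason Z is useful is that a driver in the high impedance state lets another driver decide the value of a shared wire. The Python sketch below models how two drivers on one wire combine; the function name `resolve` and the use of one-character strings for the four values are assumptions made for illustration.

```python
def resolve(a, b):
    """Value seen on a wire driven by a and b, each one of '0', '1', 'x', 'z'."""
    if a == "z":          # a is disconnected, so b decides
        return b
    if b == "z":          # b is disconnected, so a decides
        return a
    return a if a == b else "x"   # two active drivers must agree


print(resolve("z", "1"), resolve("0", "1"), resolve("z", "z"))  # 1 x z
```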

24 Elements of Verilog Registers (reg) Are memory elements Verilog compiler may map them to actual memory elements (flip flops) Same set of possible values

25 Elements of Verilog: Constants
Can be specified as plain constants like 3 or 15
Often we want to specify the bit width of a constant:
4'b0011 is the 4-bit representation of 3
5'b00011 is the 5-bit representation of 3
-4'b0011 is the 4-bit representation of -3 (2's complement)
4'hF is the 4-bit representation of 15
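
To see which bit patterns such sized constants stand for, the Python sketch below truncates an integer to a given width, which yields the two's complement pattern for negative values. The helper name `sized` is hypothetical.

```python
def sized(width, value):
    """Bit pattern of value truncated to width bits (two's complement)."""
    return format(value & ((1 << width) - 1), f"0{width}b")


print(sized(4, 3))    # 0011
print(sized(5, 3))    # 00011
print(sized(4, -3))   # 1101
print(sized(4, 15))   # 1111
```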

26 Operators in Verilog
+, -, *, / like C
&, |, ~, ^ again like C
==, !=, <, >, <=, >= like C
<<, >> like C
cond ? expr1 : expr2 like C

27 Operators in Verilog: But Verilog adds to C
Unary &, |, ^ (reduction): apply the operator across all bits of the operand
{A,B}: the bits of A followed by the bits of B
{x{const}} is {const, const, ...} repeated x times
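
Python integers carry no bit width, so a sketch of these extra operators has to pass the width explicitly. The function names below are illustrative, and each value is assumed to already fit in the stated width.

```python
def reduce_and(x, width):
    """Unary &: 1 only if all width bits of x are 1."""
    return int(x == (1 << width) - 1)


def reduce_or(x, width):
    """Unary |: 1 if any bit of x is 1 (width kept for symmetry)."""
    return int(x != 0)


def reduce_xor(x, width):
    """Unary ^: parity of the bits of x (width kept for symmetry)."""
    return bin(x).count("1") & 1


def concat(a, wa, b, wb):
    """{A,B}: bits of a followed by bits of b; returns (value, width)."""
    return (a << wb) | b, wa + wb


def replicate(n, c, wc):
    """{n{c}}: c repeated n times; returns (value, width)."""
    value = 0
    for _ in range(n):
        value = (value << wc) | c
    return value, n * wc


print(reduce_and(0b1111, 4), reduce_xor(0b1011, 4))   # 1 1
print(concat(0b10, 2, 0b011, 3))                      # (19, 5), i.e. 10011
print(replicate(3, 0b01, 2))                          # (21, 6), i.e. 010101
```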

28 Combinational Circuits A network of gates Directed graph There should be no cycles Output determined exclusively by inputs Implement logic functions

29 Memory elements

30 Memory Elements We can think of memory elements as combinational circuits with feedback We would rather think of them as little black boxes Sometimes memory is implemented using other technologies (capacitors for DRAM)

31 Combinational Circuits
module half_adder(A, B, Sum, Carry);
  input A, B;
  output Sum, Carry;
  assign Sum = A ^ B;    // sum is the XOR of the inputs
  assign Carry = A & B;  // carry is the AND of the inputs
endmodule

32 Combinational Circuits: Use the assign keyword for continuous assignments, which represent permanent connections. The assign keyword can specify only combinational circuits. Combinational circuits can be specified with the always construct as well, and the always construct can also specify sequential circuits.

33 The always construct
module half_adder(A, B, S, C);
  input A, B;
  output reg S, C;
  always @(A, B)  // re-evaluate whenever an input changes
    begin
      case ({A, B})
        2'b00: begin S = 0; C = 0; end
        2'b01: begin S = 1; C = 0; end
        2'b10: begin S = 1; C = 0; end
        2'b11: begin S = 0; C = 1; end
      endcase
    end
endmodule

34 Combinational with always: The previous example used always to implement a half adder. It uses blocking assignments, which behave pretty much like assignments in C. If the block is properly defined, most compilers will not use flip flops to implement it. Properly defined means that all input signals are on the sensitivity list and every execution path assigns a value to the same bits.

35 Sequential Circuits Any circuit that contains memory If it contains memory then it has state If it has state then the state changes, so it goes through a sequence of states Hence the name sequential.

36 Sequential Circuits

37 Sequential Circuits: How come signals don't rush around the loop uncontrollably? This is where the clock comes in. It is the same clock you see on the specs of your CPU. With every clock pulse the signal goes around once. These are called synchronous sequential circuits. There are also asynchronous ones.

38 Typical Latch

39 Still... Unless the width of the clock pulse is wisely selected, the signal will travel around more than once. These latches are useful in some cases, but not good enough for our current task.

40 Falling edge triggered FF

41 Edge triggered D Flip Flop
module DFF(clock, D, Q, Qb);
  input clock, D;
  output reg Q;
  output Qb;
  assign Qb = ~Q;          // complemented output
  always @(negedge clock)  // falling edge, as on the previous slide
    Q <= D;
endmodule

42 Timings Timing is complex We use a simplified model Setup time: time the input to the FF has to be stable before the clock edge Hold time: time the input has to be stable after the clock edge

43 Multibit Wires and Registers
reg [31:0] rega;            // rega[0] is the LSB
wire [31:0] ALUout;
reg [31:0] regfile[0:31];   // regfile[0] is the first register in the register file

44 MIPS ALU
module MIPSALU (ALUctl, A, B, ALUOut, Zero);
  input [3:0] ALUctl;
  input [31:0] A, B;
  output reg [31:0] ALUOut;
  output Zero;
  assign Zero = (ALUOut == 0);  // Zero is true if ALUOut is 0
  always @(ALUctl, A, B) begin  // reevaluate if these change
    case (ALUctl)
      0: ALUOut = A & B;
      1: ALUOut = A | B;
      2: ALUOut = A + B;
      6: ALUOut = A - B;
      7: ALUOut = A < B ? 1 : 0;
      12: ALUOut = ~(A | B);    // result is nor
      default: ALUOut = 0;
    endcase
  end
endmodule
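
When simulating a module like this, it helps to have a software reference model to compare against. The Python sketch below is one possible model, assuming the usual MIPS control encoding (0 AND, 1 OR, 2 add, 6 subtract, 7 set on less than, 12 NOR) and unsigned 32-bit operands; the names `mips_alu` and `MASK` are illustrative.

```python
MASK = 0xFFFFFFFF  # keep results to 32 bits, as the 32-bit register does


def mips_alu(ctl, a, b):
    """Return (ALUOut, Zero) for unsigned 32-bit operands a and b."""
    if ctl == 0:
        out = a & b
    elif ctl == 1:
        out = a | b
    elif ctl == 2:
        out = (a + b) & MASK
    elif ctl == 6:
        out = (a - b) & MASK
    elif ctl == 7:
        out = 1 if a < b else 0      # unsigned comparison
    elif ctl == 12:
        out = ~(a | b) & MASK        # NOR
    else:
        out = 0
    return out, int(out == 0)


print(mips_alu(2, 0xFFFFFFFF, 1))   # (0, 1): the sum wraps around to zero
print(mips_alu(6, 5, 5))            # (0, 1): equal operands set Zero
```

The Zero output after a subtract is what a branch-on-equal instruction tests.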

45 Register File

46 Register File: read

47 Register File: write

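48 Register File: Verilog
module rfile(R1, R2, W, WD, Wctl, RD1, RD2, clock);
  input [4:0] R1, R2, W;     // 5 bits select one of 32 registers
  input [31:0] WD;
  input Wctl, clock;
  output [31:0] RD1, RD2;
  reg [31:0] RF[31:0];       // 32 registers of 32 bits each
  assign RD1 = RF[R1];       // reads are combinational
  assign RD2 = RF[R2];
  always @(negedge clock)    // assumed falling edge, as on slide 40
    if (Wctl) RF[W] <= WD;   // write only when the write control is asserted
endmodule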

49 Specifying Gates Verilog allows the designer to specify individual gates Can be bulky Similar syntax can be used for user defined modules

50 Half Adder
module HA(A, B, S, C);
  input A, B;
  output S, C;
  wire Bn, An, ABn, AnB;
  not N1(An, A);       // An = A'
  not N2(Bn, B);       // Bn = B'
  and (ABn, A, Bn);    // ABn = A B'
  and (AnB, An, B);    // AnB = A' B
  or (S, ABn, AnB);    // S = A B' + A' B
  and (C, A, B);       // C = A B
endmodule

51 Speeding Up Addition: Carry propagation is what slows down addition. Sometimes the LSB of the input will affect the MSB of the output. We design for the worst case scenario. The simpler adders are called ripple adders.

52 Carry LookAhead
a0, a1, a2, etc. and b0, b1, b2, etc. are the inputs; c0, c1, c2, etc. are the carries.
c1 = b0 c0 + a0 c0 + a0 b0
c1 = a0 b0 + c0 (a0 + b0)
c1 = g0 + c0 p0
g0 = a0 b0; p0 = a0 + b0

53 Carry LookAhead
Define g_i = a_i b_i (generate) and p_i = a_i + b_i (propagate)
Then c_(i+1) = g_i + p_i c_i

54 Carry LookAhead
c1 = g0 + p0 c0
c2 = g1 + p1 g0 + p1 p0 c0
c3 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0
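
The expanded carry equations can be checked against ordinary addition. The Python sketch below computes every carry directly from the g and p signals and c0, with g = a AND b and p = a OR b as on slide 52, and never from the previous carry. The function names and the 4-bit default width are assumptions made for illustration.

```python
def bits(x, n):
    """The n low bits of x, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def lookahead_carries(a, b, c0):
    """Carries [c0, c1, ..., cn], each built only from g, p and c0."""
    g = [ai & bi for ai, bi in zip(a, b)]   # generate
    p = [ai | bi for ai, bi in zip(a, b)]   # propagate
    carries = [c0]
    for i in range(len(a)):
        c = 0
        for k in range(i + 1):              # term g_k p_(k+1) ... p_i
            term = g[k]
            for j in range(k + 1, i + 1):
                term &= p[j]
            c |= term
        term = c0                           # term p_i ... p_0 c0
        for j in range(i + 1):
            term &= p[j]
        carries.append(c | term)
    return carries


def cla_add(x, y, c0=0, n=4):
    """Add two n-bit numbers; returns (n-bit sum, carry out)."""
    a, b = bits(x, n), bits(y, n)
    c = lookahead_carries(a, b, c0)
    s = sum((a[i] ^ b[i] ^ c[i]) << i for i in range(n))
    return s, c[n]


print(cla_add(9, 8))   # (1, 1): 9 + 8 = 17 = 16 + 1
```

In hardware all these product terms are evaluated in parallel, which is where the speedup over a ripple adder comes from; the nested loops here only enumerate the terms.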

55 Control Hazards Whenever we have a branch/jump/jal/whatever We find out which way we branch at the MEM stage Meanwhile we have loaded the next three instructions We have to flush the pipeline We waste three cycles

56 What is the problem Jumps/branches are very common 25% of the instructions sometimes If our processor is 4 way superscalar wasting three cycles means we do not execute 12 instructions! Longer pipelines suffer even more
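
The cost can be put in numbers with a simple model: average cycles per instruction equals the ideal value plus the branch fraction times the flush penalty. The Python sketch below assumes, as the slides do at this point, that every branch costs the full three cycles; the function name `effective_cpi` is illustrative.

```python
def effective_cpi(base_cpi, branch_fraction, penalty_cycles):
    """Average cycles per instruction when every branch flushes the pipeline."""
    return base_cpi + branch_fraction * penalty_cycles


single_issue = effective_cpi(1.0, 0.25, 3)   # ideal CPI of 1
four_way = effective_cpi(0.25, 0.25, 3)      # ideal CPI of 1/4
print(single_issue, four_way)                # 1.75 1.0
```

Under these assumptions the single issue pipeline runs 1.75 times slower than its ideal, while the 4 way superscalar runs 4 times slower than its ideal, so the wider machine loses far more.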

57 Solutions Delayed branch Means that next instruction is always executed Ideally an instruction from before that is independent of the branch An instruction from fall through that has no effect if branch is taken An instruction from the target that has no effect if branch falls through A nop if all else is unavailable We save at most one cycle

58 Solutions Decide the branch at ID stage Requires extra hardware Saves two cycles With branch delay can be stall free

59 Solutions Always predict not taken The easiest... just do what we did so far Fails miserably for loops

60 Solutions: Predict taken. We can do this at the ID stage. We waste only one cycle if the prediction is correct. Combined with delayed branch, the cost goes to zero (if the prediction is correct). Works fine for many loops.

61 Solutions Statically predict taken/not taken Can be done with heuristics Or by giving the compiler an execution trace Just have two variants of every branch instruction Easy to implement Works great for numerical programs Not so great for non numerical

62 Solutions: Dynamic prediction. The most advanced and most popular. Requires a lot of silicon area. Can be done by hashing the branch address into a small memory (Branch Prediction Buffer). The memory remembers 1 bit (taken/not taken). With 1 bit, each execution of a loop has two mispredictions: one on exit and one on re-entry. This can be solved with two bit prediction. There are many far more sophisticated techniques.
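
The difference between one and two bits shows up in a small simulation of a single predictor entry. The Python sketch below uses a saturating counter, one common way to implement two bit prediction, and assumes the counter starts at "not taken". The names `mispredictions` and `LOOP` are illustrative.

```python
def mispredictions(outcomes, bits):
    """Count mispredictions of one saturating counter with the given bits."""
    top = (1 << bits) - 1          # highest counter value
    counter = 0                    # assumption: start at "not taken"
    misses = 0
    for taken in outcomes:
        predict_taken = counter > top // 2   # upper half of the range
        if predict_taken != taken:
            misses += 1
        # Move toward the actual outcome, saturating at 0 and top.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return misses


# A loop branch taken 9 times and then not taken, executed 10 times over.
LOOP = ([True] * 9 + [False]) * 10
print(mispredictions(LOOP, 1), mispredictions(LOOP, 2))  # 20 12
```

With one bit the predictor misses twice per execution of the loop. With two bits it misses once per execution after a short warm-up, because a single not-taken outcome is not enough to flip the prediction.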

63 Solutions Speculation The technique nowadays Good when the control hazard is compounded by a data hazard Should allow out of order execution Should provide a way to undo a change after a failed speculation

64 Exceptions/Interrupts There is a difference Exceptions are caused by an internal condition Error, system call Interrupts are caused by external conditions I/O complete, mouse clicks In many cases all are called interrupts They are handled in more or less the same way

65 Why bother A computer that does not communicate with its environment is called a brick The extra hardware to detect and handle interrupts is large and contributes to the slowing down of the clock That's part of the reason why some co processors run so much faster

66 How are they handled: The CPU provides relevant info in two registers: EPC (Exception Program Counter), 32 bits, and the Cause Register, 32 bits but many unused. Alternatively, use vectored interrupts: for each possible cause there is an entry in the vector.

67 In more detail Another form of control hazard Instead of branching to a user space address, branch to a kernel space address Branches happen only at a particular stage in the pipeline, but exceptions can happen almost anywhere More than one exception can happen at the same time in different instructions We may need to restart the instruction after the exception is handled Some instructions are handled on the spot, others where they happened.

68 Instruction Level Parallelism What drove the speed of cpus Pipelining is the oldest technique Race to reduce hazards Programmer is unaware of the parallelism The key is multiple issue We encounter hazards on hormones

69 Two kinds: Static multiple issue (VLIW) uses a fixed form issue packet. It was the technique used on Itanium. There are usually restrictions on which instructions can be packaged together. In some designs the compiler has to guarantee that there are no data or structural hazards within the issue packet. The other kind, dynamic multiple issue, is covered on slide 82.

70 Extra cost If we allow the issue of an ALU and a memory instruction at the same time we need Twice as many ports on the register file An extra adder to calculate the effective address Ability to detect/forward many more hazards between different issue packets Stalls create twice as much delay

71 Advantage With two issue we have possibly twice as fast processor (if the world was made by angels) We do not need much more hardware With a good compiler the C programmer will never know

72 Disadvantage We have to recompile for new architectures We save a bit on hardware but it is hard to make use of advances immediately Software vendors hated it Itanium is dead.

73 Example: VLIW for MIPS A simplified static multiple issue MIPS like processor Can issue one ALU/branch and one load/store instruction per cycle. Ignores dependencies within the issue packet. Stalls/forwards for dependencies between issue packets

74 Example: VLIW for MIPS
Instruction type   Pipeline stages (one column per cycle)
ALU/branch         IF ID EX M  WB
Load/Store         IF ID EX M  WB
ALU/branch            IF ID EX M  WB
Load/Store            IF ID EX M  WB
ALU/branch               IF ID EX M  WB
Load/Store               IF ID EX M  WB
ALU/branch                  IF ID EX M  WB
Load/Store                  IF ID EX M  WB

75 Example: scheduling code
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $0, Loop

76 Scheduled
      ALU/branch            Load/Store
Loop: nop                   lw $t0, 0($s1)
      addi $s1, $s1, -4     nop
      addu $t0, $t0, $s2    nop
      bne $s1, $0, Loop     sw $t0, 4($s1)

77 The Verdict: We can do it in 4 issue packets instead of five instructions. Before, we had one or two stalls, so it would take 6 to 7 cycles to execute, plus the stalls due to the branch. If we optimize the single issue version we can get it down to 5 cycles plus branch stalls. Now we can execute it in 4 cycles plus branch stalls.

78 Observations We now have many more stalls/nops than single issue The new stalls/nops eat up most of the improvement It is not worth the extra hardware/power consumption Is it the end of the road?

79 Loop unrolling Compiler optimization Can be done easily when loops are independent Sometimes even when they are not independent Reduces the loop overhead Fewer instructions executed Allows more freedom in scheduling Fewer stalls/nops

80 The code
      ALU/branch            Load/Store
Loop: addi $s1, $s1, -16    lw $t0, 0($s1)
      nop                   lw $t1, 12($s1)
      addu $t0, $t0, $s2    lw $t2, 8($s1)
      addu $t1, $t1, $s2    lw $t3, 4($s1)
      addu $t2, $t2, $s2    sw $t0, 16($s1)
      addu $t3, $t3, $s2    sw $t1, 12($s1)
      nop                   sw $t2, 8($s1)
      bne $s1, $0, Loop     sw $t3, 4($s1)
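
Counting the useful slots makes the gain from unrolling concrete. The Python sketch below assumes the two schedules are the 4 packet version of the original loop and the 8 packet unrolled version with two nops, and lists only the opcodes. The names `issue_stats`, `ROLLED` and `UNROLLED` are illustrative.

```python
def issue_stats(packets):
    """Return (useful instructions, cycles, instructions per cycle)."""
    useful = sum(op != "nop" for packet in packets for op in packet)
    return useful, len(packets), useful / len(packets)


# Each packet is (ALU/branch slot, load/store slot).
ROLLED = [("nop", "lw"), ("addi", "nop"), ("addu", "nop"), ("bne", "sw")]
UNROLLED = [
    ("addi", "lw"), ("nop", "lw"), ("addu", "lw"), ("addu", "lw"),
    ("addu", "sw"), ("addu", "sw"), ("nop", "sw"), ("bne", "sw"),
]

print(issue_stats(ROLLED))     # (5, 4, 1.25)
print(issue_stats(UNROLLED))   # (14, 8, 1.75)
```

The unrolled version processes four array elements in 8 cycles, i.e. 2 cycles per element, against 4 cycles per element for the rolled schedule.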

81 The tricks we used Unroll the loop, eliminate the branches, simplify loop variable updating Use more temp registers This is called register renaming We need to do it if we have anti dependence or name dependence We may run out of registers or need more saving/restoring Longer code May not be optimal in all architectures

82 Dynamic Multiple Issue: A.K.A. superscalars. The processor decides whether it is going to issue 0, 1, 2, ... instructions in a given cycle. Instructions are allowed to execute out of order, but not necessarily to complete out of order. The compiler does not need to know.

83 Dynamic Pipeline scheduling
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
The sub instruction can execute before addu, because addu must wait for the lw result while sub depends on neither.
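
Which instructions may run early follows from the read-after-write dependences. The Python sketch below finds them for the four instructions on this slide, with each instruction written as (opcode, destination, sources). This encoding and the function name `raw_dependencies` are assumptions made for illustration, and real hardware tracks more than this (for example memory dependences).

```python
PROGRAM = [
    ("lw",   "$t0", ["$s2"]),
    ("addu", "$t1", ["$t0", "$t2"]),
    ("sub",  "$s4", ["$s4", "$t3"]),
    ("slti", "$t5", ["$s4"]),
]


def raw_dependencies(program):
    """Map each instruction index to the earlier indices whose results it reads."""
    last_writer = {}   # register -> index of the latest instruction writing it
    deps = {}
    for i, (_, dest, sources) in enumerate(program):
        deps[i] = sorted({last_writer[r] for r in sources if r in last_writer})
        last_writer[dest] = i
    return deps


print(raw_dependencies(PROGRAM))  # {0: [], 1: [0], 2: [], 3: [2]}
```

The sub instruction (index 2) has an empty dependence list, so it can start while addu (index 1) is still waiting for the load.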

84 Dynamic Pipeline (Figure: the IF/ID unit feeds several reservation stations, each in front of its own execution unit; all execution units feed the commit unit.)

85 The bad news: Dynamic multiple issue CPUs have been available for decades. Some can issue more than 4 instructions per cycle, but they rarely complete more than 2 per cycle on average. They have to be conservative to maintain correctness (e.g. pointer aliasing).

86 Power Efficiency Power has emerged as the limiting factor Cost of energy goes up Huge server farms are common Ability to eliminate heat is limited Battery life is very important Environmental concerns

87 Fallacies and Pitfalls
Fallacy: Pipelining is easy. Real pipelining is quite complex.
Fallacy: Pipelining is independent of technology. The huge number of transistors offers options that annul previous technologies (huge pipelines vs delayed branches).
Pitfall: Some optimizations in the ISA spoil the speed of the pipeline.
