ECE 2300 Digital Logic & Computer Organization. More Caches Measuring Performance

Dale Greer
5 years ago

1 ECE 2300 Digital Logic & Computer Organization, Spring 2018
More Caches, Measuring Performance

2 Announcements
- HW7 due tomorrow 11:59pm
- Prelab 5(c) due Saturday 3pm
- Lab 6 (last one) released
- HW8 (last one) to be released tonight

3 Another LRU Replacement Example
2-way set associative cache. Every block address in this trace maps to index 0, so all blocks compete for the two ways of Set 0 and Set 1 stays empty.
(*) = LRU block (1 bit per set in this case)

Block address  Index  Hit/miss  Set 0 contents after access
0              0      miss      Mem[0]
4              0      miss      Mem[0] (*)  Mem[4]
2              0      miss      Mem[2]      Mem[4] (*)
6              0      miss      Mem[2] (*)  Mem[6]
8              0      miss      Mem[8]      Mem[6] (*)
0              0      miss      Mem[8] (*)  Mem[0]
4              0      miss      Mem[4]      Mem[0] (*)
2              0      miss      Mem[4] (*)  Mem[2]
6              0      miss      Mem[6]      Mem[2] (*)
8              0      miss      Mem[6] (*)  Mem[8]
2              0      miss      Mem[2]      Mem[8] (*)
6              0      miss      Mem[2] (*)  Mem[6]
2              0      hit       Mem[2]      Mem[6] (*)
0              0      miss      Mem[2] (*)  Mem[0]

Miss types (color-coded on the slide): cold miss, conflict miss, capacity miss
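The LRU trace above can be replayed with a few lines of Python. This is only a sketch under stated assumptions: the cache is taken to have two sets of two ways, the set index is the block address modulo the number of sets, and the access sequence is read as 0, 4, 2, 6, 8, 0, 4, 2, 6, 8, 2, 6, 2, 0. The function name `simulate_lru` is made up for illustration.

```python
from collections import OrderedDict

def simulate_lru(block_addresses, num_sets=2, ways=2):
    """Return one (block, set_index, outcome) tuple per access."""
    # One OrderedDict per set; key order runs from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    trace = []
    for block in block_addresses:
        index = block % num_sets            # assumed index function
        lines = sets[index]
        if block in lines:
            lines.move_to_end(block)        # now the most recently used block
            outcome = "hit"
        else:
            if len(lines) == ways:
                lines.popitem(last=False)   # evict the least recently used block
            lines[block] = True
            outcome = "miss"
        trace.append((block, index, outcome))
    return trace

accesses = [0, 4, 2, 6, 8, 0, 4, 2, 6, 8, 2, 6, 2, 0]
trace = simulate_lru(accesses)
for block, index, outcome in trace:
    print(f"block {block} -> set {index}: {outcome}")
```

Under these assumptions every access lands in set 0, and the only hit is the second-to-last access (block 2).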

4 What About Writes?
Where do we put the result of a store?
- Cache hit (block is in cache)
  - Write the new data value to the cache
  - Also write to memory (write through), or
  - Don't write to memory (write back)
    - Requires an additional dirty bit for each cache block
    - Writes back to memory when a dirty cache block is evicted
- Cache miss (block is not in cache)
  - Allocate the line (bring it into the cache) (write allocate), or
  - Write to memory without allocation (no write allocate, or write around)
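To make the four policy choices concrete, here is a minimal sketch of a direct-mapped cache that tracks no data at all and only counts traffic to memory. The class `Cache`, the helper `run`, and the access pattern (four stores to block 0, then a load of block 2 that maps to the same line) are assumptions made for illustration, not part of the lecture.

```python
class Cache:
    """Direct-mapped cache model that only counts traffic to memory."""

    def __init__(self, num_blocks, write_back, write_allocate):
        self.num_blocks = num_blocks
        self.write_back = write_back
        self.write_allocate = write_allocate
        self.tags = [None] * num_blocks     # block address held by each line
        self.dirty = [False] * num_blocks   # only meaningful for write back
        self.mem_reads = 0                  # blocks fetched from memory
        self.mem_writes = 0                 # writes sent to memory

    def _fill(self, block):
        index = block % self.num_blocks
        if self.tags[index] is not None and self.dirty[index]:
            self.mem_writes += 1            # evicting a dirty block writes it back
        self.tags[index] = block
        self.dirty[index] = False
        self.mem_reads += 1

    def load(self, block):
        if self.tags[block % self.num_blocks] != block:
            self._fill(block)

    def store(self, block):
        index = block % self.num_blocks
        hit = self.tags[index] == block
        if not hit and not self.write_allocate:
            self.mem_writes += 1            # write around: memory only, no fill
            return
        if not hit:
            self._fill(block)               # write allocate: bring the block in
        if self.write_back:
            self.dirty[index] = True        # defer the memory update
        else:
            self.mem_writes += 1            # write through: update memory now

def run(write_back, write_allocate):
    cache = Cache(num_blocks=2, write_back=write_back, write_allocate=write_allocate)
    for _ in range(4):
        cache.store(0)
    cache.load(2)                           # maps to the same line as block 0
    return cache.mem_reads, cache.mem_writes

print("write through, write allocate:", run(False, True))
print("write back, write allocate:   ", run(True, True))
print("write through, write around:  ", run(False, False))
```

In this toy pattern, write back sends a single write to memory (when the dirty block is evicted), while write through sends one per store.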

5 Write Through Example
- Assume write allocate
- Size of each block is 8 bytes
- Cache holds 2 blocks
- Memory holds 8 blocks
- Cache line fields: V, tag, data
- Address: 2 tag bits, 1 index bit, 3 byte offset bits
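The address layout in this example (2 tag bits, 1 index bit, 3 byte offset bits, so a 6-bit address) can be sketched as a small helper. The function name `split_address` and the sample address are illustrative assumptions.

```python
OFFSET_BITS = 3   # 8-byte blocks
INDEX_BITS = 1    # cache holds 2 blocks
TAG_BITS = 2      # remaining bits of the 6-bit address

def split_address(addr):
    """Split a 6-bit byte address into (tag, index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# 0b10_1_101: tag = 0b10, index = 1, byte offset = 0b101
print(split_address(0b101101))
```

Two addresses hit the same cache line when their index bits match, and they refer to the same block only when their tag bits match as well.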

6 to 16 Write Through
[Figure: slide-by-slide trace of a sequence of loads and stores (register/memory transfers of the form M[...] <= R...) through the write-through cache, showing the V, tag, and data fields after each step. Outcome shown on each slide, in order: miss, miss, miss, hit, hit, miss, miss, miss, miss, miss, miss.]

17 Write Back Example
- Assume write allocate
- Size of each block is 8 bytes
- Cache holds 2 blocks
- Memory holds 8 blocks
- Cache line fields: V, D (dirty bit), tag, data
- Address: 2 tag bits, 1 index bit, 3 byte offset bits

18 to 29 Write Back
[Figure: slide-by-slide trace of the same kind of load/store sequence through the write-back cache, showing the V, D, tag, and data fields after each step. Outcome shown on each slide, in order: miss, miss, miss, hit, hit, miss, miss, miss, miss, miss, miss, miss.]

30 Cache Hierarchy
- Time to get a block from memory is so long that performance suffers even with a low miss rate
- Example: 3% miss rate, 100 cycles to main memory
  - 0.03 × 100 = 3 extra cycles on average to access instructions or data
- Solution: add another level of cache

31 Pipeline with a Cache Hierarchy
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. The L1 instruction cache (KB) and L1 data cache (KB) sit in the pipeline, backed by an L2 cache (MB) and main memory (GB).]

32 Cache Hierarchy
- Level 1 (L1) instruction and data caches
  - Small, but very fast
- Level 2 (L2) cache handles L1 misses
  - Larger and slower than L1, but much faster than main memory
  - L1 data are also present in L2
- Main memory handles L2 cache misses
- Example: assume 1 cycle to access L1 (3% miss rate), 10 cycles to L2, 10% L2 miss rate, 100 cycles to main memory
  - How many cycles on average for instruction/data access?
  - 1 + 0.03 × (10 + 0.10 × 100) = 1.6 cycles
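The average access time for a hierarchy can be computed level by level, starting from the level closest to main memory. The sketch below assumes the figures of this example are 1 cycle for L1 with a 3% miss rate, 10 cycles for L2 with a 10% miss rate, and 100 cycles for main memory. The function name `average_access_cycles` is made up for illustration.

```python
def average_access_cycles(hit_time, miss_rate, miss_penalty):
    """Average cycles per access: hit time plus the expected miss cost."""
    return hit_time + miss_rate * miss_penalty

# Without an L2, an L1 miss goes straight to main memory.
l1_only = average_access_cycles(hit_time=1, miss_rate=0.03, miss_penalty=100)

# With an L2, the L1 miss penalty is the average time to access L2.
l2_cost = average_access_cycles(hit_time=10, miss_rate=0.10, miss_penalty=100)
l1_with_l2 = average_access_cycles(hit_time=1, miss_rate=0.03, miss_penalty=l2_cost)

print(f"L1 only:   {l1_only:.1f} cycles")
print(f"L1 and L2: {l1_with_l2:.1f} cycles")
```

With these assumed numbers the L2 cache cuts the average from 4 cycles to 1.6 cycles per access.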

33 How Do We Measure Performance?
- Execution time: the time between the start and completion of a program (or task)
- Throughput: total amount of work done in a given time
- Improving performance means
  - Reducing execution time, or
  - Increasing throughput

34 CPU Execution Time
Amount of time the CPU takes to run a program
Derivation: CPU Time = I × CPI × CT
- I = number of instructions in the program
- CPI = average number of cycles per instruction
- CT = clock cycle time (1/frequency)

35 Instruction Count (I)
Total number of instructions in the given program
Factors
- Instruction set
- Mix of instructions chosen by the compiler

36 Cycle Time (CT)
Clock period (1/frequency)
Factors
- Instruction set
- Structure of the processor and memory hierarchy

37 Cycles Per Instruction (CPI)
Average number of cycles required to execute each instruction
Factors
- Instruction set
- Mix of instructions chosen by the compiler
- Ordering of the instructions by the compiler
- Structure of the processor and memory hierarchy
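The three factors multiply together to give execution time. A minimal sketch, with entirely hypothetical numbers (one billion instructions, a 2 GHz clock, and two CPI values) chosen only to show the arithmetic:

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """CPU time = I x CPI x CT, where CT = 1 / clock frequency."""
    cycle_time = 1.0 / clock_hz
    return instruction_count * cpi * cycle_time

# Hypothetical program and machine, not taken from the lecture.
baseline = cpu_time_seconds(instruction_count=1_000_000_000, cpi=1.5, clock_hz=2_000_000_000)
improved = cpu_time_seconds(instruction_count=1_000_000_000, cpi=1.2, clock_hz=2_000_000_000)

print(f"baseline: {baseline:.2f} s")
print(f"improved: {improved:.2f} s")
print(f"speedup:  {baseline / improved:.2f}x")
```

Because the factors multiply, lowering CPI at a fixed instruction count and clock rate shortens execution time by the same ratio.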

38 Organization Impact on CPI (Example 1)

Instruction     CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
ADD R,,         IM   Reg  ALU  DM   Reg
OR R4,R,             IM   Reg  ALU  DM   Reg
SUB R5,,R                 IM   Reg  ALU  DM   Reg
AND R6,R,                      IM   Reg  ALU  DM   Reg
ADDI R7,R7,3                        IM   Reg  ALU  DM   Reg

With forwarding: reduced stall cycles, so lower CPI and potentially reduced execution time

39 Organization Impact on CPI (Example 2)
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, instruction RAM and data RAM, with the branch comparison (sign bit, =?) and control signals placed in the ID stage.]
Only one delay slot is needed when the branch is resolved in ID, so CPI is lower

40 Compiler Impact on CPI (Example 3)

Instruction                       CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
BEQ,,X                            IM   Reg  ALU  DM   Reg
ADDI R7,R7,3 (in place of NOP)         IM   Reg  ALU  DM   Reg
OR R4,R,                                    IM   Reg  ALU  DM   Reg
SUB R5,,R                                        IM   Reg  ALU  DM   Reg
X: AND R6,R,                                          IM   Reg  ALU  DM   Reg
ADDI R7,R7,3 (original position)
...

Filling the branch delay slot with a useful instruction

41 A Rough Breakdown of CPI
- CPI_base is the base CPI in an ideal scenario where instruction fetches and data memory accesses incur no extra delay
- CPI_memhier is the (additional) CPI spent accessing the memory hierarchy when a miss occurs in the caches
- CPI_total is the overall CPI
CPI_total = CPI_base + CPI_memhier

42 Impact of L1 Caches
With L1 caches
- L1 instruction cache miss rate = 2%
- L1 data cache miss rate = 4%
- Miss penalty = 100 cycles (access main memory)
- 20% of all instructions are loads, 10% are stores
CPI_memhier = 0.02 × 100 + 0.30 × 0.04 × 100 = 2.0 + 1.2 = 3.2

43 Impact of L1+L2 Caches
With L1 and L2 caches
- L1 instruction cache miss rate = 2%
- L1 data cache miss rate = 4%
- L2 access time = 15 cycles
- L2 miss rate = 25%
- L2 miss penalty = 100 cycles (access main memory)
- 20% of all instructions are loads, 10% are stores
CPI_memhier = 0.02 × (15 + 0.25 × 100) + 0.30 × 0.04 × (15 + 0.25 × 100) = 0.8 + 0.48 = 1.28
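The two memory-hierarchy CPI calculations can be checked with a short script. It is a sketch under assumptions: every instruction makes one instruction fetch, loads and stores together are taken to be 30% of instructions (20% loads plus 10% stores), main memory costs 100 cycles, and the L2 access time is taken to be 15 cycles. The function name `cpi_memhier` is made up for illustration.

```python
def cpi_memhier(i_miss_rate, d_miss_rate, data_accesses_per_instr, cycles_per_l1_miss):
    """Extra cycles per instruction caused by L1 instruction and data misses."""
    # One instruction fetch per instruction, plus the loads and stores.
    misses_per_instr = i_miss_rate + data_accesses_per_instr * d_miss_rate
    return misses_per_instr * cycles_per_l1_miss

MEM_CYCLES = 100          # assumed main memory access time
LOADS_AND_STORES = 0.30   # assumed 20% loads + 10% stores

# L1 only: every L1 miss pays the full main memory penalty.
l1_only = cpi_memhier(0.02, 0.04, LOADS_AND_STORES, MEM_CYCLES)

# L1 + L2: an L1 miss pays the L2 access time, plus main memory on an L2 miss.
l2_cost = 15 + 0.25 * MEM_CYCLES
l1_l2 = cpi_memhier(0.02, 0.04, LOADS_AND_STORES, l2_cost)

print(f"CPI_memhier with L1 only:   {l1_only:.2f}")
print(f"CPI_memhier with L1 and L2: {l1_l2:.2f}")
```

Under these assumptions the L2 cache lowers the memory-hierarchy CPI from 3.2 to 1.28.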

44 Before Next Class
- H&H 8.4
Next Time
- Virtual Memory
