administrivia final hour exam next Wednesday covers assembly language like hw and worksheets

- Alaina Hawkins
1 administrivia: final hour exam next Wednesday; covers assembly language (like hw and worksheets). today: last worksheet, then start looking at more details on hardware, which is not covered on ANY exam. probably won't finish these slides today. any questions on the assignment?
2 more architecture: remember how the cpu executes instructions? multiple simple steps...
3-10 [CPU (logical) diagram, built up over several slides: registers eax ebx ecx edx ... EBP SP PC, a Decode unit, an ALU, a Mem Buffer, and Memory. The animation steps the instruction add %eax,%ebx through the PHASES: FETCH the instruction, DECODE it, OPFETCH the operands (eax = 3, ebx = 4), EXECUTE the add in the ALU (3 + 4 = 7), and WRITEBACK the result 7 into ebx.]
11 computer performance: a modern processor runs at multiple GHz, billions of cycles per second. that says the clock cycle is < 1 ns, less than a billionth of a second; even silicon cannot do much in that time. the cpu only executes one step per cycle, and takes multiple cycles to execute one instruction.
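The slide's arithmetic can be checked in a couple of lines (a minimal sketch; the 3 GHz clock is an assumed example, not a figure from the slide):

```python
# Clock cycle time for an assumed 3 GHz processor.
freq_hz = 3e9                          # 3 billion cycles per second (assumed example)
cycle_s = 1.0 / freq_hz                # seconds per cycle
cycle_ns = cycle_s * 1e9               # nanoseconds per cycle
print(f"{cycle_ns:.3f} ns per cycle")  # well under 1 ns, as the slide says
```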
12-14 overall performance: on the other hand, the processor does MORE than one add per cycle. doesn't that contradict the previous slide? no, because computer designers are clever.
15 overlapping instructions: one set of transistors can only do one thing in one cycle, but the cpu has LOTS of transistors, so it can do lots of things at once: work on multiple instructions at once.
16 washing: consider doing wash with 1 washer / 1 dryer. if each takes 45 minutes, it takes 1.5 hours to do 1 load, maybe 2 hours if you count pre-treating/sorting and folding/hanging. it does not take 6 hours to do 3 loads!
17-18 overlap washing steps: it takes 2 hours for the first load to be done, but each extra load only takes 45 minutes more. if you had 1000 loads you would think of it as taking 45 minutes per load (and would really hate laundry!).
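The laundry numbers above can be sketched directly (assuming the slide's 45-minute stage time and roughly 2 hours for the first complete load):

```python
def laundry_time_minutes(loads, stage_min=45, first_load_min=120):
    """Pipelined laundry: the first load takes the full 2 hours;
    each additional load finishes 45 minutes after the previous one."""
    if loads == 0:
        return 0
    return first_load_min + (loads - 1) * stage_min

print(laundry_time_minutes(3) / 60)     # 3 loads: 3.5 hours, far less than 6
print(laundry_time_minutes(1000) / 60)  # ~750 hours: ~45 min per load amortized
```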
19 code example: consider the code
    movl %edx,%ecx
    sarl $4,%eax
    addl %ebx,%ecx
    subl %edx,%eax
20 code example: the first instruction must execute
    movl %edx,%ecx   FET DEC OPF EXEC WB
21-26 code example: the second instruction can start soon after; there is never competition for the same transistors
    movl %edx,%ecx   FET DEC OPF EXEC WB
    sarl $4,%eax         FET DEC OPF EXEC WB
27-33 example: the third instruction follows suit
    movl %edx,%ecx   FET DEC OPF EXEC WB
    sarl $4,%eax         FET DEC OPF EXEC WB
    addl %ebx,%ecx           FET DEC OPF EXEC WB
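The stage tables on these slides can be generated mechanically. A minimal sketch of an ideal 5-stage pipeline schedule, where each instruction enters FET one cycle after its predecessor (no hazards modeled):

```python
STAGES = ["FET", "DEC", "OPF", "EXEC", "WB"]

def schedule(instrs):
    """Return {instruction: {cycle: stage}} for an ideal pipeline
    where instruction i enters FET at cycle i."""
    table = {}
    for i, ins in enumerate(instrs):
        table[ins] = {i + c: stage for c, stage in enumerate(STAGES)}
    return table

code = ["movl %edx,%ecx", "sarl $4,%eax", "addl %ebx,%ecx", "subl %edx,%eax"]
for ins, cells in schedule(code).items():
    row = "".join(f"{cells.get(c, ''):>6}" for c in range(len(code) + len(STAGES) - 1))
    print(f"{ins:18}{row}")
```

Four instructions finish in 8 cycles total instead of 20, and once the pipeline is full one instruction completes every cycle.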
34-41 [datapath diagrams, one per cycle: first movl in FET; then movl in DEC while sarl is in FET; then movl in OPF, sarl in DEC, addl in FET; then movl in EXEC, sarl in OPF, addl in DEC, subl in FET. the registers eax ebx ecx edx ... ebp esp PC hold values a b c d, and the instructions flow through the Decode unit, ALU, and Mem Buffer.]
42 pipelining: this overlapping of instructions is called pipelining. done in all CPUs for the last 15 years or so; a big part of the speed up. the clock speed is limited by the SLOWEST phase.
43-45 hazard: anyone see a problem here?
    movl %edx,%ecx   FET DEC OPF EXEC WB    (writes %ecx)
    sarl $4,%eax         FET DEC OPF EXEC WB
    addl %ebx,%ecx           FET DEC OPF EXEC WB    (reads %ecx)
addl fetches %ecx in OPF before movl has written it back in WB.
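The problem here is a read-after-write (RAW) dependence, and it can be detected from register names alone. A small sketch (the hand-annotated read/write sets and the two-instruction danger window are illustrative simplifications of the 5-stage pipeline):

```python
def raw_hazards(instrs):
    """Find read-after-write hazards: a later instruction reads a register
    that an earlier instruction is still in the pipeline writing.
    Each instruction is (text, reads, writes), hand-annotated here."""
    hazards = []
    for i, (_, _, writes) in enumerate(instrs):
        # in a 5-stage pipeline, the next couple of instructions fetch
        # operands before this instruction's WB updates the register
        for j in range(i + 1, min(i + 3, len(instrs))):
            clash = writes & instrs[j][1]
            if clash:
                hazards.append((instrs[i][0], instrs[j][0], clash))
    return hazards

prog = [
    ("movl %edx,%ecx", {"%edx"}, {"%ecx"}),
    ("sarl $4,%eax",   {"%eax"}, {"%eax"}),
    ("addl %ebx,%ecx", {"%ebx", "%ecx"}, {"%ecx"}),
]
print(raw_hazards(prog))  # movl writes %ecx; addl, two slots later, reads it
```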
46-48 [datapath diagrams: movl in WB, sarl in EXEC, addl in OPF, subl in DEC. the ALU computes c+d and the register file's ecx is updated from c to c+d, while the dependent instruction has already fetched its operands.]
49 forwarding: special hardware in opfetch reads the result when needed; guarantees the correct result.
50-52 [datapath diagrams: movl in WB, sarl in EXEC, addl in OPF, subl in DEC. the c+d value coming out of the ALU is forwarded directly to the opfetch stage of the dependent instruction, and is also written back into ecx.]
53 stalls: can still stall if execute is not finished; must wait for the value to be computed. the compiler schedules instructions to avoid these stalls.
54-56 stalls: the code example had no stalls
    movl %edx,%ecx
    sarl $4,%eax
    addl %ebx,%ecx
    subl %edx,%eax
what if reordered? (to a more natural ordering)
    movl %edx,%ecx
    addl %ebx,%ecx   (stall on ecx)
    sarl $4,%eax
    subl %edx,%eax   (stall on eax)
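The difference between the two orderings can be verified by counting stalls. A sketch of the lecture's model, where (with forwarding) an instruction stalls only when it reads a value produced by the immediately preceding instruction; read/write sets are hand-annotated:

```python
def count_stalls(instrs, min_gap=1):
    """Count stalls assuming forwarding: an instruction that reads a register
    written by one of the previous min_gap instructions must wait."""
    stalls = 0
    for i in range(1, len(instrs)):
        _, reads, _ = instrs[i]
        for back in range(1, min_gap + 1):
            if i - back >= 0 and instrs[i - back][2] & reads:
                stalls += 1
                break
    return stalls

good = [("movl", {"%edx"}, {"%ecx"}), ("sarl", {"%eax"}, {"%eax"}),
        ("addl", {"%ebx", "%ecx"}, {"%ecx"}), ("subl", {"%edx", "%eax"}, {"%eax"})]
bad  = [("movl", {"%edx"}, {"%ecx"}), ("addl", {"%ebx", "%ecx"}, {"%ecx"}),
        ("sarl", {"%eax"}, {"%eax"}), ("subl", {"%edx", "%eax"}, {"%eax"})]
print(count_stalls(good), count_stalls(bad))  # 0 vs 2, matching the slide
```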
57 reducing cycle time: can almost always reduce it further: break the slowest phase into two pieces, each taking roughly half the time of the original, and double the clock speed.
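Concretely: the clock period is set by the slowest phase, so splitting that phase in half roughly doubles the clock. The per-stage latencies below are invented for illustration:

```python
stages_ns = {"FET": 0.3, "DEC": 0.6, "OPF": 0.25, "EXEC": 0.3, "WB": 0.2}  # assumed
period = max(stages_ns.values())           # clock limited by the SLOWEST phase
print(1 / period, "GHz")                   # ~1.67 GHz

# split DEC into two half-length stages
stages_ns.pop("DEC")
stages_ns.update({"DEC1": 0.3, "DEC2": 0.3})
print(1 / max(stages_ns.values()), "GHz")  # ~3.33 GHz: the clock speed doubles
```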
58 RISC vs CISC: x86 is classic CISC (complex instruction set computer), with things like cmpl $4096,8(%edx,%eax,4). PowerPC is mainstream RISC (reduced instruction set computer): the only memory access is in load/store instructions; all operands must otherwise be in registers; it takes 4 instructions to do the single x86 instruction above.
59 CISC problems: CISC introduces many problems: complex instructions take longer, causing the pipeline cycle to be slower; they are harder to decode (more on that in a minute); compilers are too stupid to use most fancy instrs (array accessing is an exception); the hardware is too hard/expensive/flaky to design.
60 RISC in CISC clothing: x86 designers understand this problem, so the x86 core is really RISC: no complex instructions, all operands in registers, no fancy addressing modes. decode generates micro-instructions that look just like RISC instructions.
61 micro-instructions: look at one from the earlier worksheet
    leal -12(%ebp),%eax
    incl (%eax)
becomes 4 micro-instructions
    add $-12,%ebp,%eax
    load (%eax),regx
    add $1,regx,regx
    store regx,(%eax)
needs an extra register; makes decode even harder.
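A toy version of such a decoder can be table-driven. The micro-op spellings below (and the scratch register regx) are invented for illustration, not a real micro-instruction set:

```python
# Toy decode table: each x86 instruction maps to RISC-like micro-ops.
# Only the two patterns from the slide are handled; "regx" is a scratch
# register invented by the decoder.
MICRO = {
    "leal -12(%ebp),%eax": ["add $-12,%ebp,%eax"],
    "incl (%eax)":         ["load (%eax),regx",
                            "add $1,regx,regx",
                            "store regx,(%eax)"],
}

def decode(program):
    """Expand a list of x86 instructions into micro-instructions."""
    return [u for instr in program for u in MICRO[instr]]

print(decode(["leal -12(%ebp),%eax", "incl (%eax)"]))  # 4 micro-instructions
```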
62 decoding: there is a problem with long pipelines: short cycle times, but there is a cost. long decodes make it worse; the original pentium 4 had 9 steps in decode. what happens on a branch?
63-69 branches: look at the code
    cmpl $2,%eax   FET DEC OPF EXEC WB
    je L1              FET DEC OPF EXEC ...
    addl %eax,%edx         FET DEC ...
    subl $3,%edx               FET ...
the cpu cannot find the new PC until the je executes.
70 branch penalty: every branch causes a cycle delay, almost as bad as a memory access. all processors use branch prediction: guess where the branch will go based on previous execution and/or compiler hints. no penalty if correct, full penalty if wrong. current technology is right 90+% overall.
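The cost of branches can be folded into an average CPI. In the sketch below, the 90% accuracy is the slide's figure, while the 20% branch frequency and 3-cycle misprediction penalty are assumed example numbers:

```python
def effective_cpi(base_cpi, branch_frac, accuracy, penalty):
    """Average cycles per instruction when each mispredicted branch
    pays a full pipeline-refill penalty."""
    return base_cpi + branch_frac * (1 - accuracy) * penalty

print(effective_cpi(1.0, 0.20, 0.90, 3))  # modest slowdown at 90% accuracy
print(effective_cpi(1.0, 0.20, 0.50, 3))  # much worse if prediction barely works
```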
71 superscalar: the Pentium I was pipelined; many more transistors are available now, so what to do with them? how about multiple parts: multiple ALUs, multiple decoders. this is called a superscalar processor.
72 superscalar: a modern processor has 2-4 decode pipelines, all the same (or almost the same), so it finishes decoding 2-4 instructions per cycle, and all execute in their own ALU in parallel. 2-4 times faster if there are no stalls and branch prediction is perfect. makes writing good assembler much harder; compilers are becoming much more sophisticated.
73 no superscalar
    movl  FET DEC OPF EXE WB
    sarl      FET DEC OPF EXE WB
    addl          FET DEC OPF EXE WB
    subl              FET DEC OPF EXE WB
takes 4 cycles, ignoring the time to fill the pipeline.
74-77 superscalar (issuing two instructions per cycle)
    movl  FET DEC OPF EXE WB
    sarl  FET DEC OPF EXE WB
    addl      FET DEC stall OPF EXE WB
    subl      FET DEC stall OPF EXE WB
but now has stalls; with stalls it takes 3 cycles after the initial pipeline fill.
78 transistors everywhere: moore's law means smaller transistors, and each one is faster. if all else is even, faster transistors = a faster cpu, and a more power hungry cpu; fortunately smaller transistors use less power. high end processors were eating about 100W, and have for more than a decade; it had been slowly getting worse.
79 faster or smaller: can either use the extra transistors to make faster processors or to make smaller (cheaper) processors. intel (et al.) want maximum total revenue: either more expensive processors or selling more of them. x86 is sold mostly for real computers, not a high growth market now, so they need to justify expensive processors.
80 need for speed: the way to justify $ is a faster processor. pipelining (early 90s): work on multiple instructions at once, broken up by phase of execution; 3-4X performance improvement; limited by branching. a longer pipeline is faster but has worse problems with branching.
81 need for speed (2): superscalar (mid- to late-90s): work on multiple instructions at once in the same phase; adds extra decoders, ALUs, ...; +/- 50% performance improvement.
82-94 (not so) superscalar: limited by stalls. what matters is the number of instructions between calc and usage: for a 5 stage pipeline, you need 2-3 extra instructions to not stall. sarl produces a value, addl uses it:
    sarl   FET DEC OPF EXE WB
    addl   FET DEC OPF EXE WB
with three independent instructions in between, forwarding can now work:
    sarl   FET DEC OPF EXE WB
    ext1   FET DEC OPF EXE WB
    ext2       FET DEC OPF EXE WB
    ext3           FET DEC OPF EXE WB
    addl               FET DEC OPF EXE WB
95 out of order: compilers try to schedule instructions to avoid stalls. most CPUs now allow out of order execution: if an instruction stalls, one behind it may pass it in line, but only if there are no dependencies. somewhat controversial: it takes transistors (= power) for something that could be done by the compiler for no power.
96 ILP: both pipelining and superscalar use Instruction Level Parallelism: executing multiple instructions in parallel from the same program, or more precisely, the same thread of control. little additional improvement is left there because of the structure of typical code.
97 multi-processing: we are already pushing instructions as fast as possible and executing many instructions at once from a single process. the only thing left is to execute multiple processes at once, called multi-processing.
98 multi-processing limits: multi-processing requires sw support. some programs can do multiple things at once: multi-threaded programs (apache, photoshop, ...); next gen games are starting to be multi-threaded. the OS can multi-process different programs: mp3 player vs mail reader vs eclipse vs ...
99 on the cheap: we already have multiple decoders and ALUs and can do multiple things at once, but a single process does not have enough parallelism to use them. to execute 2 programs at once, we need a second register set, including the PC. this is the basis of Intel HyperThreading and similar technologies from competitors.
100 HyperThreading: suppose we had 3 decoders, 3 ALUs, ... and 2 register sets (including PCs). on average, a single process uses 1.5 instrs/cycle if it has 2 decoders, ALUs, ...: either one stalls for a cycle, or both stall every other cycle. sharing matches well, although sometimes both threads stall at once or both want 2 units at once, adding extra stalls; could get +/- 2.5 instructions per cycle.
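The slide's estimate can be restated as a naive throughput model (the 1.5 instrs/cycle and the 2.5 target are the slide's rough figures; the 0.5 conflict loss is an assumed way to land on that target):

```python
# Rough HyperThreading throughput model from the slide's numbers:
# one thread averages 1.5 instructions/cycle on its share of the units,
# so two threads sharing 3 units can approach 2 x 1.5 = 3.0, minus
# cycles where both threads stall or want more than their share.
single_ipc = 1.5
ideal_two_threads = 2 * single_ipc        # 3.0 if sharing were perfect
conflict_loss = 0.5                       # assumed loss from collisions
print(ideal_two_threads - conflict_loss)  # ~2.5 instructions/cycle, per slide
```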
101 HyperThreading problem: both processes share the cache, competing for that resource as well, and may not co-exist well. tends to work well for many multi-threaded programs, not as well for arbitrary multi-processing; in the worst case it may be slower than a single thread, since cache misses are VERY expensive. this definitely limits the gain.
102 more HT problems: stalls are bad for performance but good for power/heat: giving parts cycles off gives them a chance to cool. hyperthreading works each transistor harder and may generate 40% more heat than not. also, a security hole was discovered: one thread can determine what the other thread is doing, at least partially, from cache changes; a clever program can determine a crypto key.
103 multiple cores: multiple cores replicate the entire cpu path, from decoder through registers, even caches. almost like putting multiple cpu chips in the box, but it fits in one socket. the cores also usually share access to the FSB/BSB, which may aggravate memory bus contention for poorly cached programs.
104 multi-core advantages: multi-core is easy to design: just stick 2+ cores on one chip, no new work there. gets good cooling/power usage. can get a 2X performance gain for 2 cores, assuming two processes are waiting to run.
105 cache coherence: the problem is with the L1 caches: we now potentially have 2 copies of the same data. processor A could write address X, then processor B could read address X, but from 2 different caches; B could see the wrong answer if we are not careful. this is called the cache coherence problem.
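The stale-read scenario is easy to demonstrate with two dictionary "caches" over a shared "memory". This is a deliberately incoherent write-back design, written only to exhibit the bug the slide describes (all names invented):

```python
memory = {"X": 0}

class Cache:
    """A write-back cache with NO coherence: writes stay local."""
    def __init__(self):
        self.lines = {}
    def read(self, addr):
        if addr not in self.lines:       # miss: fill from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]
    def write(self, addr, value):
        self.lines[addr] = value         # dirty line, never pushed to memory

a, b = Cache(), Cache()
b.read("X")          # B caches the old value of X
a.write("X", 42)     # A writes X, but only into its own cache
print(b.read("X"))   # 0 -- B sees the stale value: the coherence problem
```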
106 snoopy: cache coherence is well studied; it is needed for any multi-processing system, and several approaches are defined. the most common is called snoopy: each cache snoops on the others, watching their r/w traffic. works well for single chip multi-processing, as long as it does not interfere.
107 single writer: the alternative is to allow either many readers of a memory location or a single writer. once a cache is written, all others invalidate their line; then we only need to check other caches on a miss (or never, with a write back cache).
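The single-writer rule can be sketched the same way: a write invalidates every other cache's copy, so later readers miss and refetch. This is only an illustration of the invalidation idea, not a full coherence protocol (it uses write-through for simplicity; all names invented):

```python
class CoherentCache:
    """Caches registered on a shared bus; a write invalidates all other copies."""
    bus = []                                   # every cache snooping the bus
    def __init__(self):
        self.lines = {}
        CoherentCache.bus.append(self)
    def read(self, addr, memory):
        if addr not in self.lines:             # miss: fill from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]
    def write(self, addr, value, memory):
        for other in CoherentCache.bus:        # single writer: invalidate others
            if other is not self:
                other.lines.pop(addr, None)
        self.lines[addr] = value
        memory[addr] = value                   # write-through for simplicity

mem = {"X": 0}
a, b = CoherentCache(), CoherentCache()
b.read("X", mem)
a.write("X", 42, mem)      # B's copy is invalidated
print(b.read("X", mem))    # 42 -- B misses and refetches the new value
```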
108 single writer: single writer is much easier for multi-chip, since it is hard to watch the cpu/l1 interface at a distance. it can be implemented so caches own lines when they are writing, tracking ownership outside any L1 cache, held at the first level of shared memory: L2 for multi-core, main memory in a fully distributed system (called a directory protocol in that case).
109 HT vs multi-core: multi-core is clearly superior to HT, but costs a lot more in transistors and $. can use both: HT is more appropriate for multi-threaded programs, multi-core for multi-processing; the i7 does this. look at the i7 and gpus next week.
More informationOverview of the MIPS Architecture: Part I. CS 161: Lecture 0 1/24/17
Overview of the MIPS Architecture: Part I CS 161: Lecture 0 1/24/17 Looking Behind the Curtain of Software The OS sits between hardware and user-level software, providing: Isolation (e.g., to give each
More informationComputer Architecture. Lecture 6.1: Fundamentals of
CS3350B Computer Architecture Winter 2015 Lecture 6.1: Fundamentals of Instructional Level Parallelism Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and
More informationPipelining, Branch Prediction, Trends
Pipelining, Branch Prediction, Trends 10.1-10.4 Topics 10.1 Quantitative Analyses of Program Execution 10.2 From CISC to RISC 10.3 Pipelining the Datapath Branch Prediction, Delay Slots 10.4 Overlapping
More informationModern Computer Architecture
Modern Computer Architecture Lecture2 Pipelining: Basic and Intermediate Concepts Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each
More informationMulticore and Parallel Processing
Multicore and Parallel Processing Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University P & H Chapter 4.10 11, 7.1 6 xkcd/619 2 Pitfall: Amdahl s Law Execution time after improvement
More informationMidnight Laundry. IC220 Set #19: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life. Return to Chapter 4
IC220 Set #9: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life Return to Chapter 4 Midnight Laundry Task order A B C D 6 PM 7 8 9 0 2 2 AM 2 Smarty Laundry Task order A B C D 6 PM
More informationSuperscalar Processors
Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance
More informationPipelining: Overview. CPSC 252 Computer Organization Ellen Walker, Hiram College
Pipelining: Overview CPSC 252 Computer Organization Ellen Walker, Hiram College Pipelining the Wash Divide into 4 steps: Wash, Dry, Fold, Put Away Perform the steps in parallel Wash 1 Wash 2, Dry 1 Wash
More informationSample Exam I PAC II ANSWERS
Sample Exam I PAC II ANSWERS Please answer questions 1 and 2 on this paper and put all other answers in the blue book. 1. True/False. Please circle the correct response. a. T In the C and assembly calling
More informationMultiple Issue ILP Processors. Summary of discussions
Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware
More informationCS 31: Intro to Systems ISAs and Assembly. Martin Gagné Swarthmore College February 7, 2017
CS 31: Intro to Systems ISAs and Assembly Martin Gagné Swarthmore College February 7, 2017 ANNOUNCEMENT All labs will meet in SCI 252 (the robot lab) tomorrow. Overview How to directly interact with hardware
More informationPipeline: Introduction
Pipeline: Introduction These slides are derived from: CSCE430/830 Computer Architecture course by Prof. Hong Jiang and Dave Patterson UCB Some figures and tables have been derived from : Computer System
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationComputer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining
Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Single-Cycle Design Problems Assuming fixed-period clock every instruction datapath uses one
More informationInstruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties
Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,
More informationMemory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky
Memory Hierarchy, Fully Associative Caches Instructor: Nick Riasanovsky Review Hazards reduce effectiveness of pipelining Cause stalls/bubbles Structural Hazards Conflict in use of datapath component Data
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationVon Neumann architecture. The first computers used a single fixed program (like a numeric calculator).
Microprocessors Von Neumann architecture The first computers used a single fixed program (like a numeric calculator). To change the program, one has to re-wire, re-structure, or re-design the computer.
More informationCrusoe Reference. What is Binary Translation. What is so hard about it? Thinking Outside the Box The Transmeta Crusoe Processor
Crusoe Reference Thinking Outside the Box The Transmeta Crusoe Processor 55:132/22C:160 High Performance Computer Architecture The Technology Behind Crusoe Processors--Low-power -Compatible Processors
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationPipelining. Parts of these slides are from the support material provided by W. Stallings
Pipelining Raul Queiroz Feitosa Parts of these slides are from the support material provided by W. Stallings Objective To present the Pipelining concept, its limitations and the techniques for performance
More informationComputer Systems Architecture Spring 2016
Computer Systems Architecture Spring 2016 Lecture 01: Introduction Shuai Wang Department of Computer Science and Technology Nanjing University [Adapted from Computer Architecture: A Quantitative Approach,
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationCS356 Unit 12a. Logic Circuits. Combinational Logic Gates BASIC HW. Processor Hardware Organization Pipelining
2a. 2a.2 CS356 Unit 2a Processor Hardware Organization Pipelining BASIC HW Logic Circuits 2a.3 Combinational Logic Gates 2a.4 logic Performs a specific function (mapping of input combinations to desired
More informationInstruction Set Architecture
CS:APP Chapter 4 Computer Architecture Instruction Set Architecture Randal E. Bryant Carnegie Mellon University http://csapp.cs.cmu.edu CS:APP Instruction Set Architecture Assembly Language View! Processor
More informationInstruction Set Architecture
CS:APP Chapter 4 Computer Architecture Instruction Set Architecture Randal E. Bryant Carnegie Mellon University http://csapp.cs.cmu.edu CS:APP Instruction Set Architecture Assembly Language View Processor
More informationSecond Part of the Course
CSC 2400: Computer Systems Towards the Hardware 1 Second Part of the Course Toward the hardware High-level language (C) assembly language machine language (IA-32) 2 High-Level Language g Make programming
More informationCS311 Lecture: Pipelining and Superscalar Architectures
Objectives: CS311 Lecture: Pipelining and Superscalar Architectures Last revised July 10, 2013 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as a result
More informationWhat is Superscalar? CSCI 4717 Computer Architecture. Why the drive toward Superscalar? What is Superscalar? (continued) In class exercise
CSCI 4717/5717 Computer Architecture Topic: Instruction Level Parallelism Reading: Stallings, Chapter 14 What is Superscalar? A machine designed to improve the performance of the execution of scalar instructions.
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationCS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture #22 CPU Design: Pipelining to Improve Performance II 2007-8-1 Scott Beamer, Instructor CS61C L22 CPU Design : Pipelining to Improve Performance
More informationCS 61C: Great Ideas in Computer Architecture. Lecture 13: Pipelining. Krste Asanović & Randy Katz
CS 61C: Great Ideas in Computer Architecture Lecture 13: Pipelining Krste Asanović & Randy Katz http://inst.eecs.berkeley.edu/~cs61c/fa17 RISC-V Pipeline Pipeline Control Hazards Structural Data R-type
More informationCHAPTER 8: CPU and Memory Design, Enhancement, and Implementation
CHAPTER 8: CPU and Memory Design, Enhancement, and Implementation The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 5th Edition, Irv Englander John
More informationAssembly I: Basic Operations. Jo, Heeseung
Assembly I: Basic Operations Jo, Heeseung Moving Data (1) Moving data: movl source, dest Move 4-byte ("long") word Lots of these in typical code Operand types Immediate: constant integer data - Like C
More informationLecture 40 - x86 Architecture. www-inst.eecs.berkeley.edu/~cs61c/
CS61C Machine Structures Lecture 40 - x86 Architecture 12/5/2007 John Wawrzynek (www.cs.berkeley.edu/~johnw) www-inst.eecs.berkeley.edu/~cs61c/ 1 Outline History of Intel x86 line. MIPS versus x86 Unusual
More informationASSEMBLY I: BASIC OPERATIONS. Jo, Heeseung
ASSEMBLY I: BASIC OPERATIONS Jo, Heeseung MOVING DATA (1) Moving data: movl source, dest Move 4-byte ("long") word Lots of these in typical code Operand types Immediate: constant integer data - Like C
More informationCS 31: Intro to Systems ISAs and Assembly. Kevin Webb Swarthmore College February 9, 2016
CS 31: Intro to Systems ISAs and Assembly Kevin Webb Swarthmore College February 9, 2016 Reading Quiz Overview How to directly interact with hardware Instruction set architecture (ISA) Interface between
More informationRISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.
COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped
More informationPipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.
Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview
More informationCS 31: Intro to Systems ISAs and Assembly. Kevin Webb Swarthmore College September 25, 2018
CS 31: Intro to Systems ISAs and Assembly Kevin Webb Swarthmore College September 25, 2018 Overview How to directly interact with hardware Instruction set architecture (ISA) Interface between programmer
More informationRAČUNALNIŠKEA COMPUTER ARCHITECTURE
RAČUNALNIŠKEA COMPUTER ARCHITECTURE 6 Central Processing Unit - CPU RA - 6 2018, Škraba, Rozman, FRI 6 Central Processing Unit - objectives 6 Central Processing Unit objectives and outcomes: A basic understanding
More informationComplex Pipelines and Branch Prediction
Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle
More informationProcessor Performance and Parallelism Y. K. Malaiya
Processor Performance and Parallelism Y. K. Malaiya Processor Execution time The time taken by a program to execute is the product of n Number of machine instructions executed n Number of clock cycles
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationAssembly Language: Overview!
Assembly Language: Overview! 1 Goals of this Lecture! Help you learn:" The basics of computer architecture" The relationship between C and assembly language" IA-32 assembly language, through an example"
More informationFinal Exam Fall 2007
ICS 233 - Computer Architecture & Assembly Language Final Exam Fall 2007 Wednesday, January 23, 2007 7:30 am 10:00 am Computer Engineering Department College of Computer Sciences & Engineering King Fahd
More informationLecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1
Lecture 3 Pipelining Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair)
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More information12.1. CS356 Unit 12. Processor Hardware Organization Pipelining
12.1 CS356 Unit 12 Processor Hardware Organization Pipelining BASIC HW 12.2 Inputs Outputs 12.3 Logic Circuits Combinational logic Performs a specific function (mapping of 2 n input combinations to desired
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More information