administrivia final hour exam next Wednesday covers assembly language like hw and worksheets

- Alaina Hawkins
1 administrivia: final hour exam next Wednesday; covers assembly language (like hw and worksheets). today: last worksheet, then start looking at more details on hardware, which is not covered on ANY exam. probably won't finish these slides today. any questions on the assignment?
2 more architecture: remember how the cpu executes instructions? multiple simple steps...
3-10 [CPU (logical) diagram, built up over several slides: registers eax ebx ecx edx ... EBP SP PC, a Decode unit, an ALU, a Mem Buffer, and Memory. The animation steps the instruction add %eax,%ebx through the PHASES: FETCH the instruction, DECODE it, OPFETCH the operands (eax = 3, ebx = 4), EXECUTE the add in the ALU (3 + 4 = 7), and WRITEBACK the result 7 into ebx.]
11 computer performance: a modern processor runs at multiple GHz, billions of cycles per second. that says the clock cycle is < 1 ns, less than a billionth of a second; even silicon cannot do much in that time. the cpu only executes one step per cycle, and takes multiple cycles to execute one instruction.
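The slide's arithmetic can be checked in a couple of lines (a minimal sketch; the 3 GHz clock is an assumed example, not a figure from the slide):

```python
# Clock cycle time for an assumed 3 GHz processor.
freq_hz = 3e9                          # 3 billion cycles per second (assumed example)
cycle_s = 1.0 / freq_hz                # seconds per cycle
cycle_ns = cycle_s * 1e9               # nanoseconds per cycle
print(f"{cycle_ns:.3f} ns per cycle")  # well under 1 ns, as the slide says
```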
12-14 overall performance: on the other hand, the processor does MORE than one add per cycle. doesn't that contradict the previous slide? no, because computer designers are clever.
15 overlapping instructions: one set of transistors can only do one thing in one cycle, but the cpu has LOTS of transistors, so it can do lots of things at once: work on multiple instructions at once.
16 washing: consider doing wash with 1 washer / 1 dryer. if each takes 45 minutes, it takes 1.5 hours to do 1 load, maybe 2 hours if you count pre-treating/sorting and folding/hanging. it does not take 6 hours to do 3 loads!
17-18 overlap washing steps: it takes 2 hours for the first load to be done, but each extra load only takes 45 minutes more. if you had 1000 loads you would think of it as taking 45 minutes per load (and would really hate laundry!).
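The laundry numbers above can be sketched directly (assuming the slide's 45-minute stage time and roughly 2 hours for the first complete load):

```python
def laundry_time_minutes(loads, stage_min=45, first_load_min=120):
    """Pipelined laundry: the first load takes the full 2 hours;
    each additional load finishes 45 minutes after the previous one."""
    if loads == 0:
        return 0
    return first_load_min + (loads - 1) * stage_min

print(laundry_time_minutes(3) / 60)     # 3 loads: 3.5 hours, far less than 6
print(laundry_time_minutes(1000) / 60)  # ~750 hours: ~45 min per load amortized
```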
19 code example: consider the code
    movl %edx,%ecx
    sarl $4,%eax
    addl %ebx,%ecx
    subl %edx,%eax
20 code example: the first instruction must execute
    movl %edx,%ecx   FET DEC OPF EXEC WB
21-26 code example: the second instruction can start soon after; there is never competition for the same transistors
    movl %edx,%ecx   FET DEC OPF EXEC WB
    sarl $4,%eax         FET DEC OPF EXEC WB
27-33 example: the third instruction follows suit
    movl %edx,%ecx   FET DEC OPF EXEC WB
    sarl $4,%eax         FET DEC OPF EXEC WB
    addl %ebx,%ecx           FET DEC OPF EXEC WB
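The stage tables on these slides can be generated mechanically. A minimal sketch of an ideal 5-stage pipeline schedule, where each instruction enters FET one cycle after its predecessor (no hazards modeled):

```python
STAGES = ["FET", "DEC", "OPF", "EXEC", "WB"]

def schedule(instrs):
    """Return {instruction: {cycle: stage}} for an ideal pipeline
    where instruction i enters FET at cycle i."""
    table = {}
    for i, ins in enumerate(instrs):
        table[ins] = {i + c: stage for c, stage in enumerate(STAGES)}
    return table

code = ["movl %edx,%ecx", "sarl $4,%eax", "addl %ebx,%ecx", "subl %edx,%eax"]
for ins, cells in schedule(code).items():
    row = "".join(f"{cells.get(c, ''):>6}" for c in range(len(code) + len(STAGES) - 1))
    print(f"{ins:18}{row}")
```

Four instructions finish in 8 cycles total instead of 20, and once the pipeline is full one instruction completes every cycle.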
34-41 [datapath diagrams, one per cycle: first movl in FET; then movl in DEC while sarl is in FET; then movl in OPF, sarl in DEC, addl in FET; then movl in EXEC, sarl in OPF, addl in DEC, subl in FET. the registers eax ebx ecx edx ... ebp esp PC hold values a b c d, and the instructions flow through the Decode unit, ALU, and Mem Buffer.]
42 pipelining: this overlapping of instructions is called pipelining. done in all CPUs for the last 15 years or so; a big part of the speed up. the clock speed is limited by the SLOWEST phase.
43-45 hazard: anyone see a problem here?
    movl %edx,%ecx   FET DEC OPF EXEC WB    (writes %ecx)
    sarl $4,%eax         FET DEC OPF EXEC WB
    addl %ebx,%ecx           FET DEC OPF EXEC WB    (reads %ecx)
addl fetches %ecx in OPF before movl has written it back in WB.
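The problem here is a read-after-write (RAW) dependence, and it can be detected from register names alone. A small sketch (the hand-annotated read/write sets and the two-instruction danger window are illustrative simplifications of the 5-stage pipeline):

```python
def raw_hazards(instrs):
    """Find read-after-write hazards: a later instruction reads a register
    that an earlier instruction is still in the pipeline writing.
    Each instruction is (text, reads, writes), hand-annotated here."""
    hazards = []
    for i, (_, _, writes) in enumerate(instrs):
        # in a 5-stage pipeline, the next couple of instructions fetch
        # operands before this instruction's WB updates the register
        for j in range(i + 1, min(i + 3, len(instrs))):
            clash = writes & instrs[j][1]
            if clash:
                hazards.append((instrs[i][0], instrs[j][0], clash))
    return hazards

prog = [
    ("movl %edx,%ecx", {"%edx"}, {"%ecx"}),
    ("sarl $4,%eax",   {"%eax"}, {"%eax"}),
    ("addl %ebx,%ecx", {"%ebx", "%ecx"}, {"%ecx"}),
]
print(raw_hazards(prog))  # movl writes %ecx; addl, two slots later, reads it
```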
46-48 [datapath diagrams: movl in WB, sarl in EXEC, addl in OPF, subl in DEC. the ALU computes c+d and the register file's ecx is updated from c to c+d, while the dependent instruction has already fetched its operands.]
49 forwarding: special hardware in opfetch reads the result when needed; guarantees the correct result.
50-52 [datapath diagrams: movl in WB, sarl in EXEC, addl in OPF, subl in DEC. the c+d value coming out of the ALU is forwarded directly to the opfetch stage of the dependent instruction, and is also written back into ecx.]
53 stalls: can still stall if execute is not finished; must wait for the value to be computed. the compiler schedules instructions to avoid these stalls.
54-56 stalls: the code example had no stalls
    movl %edx,%ecx
    sarl $4,%eax
    addl %ebx,%ecx
    subl %edx,%eax
what if reordered? (to a more natural ordering)
    movl %edx,%ecx
    addl %ebx,%ecx   (stall on ecx)
    sarl $4,%eax
    subl %edx,%eax   (stall on eax)
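The difference between the two orderings can be verified by counting stalls. A sketch of the lecture's model, where (with forwarding) an instruction stalls only when it reads a value produced by the immediately preceding instruction; read/write sets are hand-annotated:

```python
def count_stalls(instrs, min_gap=1):
    """Count stalls assuming forwarding: an instruction that reads a register
    written by one of the previous min_gap instructions must wait."""
    stalls = 0
    for i in range(1, len(instrs)):
        _, reads, _ = instrs[i]
        for back in range(1, min_gap + 1):
            if i - back >= 0 and instrs[i - back][2] & reads:
                stalls += 1
                break
    return stalls

good = [("movl", {"%edx"}, {"%ecx"}), ("sarl", {"%eax"}, {"%eax"}),
        ("addl", {"%ebx", "%ecx"}, {"%ecx"}), ("subl", {"%edx", "%eax"}, {"%eax"})]
bad  = [("movl", {"%edx"}, {"%ecx"}), ("addl", {"%ebx", "%ecx"}, {"%ecx"}),
        ("sarl", {"%eax"}, {"%eax"}), ("subl", {"%edx", "%eax"}, {"%eax"})]
print(count_stalls(good), count_stalls(bad))  # 0 vs 2, matching the slide
```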
57 reducing cycle time: can almost always reduce it further: break the slowest phase into two pieces, each taking roughly half the time of the original, and double the clock speed.
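Concretely: the clock period is set by the slowest phase, so splitting that phase in half roughly doubles the clock. The per-stage latencies below are invented for illustration:

```python
stages_ns = {"FET": 0.3, "DEC": 0.6, "OPF": 0.25, "EXEC": 0.3, "WB": 0.2}  # assumed
period = max(stages_ns.values())           # clock limited by the SLOWEST phase
print(1 / period, "GHz")                   # ~1.67 GHz

# split DEC into two half-length stages
stages_ns.pop("DEC")
stages_ns.update({"DEC1": 0.3, "DEC2": 0.3})
print(1 / max(stages_ns.values()), "GHz")  # ~3.33 GHz: the clock speed doubles
```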
58 RISC vs CISC: x86 is classic CISC (complex instruction set computer), with things like cmpl $4096,8(%edx,%eax,4). PowerPC is mainstream RISC (reduced instruction set computer): the only memory access is in load/store instructions; all operands must otherwise be in registers; it takes 4 instructions to do the single x86 instruction above.
59 CISC problems: CISC introduces many problems: complex instructions take longer, causing the pipeline cycle to be slower; they are harder to decode (more on that in a minute); compilers are too stupid to use most fancy instrs (array accessing is an exception); the hardware is too hard/expensive/flaky to design.
60 RISC in CISC clothing: x86 designers understand this problem, so the x86 core is really RISC: no complex instructions, all operands in registers, no fancy addressing modes. decode generates micro-instructions that look just like RISC instructions.
61 micro-instructions: look at one from the earlier worksheet
    leal -12(%ebp),%eax
    incl (%eax)
becomes 4 micro-instructions
    add $-12,%ebp,%eax
    load (%eax),regx
    add $1,regx,regx
    store regx,(%eax)
needs an extra register; makes decode even harder.
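A toy version of such a decoder can be table-driven. The micro-op spellings below (and the scratch register regx) are invented for illustration, not a real micro-instruction set:

```python
# Toy decode table: each x86 instruction maps to RISC-like micro-ops.
# Only the two patterns from the slide are handled; "regx" is a scratch
# register invented by the decoder.
MICRO = {
    "leal -12(%ebp),%eax": ["add $-12,%ebp,%eax"],
    "incl (%eax)":         ["load (%eax),regx",
                            "add $1,regx,regx",
                            "store regx,(%eax)"],
}

def decode(program):
    """Expand a list of x86 instructions into micro-instructions."""
    return [u for instr in program for u in MICRO[instr]]

print(decode(["leal -12(%ebp),%eax", "incl (%eax)"]))  # 4 micro-instructions
```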
62 decoding: there is a problem with long pipelines: short cycle times, but there is a cost. long decodes make it worse; the original pentium 4 had 9 steps in decode. what happens on a branch?
63-69 branches: look at the code
    cmpl $2,%eax   FET DEC OPF EXEC WB
    je L1              FET DEC OPF EXEC ...
    addl %eax,%edx         FET DEC ...
    subl $3,%edx               FET ...
the cpu cannot find the new PC until the je executes.
70 branch penalty: every branch causes a cycle delay, almost as bad as a memory access. all processors use branch prediction: guess where the branch will go based on previous execution and/or compiler hints. no penalty if correct, full penalty if wrong. current technology is right 90+% overall.
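The cost of branches can be folded into an average CPI. In the sketch below, the 90% accuracy is the slide's figure, while the 20% branch frequency and 3-cycle misprediction penalty are assumed example numbers:

```python
def effective_cpi(base_cpi, branch_frac, accuracy, penalty):
    """Average cycles per instruction when each mispredicted branch
    pays a full pipeline-refill penalty."""
    return base_cpi + branch_frac * (1 - accuracy) * penalty

print(effective_cpi(1.0, 0.20, 0.90, 3))  # modest slowdown at 90% accuracy
print(effective_cpi(1.0, 0.20, 0.50, 3))  # much worse if prediction barely works
```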
71 superscalar: the Pentium I was pipelined; many more transistors are available now, so what to do with them? how about multiple parts: multiple ALUs, multiple decoders. this is called a superscalar processor.
72 superscalar: a modern processor has 2-4 decode pipelines, all the same (or almost the same), so it finishes decoding 2-4 instructions per cycle, and all execute in their own ALU in parallel. 2-4 times faster if there are no stalls and branch prediction is perfect. makes writing good assembler much harder; compilers are becoming much more sophisticated.
73 no superscalar
    movl  FET DEC OPF EXE WB
    sarl      FET DEC OPF EXE WB
    addl          FET DEC OPF EXE WB
    subl              FET DEC OPF EXE WB
takes 4 cycles, ignoring the time to fill the pipeline.
74-77 superscalar (issuing two instructions per cycle)
    movl  FET DEC OPF EXE WB
    sarl  FET DEC OPF EXE WB
    addl      FET DEC stall OPF EXE WB
    subl      FET DEC stall OPF EXE WB
but now has stalls; with stalls it takes 3 cycles after the initial pipeline fill.
78 transistors everywhere: moore's law means smaller transistors, and each one is faster. if all else is even, faster transistors = a faster cpu, and a more power hungry cpu; fortunately smaller transistors use less power. high end processors were eating about 100W, and have for more than a decade; it had been slowly getting worse.
79 faster or smaller: can either use the extra transistors to make faster processors or to make smaller (cheaper) processors. intel (et al.) want maximum total revenue: either more expensive processors or selling more of them. x86 is sold mostly for real computers, not a high growth market now, so they need to justify expensive processors.
80 need for speed: the way to justify $ is a faster processor. pipelining (early 90s): work on multiple instructions at once, broken up by phase of execution; 3-4X performance improvement; limited by branching. a longer pipeline is faster but has worse problems with branching.
81 need for speed (2): superscalar (mid- to late-90s): work on multiple instructions at once in the same phase; adds extra decoders, ALUs, ...; +/- 50% performance improvement.
82-94 (not so) superscalar: limited by stalls. what matters is the number of instructions between calc and usage: for a 5 stage pipeline, you need 2-3 extra instructions to not stall. sarl produces a value, addl uses it:
    sarl   FET DEC OPF EXE WB
    addl   FET DEC OPF EXE WB
with three independent instructions in between, forwarding can now work:
    sarl   FET DEC OPF EXE WB
    ext1   FET DEC OPF EXE WB
    ext2       FET DEC OPF EXE WB
    ext3           FET DEC OPF EXE WB
    addl               FET DEC OPF EXE WB
95 out of order: compilers try to schedule instructions to avoid stalls. most CPUs now allow out of order execution: if an instruction stalls, one behind it may pass it in line, but only if there are no dependencies. somewhat controversial: it takes transistors (= power) for something that could be done by the compiler for no power.
96 ILP: both pipelining and superscalar use Instruction Level Parallelism: executing multiple instructions in parallel from the same program, or more precisely, the same thread of control. little additional improvement is left there because of the structure of typical code.
97 multi-processing: we are already pushing instructions as fast as possible and executing many instructions at once from a single process. the only thing left is to execute multiple processes at once, called multi-processing.
98 multi-processing limits: multi-processing requires sw support. some programs can do multiple things at once: multi-threaded programs (apache, photoshop, ...); next gen games are starting to be multi-threaded. the OS can multi-process different programs: mp3 player vs mail reader vs eclipse vs ...
99 on the cheap: we already have multiple decoders and ALUs and can do multiple things at once, but a single process does not have enough parallelism to use them. to execute 2 programs at once, we need a second register set, including the PC. this is the basis of Intel HyperThreading and similar technologies from competitors.
100 HyperThreading: suppose we had 3 decoders, 3 ALUs, ... and 2 register sets (including PCs). on average, a single process uses 1.5 instrs/cycle if it has 2 decoders, ALUs, ...: either one stalls for a cycle, or both stall every other cycle. sharing matches well, although sometimes both threads stall at once or both want 2 units at once, adding extra stalls; could get +/- 2.5 instructions per cycle.
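The slide's estimate can be restated as a naive throughput model (the 1.5 instrs/cycle and the 2.5 target are the slide's rough figures; the 0.5 conflict loss is an assumed way to land on that target):

```python
# Rough HyperThreading throughput model from the slide's numbers:
# one thread averages 1.5 instructions/cycle on its share of the units,
# so two threads sharing 3 units can approach 2 x 1.5 = 3.0, minus
# cycles where both threads stall or want more than their share.
single_ipc = 1.5
ideal_two_threads = 2 * single_ipc        # 3.0 if sharing were perfect
conflict_loss = 0.5                       # assumed loss from collisions
print(ideal_two_threads - conflict_loss)  # ~2.5 instructions/cycle, per slide
```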
101 HyperThreading problem: both processes share the cache, competing for that resource as well, and may not co-exist well. tends to work well for many multi-threaded programs, not as well for arbitrary multi-processing; in the worst case it may be slower than a single thread, since cache misses are VERY expensive. this definitely limits the gain.
102 more HT problems: stalls are bad for performance but good for power/heat: giving parts cycles off gives them a chance to cool. hyperthreading works each transistor harder and may generate 40% more heat than not. also, a security hole was discovered: one thread can determine what the other thread is doing, at least partially, from cache changes; a clever program can determine a crypto key.
103 multiple cores: multiple cores replicate the entire cpu path, from decoder through registers, even caches. almost like putting multiple cpu chips in the box, but it fits in one socket. the cores also usually share access to the FSB/BSB, which may aggravate memory bus contention for poorly cached programs.
104 multi-core advantages: multi-core is easy to design: just stick 2+ cores on one chip, no new work there. gets good cooling/power usage. can get a 2X performance gain for 2 cores, assuming two processes are waiting to run.
105 cache coherence: the problem is with the L1 caches: we now potentially have 2 copies of the same data. processor A could write address X, then processor B could read address X, but from 2 different caches; B could see the wrong answer if we are not careful. this is called the cache coherence problem.
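The stale-read scenario is easy to demonstrate with two dictionary "caches" over a shared "memory". This is a deliberately incoherent write-back design, written only to exhibit the bug the slide describes (all names invented):

```python
memory = {"X": 0}

class Cache:
    """A write-back cache with NO coherence: writes stay local."""
    def __init__(self):
        self.lines = {}
    def read(self, addr):
        if addr not in self.lines:       # miss: fill from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]
    def write(self, addr, value):
        self.lines[addr] = value         # dirty line, never pushed to memory

a, b = Cache(), Cache()
b.read("X")          # B caches the old value of X
a.write("X", 42)     # A writes X, but only into its own cache
print(b.read("X"))   # 0 -- B sees the stale value: the coherence problem
```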
106 snoopy: cache coherence is well studied; it is needed for any multi-processing system, and several approaches are defined. the most common is called snoopy: each cache snoops on the others, watching their r/w traffic. works well for single chip multi-processing, as long as it does not interfere.
107 single writer: the alternative is to allow either many readers of a memory location or a single writer. once a cache is written, all others invalidate their line; then we only need to check other caches on a miss (or never, with a write back cache).
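The single-writer rule can be sketched the same way: a write invalidates every other cache's copy, so later readers miss and refetch. This is only an illustration of the invalidation idea, not a full coherence protocol (it uses write-through for simplicity; all names invented):

```python
class CoherentCache:
    """Caches registered on a shared bus; a write invalidates all other copies."""
    bus = []                                   # every cache snooping the bus
    def __init__(self):
        self.lines = {}
        CoherentCache.bus.append(self)
    def read(self, addr, memory):
        if addr not in self.lines:             # miss: fill from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]
    def write(self, addr, value, memory):
        for other in CoherentCache.bus:        # single writer: invalidate others
            if other is not self:
                other.lines.pop(addr, None)
        self.lines[addr] = value
        memory[addr] = value                   # write-through for simplicity

mem = {"X": 0}
a, b = CoherentCache(), CoherentCache()
b.read("X", mem)
a.write("X", 42, mem)      # B's copy is invalidated
print(b.read("X", mem))    # 42 -- B misses and refetches the new value
```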
108 single writer: single writer is much easier for multi-chip, since it is hard to watch the cpu/l1 interface at a distance. it can be implemented so caches own lines when they are writing, tracking ownership outside any L1 cache, held at the first level of shared memory: L2 for multi-core, main memory in a fully distributed system (called a directory protocol in that case).
109 HT vs multi-core: multi-core is clearly superior to HT, but costs a lot more in transistors and $. can use both: HT is more appropriate for multi-threaded programs, multi-core for multi-processing; the i7 does this. look at the i7 and gpus next week.
More informationOverview of the MIPS Architecture: Part I. CS 161: Lecture 0 1/24/17
Overview of the MIPS Architecture: Part I CS 161: Lecture 0 1/24/17 Looking Behind the Curtain of Software The OS sits between hardware and user-level software, providing: Isolation (e.g., to give each
More informationComputer Architecture. Lecture 6.1: Fundamentals of
CS3350B Computer Architecture Winter 2015 Lecture 6.1: Fundamentals of Instructional Level Parallelism Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and
More informationPipelining, Branch Prediction, Trends
Pipelining, Branch Prediction, Trends 10.1-10.4 Topics 10.1 Quantitative Analyses of Program Execution 10.2 From CISC to RISC 10.3 Pipelining the Datapath Branch Prediction, Delay Slots 10.4 Overlapping
More informationModern Computer Architecture
Modern Computer Architecture Lecture2 Pipelining: Basic and Intermediate Concepts Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each
More informationMulticore and Parallel Processing
Multicore and Parallel Processing Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University P & H Chapter 4.10 11, 7.1 6 xkcd/619 2 Pitfall: Amdahl s Law Execution time after improvement
More informationMidnight Laundry. IC220 Set #19: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life. Return to Chapter 4
IC220 Set #9: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life Return to Chapter 4 Midnight Laundry Task order A B C D 6 PM 7 8 9 0 2 2 AM 2 Smarty Laundry Task order A B C D 6 PM
More informationSuperscalar Processors
Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance
More informationPipelining: Overview. CPSC 252 Computer Organization Ellen Walker, Hiram College
Pipelining: Overview CPSC 252 Computer Organization Ellen Walker, Hiram College Pipelining the Wash Divide into 4 steps: Wash, Dry, Fold, Put Away Perform the steps in parallel Wash 1 Wash 2, Dry 1 Wash
More informationSample Exam I PAC II ANSWERS
Sample Exam I PAC II ANSWERS Please answer questions 1 and 2 on this paper and put all other answers in the blue book. 1. True/False. Please circle the correct response. a. T In the C and assembly calling
More informationMultiple Issue ILP Processors. Summary of discussions
Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware
More informationCS 31: Intro to Systems ISAs and Assembly. Martin Gagné Swarthmore College February 7, 2017
CS 31: Intro to Systems ISAs and Assembly Martin Gagné Swarthmore College February 7, 2017 ANNOUNCEMENT All labs will meet in SCI 252 (the robot lab) tomorrow. Overview How to directly interact with hardware
More informationPipeline: Introduction
Pipeline: Introduction These slides are derived from: CSCE430/830 Computer Architecture course by Prof. Hong Jiang and Dave Patterson UCB Some figures and tables have been derived from : Computer System
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationComputer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining
Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Single-Cycle Design Problems Assuming fixed-period clock every instruction datapath uses one
More informationInstruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties
Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,
More informationMemory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky
Memory Hierarchy, Fully Associative Caches Instructor: Nick Riasanovsky Review Hazards reduce effectiveness of pipelining Cause stalls/bubbles Structural Hazards Conflict in use of datapath component Data
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationVon Neumann architecture. The first computers used a single fixed program (like a numeric calculator).
Microprocessors Von Neumann architecture The first computers used a single fixed program (like a numeric calculator). To change the program, one has to re-wire, re-structure, or re-design the computer.
More informationCrusoe Reference. What is Binary Translation. What is so hard about it? Thinking Outside the Box The Transmeta Crusoe Processor
Crusoe Reference Thinking Outside the Box The Transmeta Crusoe Processor 55:132/22C:160 High Performance Computer Architecture The Technology Behind Crusoe Processors--Low-power -Compatible Processors
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationPipelining. Parts of these slides are from the support material provided by W. Stallings
Pipelining Raul Queiroz Feitosa Parts of these slides are from the support material provided by W. Stallings Objective To present the Pipelining concept, its limitations and the techniques for performance
More informationComputer Systems Architecture Spring 2016
Computer Systems Architecture Spring 2016 Lecture 01: Introduction Shuai Wang Department of Computer Science and Technology Nanjing University [Adapted from Computer Architecture: A Quantitative Approach,
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationCS356 Unit 12a. Logic Circuits. Combinational Logic Gates BASIC HW. Processor Hardware Organization Pipelining
2a. 2a.2 CS356 Unit 2a Processor Hardware Organization Pipelining BASIC HW Logic Circuits 2a.3 Combinational Logic Gates 2a.4 logic Performs a specific function (mapping of input combinations to desired
More informationInstruction Set Architecture
CS:APP Chapter 4 Computer Architecture Instruction Set Architecture Randal E. Bryant Carnegie Mellon University http://csapp.cs.cmu.edu CS:APP Instruction Set Architecture Assembly Language View! Processor
More informationInstruction Set Architecture
CS:APP Chapter 4 Computer Architecture Instruction Set Architecture Randal E. Bryant Carnegie Mellon University http://csapp.cs.cmu.edu CS:APP Instruction Set Architecture Assembly Language View Processor
More informationSecond Part of the Course
CSC 2400: Computer Systems Towards the Hardware 1 Second Part of the Course Toward the hardware High-level language (C) assembly language machine language (IA-32) 2 High-Level Language g Make programming
More informationCS311 Lecture: Pipelining and Superscalar Architectures
Objectives: CS311 Lecture: Pipelining and Superscalar Architectures Last revised July 10, 2013 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as a result
More informationWhat is Superscalar? CSCI 4717 Computer Architecture. Why the drive toward Superscalar? What is Superscalar? (continued) In class exercise
CSCI 4717/5717 Computer Architecture Topic: Instruction Level Parallelism Reading: Stallings, Chapter 14 What is Superscalar? A machine designed to improve the performance of the execution of scalar instructions.
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationCS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture #22 CPU Design: Pipelining to Improve Performance II 2007-8-1 Scott Beamer, Instructor CS61C L22 CPU Design : Pipelining to Improve Performance
More informationCS 61C: Great Ideas in Computer Architecture. Lecture 13: Pipelining. Krste Asanović & Randy Katz
CS 61C: Great Ideas in Computer Architecture Lecture 13: Pipelining Krste Asanović & Randy Katz http://inst.eecs.berkeley.edu/~cs61c/fa17 RISC-V Pipeline Pipeline Control Hazards Structural Data R-type
More informationCHAPTER 8: CPU and Memory Design, Enhancement, and Implementation
CHAPTER 8: CPU and Memory Design, Enhancement, and Implementation The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 5th Edition, Irv Englander John
More informationAssembly I: Basic Operations. Jo, Heeseung
Assembly I: Basic Operations Jo, Heeseung Moving Data (1) Moving data: movl source, dest Move 4-byte ("long") word Lots of these in typical code Operand types Immediate: constant integer data - Like C
More informationLecture 40 - x86 Architecture. www-inst.eecs.berkeley.edu/~cs61c/
CS61C Machine Structures Lecture 40 - x86 Architecture 12/5/2007 John Wawrzynek (www.cs.berkeley.edu/~johnw) www-inst.eecs.berkeley.edu/~cs61c/ 1 Outline History of Intel x86 line. MIPS versus x86 Unusual
More informationASSEMBLY I: BASIC OPERATIONS. Jo, Heeseung
ASSEMBLY I: BASIC OPERATIONS Jo, Heeseung MOVING DATA (1) Moving data: movl source, dest Move 4-byte ("long") word Lots of these in typical code Operand types Immediate: constant integer data - Like C
More informationCS 31: Intro to Systems ISAs and Assembly. Kevin Webb Swarthmore College February 9, 2016
CS 31: Intro to Systems ISAs and Assembly Kevin Webb Swarthmore College February 9, 2016 Reading Quiz Overview How to directly interact with hardware Instruction set architecture (ISA) Interface between
More informationRISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.
COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped
More informationPipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.
Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview
More informationCS 31: Intro to Systems ISAs and Assembly. Kevin Webb Swarthmore College September 25, 2018
CS 31: Intro to Systems ISAs and Assembly Kevin Webb Swarthmore College September 25, 2018 Overview How to directly interact with hardware Instruction set architecture (ISA) Interface between programmer
More informationRAČUNALNIŠKEA COMPUTER ARCHITECTURE
RAČUNALNIŠKEA COMPUTER ARCHITECTURE 6 Central Processing Unit - CPU RA - 6 2018, Škraba, Rozman, FRI 6 Central Processing Unit - objectives 6 Central Processing Unit objectives and outcomes: A basic understanding
More informationComplex Pipelines and Branch Prediction
Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle
More informationProcessor Performance and Parallelism Y. K. Malaiya
Processor Performance and Parallelism Y. K. Malaiya Processor Execution time The time taken by a program to execute is the product of n Number of machine instructions executed n Number of clock cycles
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationAssembly Language: Overview!
Assembly Language: Overview! 1 Goals of this Lecture! Help you learn:" The basics of computer architecture" The relationship between C and assembly language" IA-32 assembly language, through an example"
More informationFinal Exam Fall 2007
ICS 233 - Computer Architecture & Assembly Language Final Exam Fall 2007 Wednesday, January 23, 2007 7:30 am 10:00 am Computer Engineering Department College of Computer Sciences & Engineering King Fahd
More informationLecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1
Lecture 3 Pipelining Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair)
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More information12.1. CS356 Unit 12. Processor Hardware Organization Pipelining
12.1 CS356 Unit 12 Processor Hardware Organization Pipelining BASIC HW 12.2 Inputs Outputs 12.3 Logic Circuits Combinational logic Performs a specific function (mapping of 2 n input combinations to desired
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More information