CoE - ECE 0142 Computer Organization. Instructions: Language of the Computer


CoE - ECE 0142 Computer Organization. Instructions: Language of the Computer

The Stored Program Concept The stored program concept says that the program is stored with data in the computer's memory. The computer is able to manipulate it as data, for example, to load it from disk, move it in memory, and store it back on disk. It is the basic operating principle for every computer. It is so common that it is taken for granted. Without it, every instruction would have to be initiated manually.

The Fetch-Execute Cycle (Fig. 1.2)

Machine, Processor, and Memory State The Machine State: contents of all registers in the system, accessible to the programmer or not. The Processor State: registers internal to the CPU. The Memory State: contents of the memory system. "State" is used in the formal finite state machine sense. Maintaining or restoring the machine and processor state is important to many operations, especially procedure calls and interrupts.

Instruction set architecture (ISA) [diagram: Software sits above the ISA, which sits above the Hardware]

MIPS In this class, we'll use the MIPS instruction set architecture (ISA) to illustrate concepts in assembly language and machine organization. Of course, the concepts are not MIPS-specific; MIPS is just convenient because it is real, yet simple (unlike x86). The MIPS ISA is still used in many places today, primarily in embedded systems, like: various routers from Cisco, and game machines like the Nintendo 64 and Sony PlayStation 2. You must become fluent in MIPS assembly: translate from C to MIPS and MIPS to C.

MIPS: register-to-register, three address MIPS is a register-to-register, or load/store, architecture. The destination and sources must all be registers. Special instructions, which we'll see later, are needed to access main memory. MIPS uses three-address instructions for data manipulation. Each ALU instruction contains a destination and two sources. For example, an addition instruction (a = b + c) has the form: add a, b, c, where the first operand is the destination and the other two are the sources.

MIPS register names MIPS register names begin with a $. There are two naming conventions: by number: $0, $1, $2, $3, ...; or by (mostly) two-character names, such as: $a0-$a3, $s0-$s7, $t0-$t9, $sp, $ra. Not all of the registers are equivalent: e.g., register $0, or $zero, always contains the value 0 (go ahead, try to change it). Other registers have special uses, by convention: e.g., register $sp is used to hold the stack pointer. You have to be a little careful in picking registers for your programs.

Policy of Use Conventions
Name       Register number   Usage
$zero      0                 the constant value 0
$at        1                 assembler temporary
$v0-$v1    2-3               values for results and expression evaluation
$a0-$a3    4-7               arguments
$t0-$t7    8-15              temporaries
$s0-$s7    16-23             saved temporaries
$t8-$t9    24-25             more temporaries
$k0-$k1    26-27             reserved for OS kernel
$gp        28                global pointer
$sp        29                stack pointer
$fp        30                frame pointer
$ra        31                return address

Basic arithmetic and logic operations The basic integer arithmetic operations include the following: add, sub, mul, div. And here are a few logical operations: and, or, xor. Remember that these all require three register operands; for example:
add $t0, $t1, $t2   # $t0 = $t1 + $t2
xor $s0, $s1, $a0   # $s0 = $s1 xor $a0
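
A quick worked example (a sketch; the mapping of variables to registers is an assumption, not from the slide): the C statement a = (b + c) - d could be compiled as

add $t0, $s1, $s2   # $t0 = b + c       (b in $s1, c in $s2)
sub $s0, $t0, $s3   # a = (b + c) - d   (d in $s3, a kept in $s0)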

Immediate operands The ALU instructions we've seen so far expect register operands. How do you get data into registers in the first place? Some MIPS instructions allow you to specify a signed constant, or immediate value, for the second source instead of a register. For example, here is the immediate add instruction, addi:
addi $t0, $t1, 4    # $t0 = $t1 + 4
Immediate operands can be used in conjunction with the $zero register to write constants into registers:
addi $t0, $0, 4     # $t0 = 4
Data can also be loaded first into memory along with the executable file. Then you can use load instructions to put them into registers:
lw $t0, 8($t1)      # $t0 = mem[8 + $t1]
MIPS is considered a load/store architecture, because arithmetic operands cannot be from arbitrary memory locations. They must either be registers or constants that are embedded in the instruction.

We need more space: memory Registers are fast and convenient, but we have only 32 of them, and each one is just 32 bits wide. That's not enough to hold data structures like large arrays. We also can't access data elements that are wider than 32 bits. We need to add some main memory to the system! RAM is cheaper and denser than registers, so we can add lots of it. But memory is also significantly slower, so registers should be used whenever possible. In the past, using registers wisely was the programmer's job. For example, C has a keyword register that marks commonly-used variables which should be kept in the register file if possible. However, modern compilers do a pretty good job of using registers intelligently and minimizing RAM accesses.

Memory review Memory sizes are specified much like register files; here is a 2^k x n-bit RAM. [diagram: a memory block with a k-bit ADRS input, an n-bit DATA input, control inputs CS and WR, and an n-bit OUT output]
CS  WR  Operation
0   x   None
1   0   Read selected address
1   1   Write selected address
A chip select input CS enables or disables the RAM. ADRS specifies the memory location to access. WR selects between reading from or writing to the memory. To read from memory, WR should be set to 0. OUT will be the n-bit value stored at ADRS. To write to memory, we set WR = 1. DATA is the n-bit value to store in memory.

MIPS memory [diagram: a 2^32 x 8 memory with a 32-bit ADRS input and an 8-bit data path] MIPS memory is byte-addressable, which means that each memory address references an 8-bit quantity. The MIPS architecture can support up to 32 address lines. This results in a 2^32 x 8 RAM, which would be 4 GB of memory. Not all actual MIPS machines will have this much!

Bytes and words: Word = 4 Bytes Remember to be careful with memory addresses when accessing words. For instance, assume an array of words begins at address 2000. The first array element is at address 2000. The second word is at address 2004, not 2001. For example, if $a0 contains 2000, then lw $t0, 0($a0) accesses the first word of the array, but lw $t0, 8($a0) would access the third word of the array, at address 2008.

Loading and storing bytes The MIPS instruction set includes dedicated load and store instructions for accessing memory. The main difference is that MIPS uses indexed addressing. The address operand specifies a signed constant and a register. These values are added to generate the effective address. The MIPS load byte instruction lb transfers one byte of data from main memory to a register.
lb $t0, 2($a0)   # $t0 = Memory[$a0 + 2]
Question: what about the other 3 bytes in $t0? Sign extension! The store byte instruction sb transfers the lowest byte of data from a register into main memory.
sb $t0, 2($a0)   # Memory[$a0 + 2] = $t0

Loading and storing words You can also load or store 32-bit quantities (a complete word instead of just a byte) with the lw and sw instructions.
lw $t0, 2($a0)   # $t0 = Memory[$a0 + 2]
sw $t0, 2($a0)   # Memory[$a0 + 2] = $t0
Most programming languages support several 32-bit data types: integers, single-precision floating-point numbers, and memory addresses, or pointers. Unless otherwise stated, we'll assume words are the basic unit of data.

Computing with memory So, to compute with memory-based data, you must: 1. Load the data from memory to the register file. 2. Do the computation, leaving the result in a register. 3. Store that value back to memory if needed. For example, let's say that you wanted to do the same addition, but the values were in memory. How can we do the following using MIPS assembly language, using as few registers as possible?
char A[4] = {1, 2, 3, 4};
int result;
result = A[0] + A[1] + A[2] + A[3];
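
One possible answer, as a sketch: assume the address of A is already in $a0 and the sum is left in $t0, so only two temporary registers are needed.

lb  $t0, 0($a0)     # $t0 = A[0]
lb  $t1, 1($a0)     # $t1 = A[1]
add $t0, $t0, $t1   # running sum = A[0] + A[1]
lb  $t1, 2($a0)
add $t0, $t0, $t1   # + A[2]
lb  $t1, 3($a0)
add $t0, $t0, $t1   # $t0 = A[0] + A[1] + A[2] + A[3]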

Memory alignment Keep in mind that memory is byte-addressable, so a 32-bit word actually occupies four contiguous locations (bytes) of main memory. [figure: bytes at addresses 0-11 grouped into Word 1, Word 2, Word 3] The MIPS architecture requires words to be aligned in memory; 32-bit words must start at an address that is divisible by 4. 0, 4, 8 and 12 are valid word addresses. 1, 2, 3, 5, 6, 7, 9, 10 and 11 are not valid word addresses. Unaligned memory accesses result in a bus error, which you may have unfortunately seen before. This restriction has relatively little effect on high-level languages and compilers, but it makes things easier and faster for the processor.

Endianness Endianness is the byte ordering used to store data. The typical case is the order in which the bytes of an integer value are stored in memory: big-endian or little-endian.

Comparison In little-endian, the least significant byte goes to the lowest memory address, consistent with computer convention. In big-endian, reading bytes from low address to high address is akin to left-to-right reading order in hexadecimal. For example, to store the four bytes 'ABCD' starting at address 0:
Big-endian:     addresses 0 1 2 3 hold A B C D
Little-endian:  addresses 0 1 2 3 hold D C B A
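
One way to observe the byte order on a running machine (a small sketch, not from the slides): store a full word and then load back the byte at its lowest address.

addi $sp, $sp, -4     # make room on the stack
li   $t0, 0x41424344  # the four bytes 'A' 'B' 'C' 'D' as one word
sw   $t0, 0($sp)      # store all four bytes
lb   $t1, 0($sp)      # read the byte at the lowest address
addi $sp, $sp, 4      # big-endian: $t1 = 0x41 ('A'); little-endian: $t1 = 0x44 ('D')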

Exercise Can we figure out the code?
swap(int v[], int k)
{
  int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}
Assuming k is stored in $5, and the starting address of v[] is in $4:
swap:                  ; $5 = k, $4 = address of v[0]
  sll $2, $5, 2        ; $2 = k * 4
  add $2, $4, $2       ; $2 = address of v[k]
  lw  $15, 0($2)       ; $15 = v[k]
  lw  $16, 4($2)       ; $16 = v[k+1]
  sw  $16, 0($2)       ; v[k] = $16
  sw  $15, 4($2)       ; v[k+1] = $15
  jr  $31

Pseudo-instructions MIPS assemblers support pseudo-instructions that give the illusion of a more expressive instruction set, but are actually translated into one or more simpler, real instructions. For example, you can use the li and move pseudo-instructions:
li   $a0, 2        # Load immediate 2 into $a0
move $a0, $t0      # Copy $t0 into $a0
They are probably clearer than their corresponding MIPS instructions:
addi $a0, $0, 2    # Initialize $a0 to 2
add  $a0, $t0, $0  # Copy $t0 into $a0
We'll see lots more pseudo-instructions this semester. A core instruction set is given in the Green Card of the text (J. Hennessy and D. Patterson, 1st page). Unless otherwise stated, you can always use pseudo-instructions in your assignments and on exams.

Control flow in high-level languages The instructions in a program usually execute one after another, but it's often necessary to alter the normal control flow. Conditional statements execute only if some test expression is true.
// Find the absolute value of *a
v = *a;
if (v < 0)
  v = -v;        // This might not be executed
v = v + v;
Loops cause some statements to be executed many times.
// Sum the elements of a five-element array a
v = 0;
t = 0;
while (t < 5) {
  v = v + a[t];  // These statements will
  t++;           // be executed five times
}

MIPS control instructions In the last lecture, we introduced some of MIPS's control-flow instructions:
j    immediate              // for unconditional jumps
jr   $r1                    // jump to the address stored in $r1
bne / beq  $r1, $r2, label  // for conditional branches
slt  $rd, $rs, $rt          // set if less than
slti $rd, $rs, imm          // set if less than, with an immediate
and how to implement loops. Today, we'll talk about MIPS's pseudo-branches, if/else, and case/switch.

Pseudo-branches The MIPS processor only supports two branch instructions, beq and bne, but to simplify your life the assembler provides the following other branches:
blt $t0, $t1, L1    // Branch if $t0 < $t1
ble $t0, $t1, L2    // Branch if $t0 <= $t1
bgt $t0, $t1, L3    // Branch if $t0 > $t1
bge $t0, $t1, L4    // Branch if $t0 >= $t1
Later this term we'll see how supporting just beq and bne simplifies the processor design.

Implementing pseudo-branches Most pseudo-branches are implemented using slt. For example, a branch-if-less-than instruction blt $a0, $a1, Label is translated into the following:
slt $at, $a0, $a1    // $at = 1 if $a0 < $a1
bne $at, $0, Label   // Branch if $at != 0
All of the pseudo-branches need a register to save the result of slt, even though it's not needed afterwards. MIPS assemblers use register $1, or $at, for temporary storage. You should be careful in using $at in your own programs, as it may be overwritten by assembler-generated code.
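
The other pseudo-branches follow the same pattern. For instance, ble can be handled by swapping the slt operands and branching when the result is zero (a sketch of what the assembler might emit):

# ble $a0, $a1, Label   "branch if $a0 <= $a1"
slt $at, $a1, $a0       # $at = 1 if $a1 < $a0, i.e. if $a0 > $a1
beq $at, $0, Label      # branch when that is false, so when $a0 <= $a1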

Translating an if-then statement We can use branch instructions to translate if-then statements into MIPS assembly code.
v = *a;        lw  $t0, 0($a0)
if (v < 0)     bge $t0, $zero, label
  v = -v;      sub $t0, $zero, $t0
v = v + v;     label: add $t0, $t0, $t0
Sometimes it's easier to invert the original condition. In this case, we changed "continue if v < 0" to "skip if v >= 0". This saves a few instructions in the resulting assembly code.

Translating an if-then-else statement If there is an else clause, it is the target of the conditional branch, and the then clause needs a jump over the else clause.
// increase the magnitude of v0 by one
if (v0 < 0)       bge  $v0, $0, E
  v0--;           sub  $v0, $v0, 1
                  j    L
else
  v0++;       E:  add  $v0, $v0, 1
v1 = v0;      L:  move $v1, $v0
Dealing with else-if code is similar, but the target of the first branch will be another if statement. Drawing the control-flow graph can help you out.

Example of a Loop Structure
for (i = 1000; i > 0; i--)
    x[i] = x[i] + h;
Assume: the addresses of x[1000] and x[0] are in $s0 and $s5 respectively; h is in $s2.
Loop: lw   $s1, 0($s0)     # $s1 = x[i]
      add  $s3, $s1, $s2   # $s2 = h
      sw   $s3, 0($s0)     # x[i] = x[i] + h
      addi $s0, $s0, -4    # move to the previous word
      bne  $s0, $s5, Loop  # $s5 = address of x[0]
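
The while loop from the earlier slide can be translated the same way (a sketch with assumed register assignments: v in $t0, t in $t1, and the base address of a in $a0):

      li   $t0, 0          # v = 0
      li   $t1, 0          # t = 0
      li   $t2, 5          # loop bound
test: bge  $t1, $t2, done  # exit when t >= 5
      sll  $t3, $t1, 2     # byte offset = t * 4
      add  $t3, $a0, $t3   # address of a[t]
      lw   $t4, 0($t3)
      add  $t0, $t0, $t4   # v = v + a[t]
      addi $t1, $t1, 1     # t++
      j    test
done: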

Case/Switch statement Many high-level languages support multi-way branches, e.g.:
switch (two_bits) {
  case 0: break;
  case 1: /* fall through */
  case 2: count++; break;
  case 3: count += 2; break;
}
We could just translate the code to ifs, thens, and elses:
if ((two_bits == 1) || (two_bits == 2)) {
  count++;
} else if (two_bits == 3) {
  count += 2;
}
This isn't very efficient if there are many, many cases.

Case/Switch statement
switch (two_bits) {
  case 0: break;
  case 1: /* fall through */
  case 2: count++; break;
  case 3: count += 2; break;
}
Alternatively, we can: 1. Create an array of jump targets, a jump table. 2. Load the entry indexed by the variable two_bits. 3. Jump to that address using the jump register, or jr, instruction (jr $r1). This is much easier to show than to tell.

Coding with jump table (sketch) Suppose the jump table is stored in memory. Its starting address is in $t0. If two_bits == 1, the branch should jump to the 2nd entry in the table, i.e., our target address is $t0 + 4. Assume two_bits is in $t1:
/* test the range of two_bits */
blt $t1, $zero, Exit
bge $t1, $a0, Exit      /* $a0 == 4 */
/* multiply two_bits by 4, to get a byte address */
sll $t1, $t1, 2
/* get the target address */
add $t1, $t1, $t0
lw  $t2, 0($t1)
/* jump */
jr  $t2
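
The table itself can be set up ahead of time, one word per case value (a sketch; the label names and the counter register $s0 are assumptions, and .data/.word/la are standard assembler directives and pseudo-instructions):

        .data
jtab:   .word Case0, Case1, Case2, Case3   # one target address per case
        .text
        la    $t0, jtab        # starting address of the table, as assumed above
        # ... range test, sll, add, lw $t2, jr $t2, exactly as on the slide ...
Case0:  j     Done             # case 0: nothing to do
Case1:                         # case 1 falls through
Case2:  addi  $s0, $s0, 1      # count++
        j     Done
Case3:  addi  $s0, $s0, 2      # count += 2
Done: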

Homework Let's write a program to count how many bits are one in a 32-bit word. Suppose the word is stored in register $t0. C code:
int input, i, counter, bit, position;
counter = 0;
position = 1;
for (i = 0; i < 32; i++) {
  bit = input & position;
  if (bit != 0)
    counter++;
  position = position << 1;
}
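
A possible MIPS sketch of the same loop (assumed register assignments: the input word in $t0, the mask in $t1, the loop counter in $t3, and the result left in $v0):

      li   $v0, 0          # counter = 0
      li   $t1, 1          # position = 1 (mask for bit 0)
      li   $t3, 0          # i = 0
      li   $t2, 32
loop: beq  $t3, $t2, done  # stop after 32 bits
      and  $t4, $t0, $t1   # bit = input & position
      beq  $t4, $zero, skip
      addi $v0, $v0, 1     # counter++
skip: sll  $t1, $t1, 1     # move the mask to the next bit
      addi $t3, $t3, 1     # i++
      j    loop
done: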

Function calls in MIPS We'll talk about the 3 steps in handling function calls: 1. The program's flow of control must be changed. 2. Arguments and return values are passed back and forth. 3. Local variables can be allocated and destroyed. And how they are handled in MIPS: new instructions for calling functions, conventions for sharing registers between functions, and use of a stack.

Control flow in C Invoking a function changes the control flow of a program twice: 1. Calling the function. 2. Returning from the function. In this example the main function calls fact twice, and fact returns twice, but to different locations in main. Each time fact is called, the CPU has to remember the appropriate return address. Notice that main itself is also a function! It is called by the operating system when you run the program.
int main() {
  ...
  t1 = fact(8);
  t3 = t1 + t2;
  t2 = fact(3);
  ...
}
int fact(int n) {
  int i, f = 1;
  for (i = n; i > 0; i--)
    f = f * i;
  return f;
}

Control flow in MIPS MIPS uses the jump-and-link instruction jal to call functions. The jal saves the return address (the address of the next instruction) in the dedicated register $ra, before jumping to the function. jal is the only MIPS instruction that can access the value of the program counter, so it can store the return address PC+4 in $ra.
jal Fact
To transfer control back to the caller, the function just has to jump to the address that was stored in $ra.
jr $ra
Let's now add the jal and jr instructions that are necessary for our factorial example.

Changing the control flow in MIPS
int main() {
  ...
  jal Fact;
  ...
  t3 = t1 + t2;
  ...
  jal Fact;
  ...
}
int fact(int n) {
  int i, f = 1;
  for (i = n; i > 0; i--)
    f = f * i;
  jr $ra;
}

Data flow in C Functions accept arguments and produce return values. The black parts of the program show the actual and formal arguments of the fact function. The purple parts of the code deal with returning and using a result.
int main() {
  ...
  t1 = fact(8);
  t3 = t1 + t2;
  t2 = fact(3);
  ...
}
int fact(int n) {
  int i, f = 1;
  for (i = n; i > 0; i--)
    f = f * i;
  return f;
}

Data flow in MIPS MIPS uses the following conventions for function arguments and results. Up to four function arguments can be passed by placing them in argument registers $a0-$a3 before calling the function with jal. A function can return up to two values by placing them in registers $v0-$v1, before returning via jr. These conventions are not enforced by the hardware or assembler, but programmers agree to them so functions written by different people can interface with each other. Later we'll talk about handling additional arguments or return values.
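
For example, a call such as t1 = fact(8) from the earlier slides would follow these conventions on the caller's side (a sketch; keeping t1 in $s0 is an assumption):

li   $a0, 8      # the single argument goes in $a0
jal  fact        # call; jal puts the return address in $ra
move $s0, $v0    # the result comes back in $v0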

Nested functions What happens when you call a function that then calls another function? Let's say A calls B, which calls C. The arguments for the call to C would be placed in $a0-$a3, thus overwriting the original arguments for B. Similarly, jal C overwrites the return address that was saved in $ra by the earlier jal B.
A:  ...
    # Put B's args in $a0-$a3
    jal B        # $ra = A2
A2: ...

B:  ...
    # Put C's args in $a0-$a3,
    # erasing B's args!
    jal C        # $ra = B2
B2: ...
    jr $ra       # Where does this go???

C:  ...
    jr $ra

Spilling registers The CPU has a limited number of registers for use by all functions, and it's possible that several functions will need the same registers. We can keep important registers from being overwritten by a function call, by saving them before the function executes, and restoring them after the function completes. But there are two important questions. Who is responsible for saving registers: the caller or the callee? Where exactly are the register contents saved?

Who saves the registers? However, in the typical black box programming approach, the caller and callee do not know anything about each other's implementation. Different functions may be written by different people or companies. A function should be able to interface with any client, and different implementations of the same function should be substitutable. Who is responsible for saving important registers across function calls? The caller knows which registers are important to it and should be saved. The callee knows exactly which registers it will use and potentially overwrite. So how can two functions cooperate and share registers when they don't know anything about each other?

The caller could save the registers One possibility is for the caller to save any important registers that it needs before making a function call, and to restore them after. But the caller does not know what registers are actually written by the function, so it may save more registers than necessary. In the example on the right, frodo wants to preserve $a0, $a1, $s0 and $s1 from gollum, but gollum may not even use those registers.
frodo:  li  $a0, 3
        li  $a1, 1
        li  $s0, 4
        li  $s1, 1
        # Save registers
        # $a0, $a1, $s0, $s1
        jal gollum
        # Restore registers
        # $a0, $a1, $s0, $s1
        add $v0, $a0, $a1
        add $v1, $s0, $s1
        jr  $ra

... or the callee could save the registers Another possibility is if the callee saves and restores any registers it might overwrite. For instance, a gollum function that uses registers $a0, $a2, $s0 and $s2 could save the original values first, and restore them before returning. But the callee does not know what registers are important to the caller, so again it may save more registers than necessary.
gollum: # Save registers
        # $a0, $a2, $s0, $s2
        li  $a0, 2
        li  $a2, 7
        li  $s0, 1
        li  $s2, 8
        ...
        # Restore registers
        # $a0, $a2, $s0, $s2
        jr  $ra

... or they could work together MIPS uses conventions again to split the register spilling chores. The caller is responsible for saving and restoring any of the following caller-saved registers that it cares about: $t0-$t9, $a0-$a3, $v0-$v1. In other words, the callee may freely modify these registers, under the assumption that the caller already saved them if necessary. The callee is responsible for saving and restoring any of the following callee-saved registers that it uses: $s0-$s7, $ra. (Remember that $ra is used by jal.) Thus the caller may assume these registers are not changed by the callee. $ra is tricky; it is saved by a callee that is also a caller. Be especially careful when writing nested functions, which act as both a caller and a callee!

Register spilling example This convention ensures that the caller and callee together save all of the important registers: frodo only needs to save registers $a0 and $a1, while gollum only has to save registers $s0 and $s2.
frodo:  li  $a0, 3
        li  $a1, 1
        li  $s0, 4
        li  $s1, 1
        # Save registers
        # $a0, $a1
        jal gollum
        # Restore registers
        # $a0 and $a1
        add $v0, $a0, $a1
        add $v1, $s0, $s1
        jr  $ra

gollum: # Save $ra
        # Save registers
        # $s0 and $s2
        li  $a0, 2
        li  $a2, 7
        li  $s0, 1
        li  $s2, 8
        ...
        # Restore registers
        # $s0 and $s2
        # Restore $ra
        jr  $ra

Where are the registers saved? Now we know who is responsible for saving which registers, but we still need to discuss where those registers are saved. It would be nice if each function call had its own private memory area. This would prevent other function calls from overwriting our saved registers. We could use this private memory for other purposes too, like storing local variables.

Function calls and stacks Notice that function calls and returns occur in a stack-like order: the most recently called function is the first one to return.
1. Someone calls A
2. A calls B
3. B calls C
4. C returns to B
5. B returns to A
6. A returns
A:  ...
    jal B
A2: ...
    jr $ra
B:  ...
    jal C
B2: ...
    jr $ra
C:  ...
    jr $ra
Here, for example, C must return to B before B can return to A.

Stacks and function calls It's natural to use a stack for function call storage. A block of stack space, called a stack frame, can be allocated for each function call. When a function is called, it pushes a new frame onto the stack, which will be used for local storage. Before the function returns, it must pop its stack frame, to restore the stack to its original state. The stack frame can be used for several purposes. Caller-saved and callee-saved registers can be put in the stack. The stack frame can also hold local variables, or extra arguments and return values.

The MIPS stack In MIPS machines, part of main memory is reserved for a stack. The stack grows downward in terms of memory addresses. The address of the top element of the stack is stored (by convention) in the stack pointer register, $sp ($29). MIPS does not provide push and pop instructions. Instead, they must be done explicitly by the programmer. [figure: the stack occupies high memory, growing downward from address 0x7fffffff toward 0x0; $sp points to the top element]

Pushing elements To push elements onto the stack: move the stack pointer $sp down to make room for the new data, then store the elements into the stack. For example, to push registers $t1 and $t2 onto the stack:
sub $sp, $sp, 8
sw  $t1, 4($sp)
sw  $t2, 0($sp)
An equivalent sequence is:
sw  $t1, -4($sp)
sw  $t2, -8($sp)
sub $sp, $sp, 8
[figure: before the push, $sp points to word 1 above word 2; after the push, $t1 and $t2 sit below them and $sp points to $t2]

Accessing and popping elements You can access any element in the stack (not just the top one) if you know where it is relative to $sp. For example, to retrieve the value of $t1:
lw $s0, 4($sp)
You can pop, or erase, elements simply by adjusting the stack pointer upwards. To pop the value of $t2, yielding the stack shown at the bottom:
addi $sp, $sp, 4
Note that the popped data is still present in memory, but data past the stack pointer is considered invalid. [figure: the stack before and after the pop]
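
Putting the stack and the register-saving conventions together, here is a minimal sketch of a callee that uses $s0 and makes a nested call, so it must preserve both $s0 and $ra (the function and the name of the function it calls are made up for illustration):

func:   addi $sp, $sp, -8    # push a two-word stack frame
        sw   $ra, 4($sp)     # save the return address (this callee is also a caller)
        sw   $s0, 0($sp)     # save a callee-saved register before using it
        move $s0, $a0        # body: remember the argument,
        jal  other           # call some other function (hypothetical name),
        add  $v0, $s0, $v0   # and combine its result with the saved argument
        lw   $s0, 0($sp)     # restore in reverse order
        lw   $ra, 4($sp)
        addi $sp, $sp, 8     # pop the frame
        jr   $ra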

Summary Today we focused on implementing function calls in MIPS. We call functions using jal, passing arguments in registers $a0-$a3. Functions place results in $v0-$v1 and return using jr $ra. Managing resources is an important part of function calls. To keep important data from being overwritten, registers are saved according to conventions for caller-saved and callee-saved registers. Each function call uses stack memory for saving registers, storing local variables and passing extra arguments and return values. Assembly programmers must follow many conventions. Nothing prevents a rogue program from overwriting registers or stack memory used by some other function.

Assembly vs. machine language So far we've been using assembly language. We assign names to operations (e.g., add) and operands (e.g., $t0). Branches and jumps use labels instead of actual addresses. Assemblers support many pseudo-instructions. Programs must eventually be translated into machine language, a binary format that can be stored in memory and decoded by the CPU. MIPS machine language is designed to be easy to decode. Each MIPS instruction is the same length, 32 bits. There are only three different instruction formats, which are very similar to each other. Studying MIPS machine language will also reveal some restrictions in the instruction set architecture, and how they can be overcome.

Three MIPS formats Simple instructions, all 32 bits wide; very structured, no unnecessary baggage; only three instruction formats:
R:  op | rs | rt | rd | shamt | funct
I:  op | rs | rt | 16-bit address / immediate   (signed value, -32768 to +32767)
J:  op | 26-bit address
R-type: ALU instructions (add, sub, ...)
*I-type: immediates (addi ...), loads (lw ...), stores (sw ...), conditional branches (bne ...), jump register (jr ...)
J-type: jump (j), jump and link (jal)
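
As a concrete example (worked out here, not taken from the slide), add $t0, $s1, $s2 is an R-type instruction: op = 0, rs = $s1 = 17, rt = $s2 = 18, rd = $t0 = 8, shamt = 0, funct = 32 (add). Packed into 32 bits, that is:

000000 10001 10010 01000 00000 100000  =  0x02324020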

Constants Small constants are used quite frequently (50% of operands), e.g.:
A = A + 5;
B = B + 1;
C = C - 18;
MIPS instructions:
addi $29, $29, 4
slti $8, $18, 10
andi $29, $29, 6
ori  $29, $29, 4

Larger constants Larger constants can be loaded into a register 16 bits at a time. The load upper immediate instruction lui loads the highest 16 bits of a register with a constant, and clears the lowest 16 bits to 0s. An immediate logical OR, ori, then sets the lower 16 bits. To load the 32-bit value 0x003D0900:
lui $s0, 0x003d        # $s0 = 0x003d 0000 (in hex)
ori $s0, $s0, 0x0900   # $s0 = 0x003d 0900
This illustrates the principle of making the common case fast. Most of the time, 16-bit constants are enough. It's still possible to load 32-bit constants, but at the cost of two instructions and one temporary register. Pseudo-instructions may contain large constants; assemblers will translate such instructions correctly. We used a lw instruction before.

Loads and stores The limited 16-bit constant can present difficulties for accesses to global data. Let's assume the assembler puts a variable at address 0x10010004. 0x10010004 is bigger than 32,767. In these situations, the assembler breaks the immediate into two pieces:
lui $t0, 0x1001        # $t0 = 0x1001 0000
lw  $t0, 0x0004($t0)   # read from Mem[0x1001 0004]

Branches For branch instructions, the constant field is not an address, but an offset from the next program counter (PC+4) to the target address.
    beq $at, $0, L
    add $v0, $v0, $0
    add $v0, $v0, $v0
    j   Somewhere
L:  add $v0, $v0, $v0
Since the branch target L is three instructions past the first add, the address field would contain 3 x 4 = 12. The whole beq instruction would be stored with its fields op (beq), rs ($at), rt ($0), and address (12). Why (PC+4)? This will become clear when we learn about pipelining.

Larger branch constants Empirical studies of real programs show that most branches go to targets less than 32,767 instructions away; branches are mostly used in loops and conditionals, and programmers are taught to make code bodies short. If you do need to branch further, you can use a jump with a branch. For example, if Far is very far away, then the effect of:
beq $s0, $s1, Far
...
can be simulated with the following actual code:
      bne $s0, $s1, Next
      j   Far
Next: ...
Again, the MIPS designers have taken care of the common case first.

Summary: Instruction Set Architecture (ISA) The ISA is the interface between hardware and software. The ISA serves as an abstraction layer between the HW and SW. Software doesn't need to know how the processor is implemented. Any processor that implements the ISA appears equivalent. [diagram: Software sits on the ISA, which can be implemented by Proc #1 or Proc #2] An ISA enables processor innovation without changing software. This is how Intel has made billions of dollars. Before ISAs, software was re-written for each new machine.

RISC vs. CISC MIPS was one of the first RISC architectures. It was started about 20 years ago by John Hennessy, one of the authors of our textbook. The architecture is similar to that of other RISC architectures, including Sun's SPARC, IBM and Motorola's PowerPC, and ARM-based processors. Older processors used complex instruction sets, or CISC architectures. Many powerful instructions were supported, making the assembly language programmer's job much easier. But this meant that the processor was more complex, which made the hardware designer's life harder. Many new processors use reduced instruction sets, or RISC architectures. Only relatively simple instructions are available. But with high-level languages and compilers, the impact on programmers is minimal. On the other hand, the hardware is much easier to design, optimize, and teach in classes. Even most current CISC processors, such as Intel 8086-based chips, are now implemented using a lot of RISC techniques.

RISC vs. CISC: Characteristics of ISAs
CISC: variable-length instructions, variable format, memory operands, complex operations
RISC: single-word instructions, fixed-field decoding, load/store architecture, simple operations

A little ISA history
1964: IBM System/360, the first computer family. IBM wanted to sell a range of machines that ran the same software.
1960s, 1970s: Complex Instruction Set Computer (CISC) era. Much assembly programming, compiler technology immature. Simple machine implementations. Complex instructions simplified programming, little impact on design.
1980s: Reduced Instruction Set Computer (RISC) era. Most programming in high-level languages, mature compilers. Aggressive machine implementations. Simpler, cleaner ISAs facilitated pipelining, high clock frequencies.
1990s: Post-RISC era. ISA complexity largely relegated to a non-issue. CISC and RISC chips use the same techniques (pipelining, superscalar, ...). ISA compatibility outweighs any RISC advantage in general-purpose computing. Embedded processors prefer RISC for lower power, cost.
2000s: ??? EPIC? Dynamic translation?

CoE/ECE 0142 Computer Organization: Pipelining Instructor: Jun Yang Slides are adapted from Zilles (1998 Morgan Kaufmann Publishers)

A relevant question Assuming you've got: one washer (takes 30 minutes), one drier (takes 40 minutes), one folder (takes 20 minutes). It takes 90 minutes to wash, dry, and fold 1 load of laundry. How long do 4 loads take?

The slow way [timeline: 6 PM to midnight; each load takes 30 + 40 + 20 minutes, one load after another] If each load is done sequentially it takes 6 hours.

Laundry Pipelining Start each load as soon as possible. Overlap loads. [timeline: 6 PM onward; the washer, drier, and folder work on different loads at the same time] Pipelined laundry takes 3.5 hours.

Pipelining Lessons [timeline as on the previous slide] Pipelining doesn't help the latency of a single load; it helps the throughput of the entire workload. The pipeline rate is limited by the slowest pipeline stage. Multiple tasks operate simultaneously using different resources. Potential speedup = number of pipe stages. Unbalanced lengths of pipe stages reduce the speedup. Time to fill the pipeline and time to drain it also reduce the speedup.

Pipelining Pipelining is a general-purpose efficiency technique. It is not specific to processors. Pipelining is used in: assembly lines, fast food restaurants. Pipelining gives the best of both worlds and is used in just about every modern processor.

Instruction execution review Executing a MIPS instruction can take up to five steps.
Step                Name  Description
Instruction Fetch   IF    Read an instruction from memory.
Instruction Decode  ID    Read source registers and generate control signals.
Execute             EX    Compute an R-type result or a branch outcome.
Memory              MEM   Read or write the data memory.
Writeback           WB    Store a result in the destination register.
However, as we saw, not all instructions need all five steps.
Instruction  Steps required
beq          IF ID EX
R-type       IF ID EX WB
sw           IF ID EX MEM
lw           IF ID EX MEM WB

Single-cycle datapath diagram [figure: the single-cycle datapath; the instruction memory read takes 2 ns, the register file read 1 ns, the ALU 2 ns, and the data memory access 2 ns] How long does it take to execute each instruction?

Single-cycle review All five execution steps occur in one clock cycle. Each hardware element can only be used once per clock cycle. A lw or sw must access memory twice (in the IF and MEM stages), so there are separate instruction and data memories. There are multiple adders, since each instruction increments the PC (IF) and performs another computation (EX). On top of that, branches also need to compute a target address.

Review: Instruction Fetch (IF) Let's quickly review how lw is executed in the single-cycle datapath. We'll ignore PC incrementing and branching for now. In the Instruction Fetch (IF) step, we read the instruction memory. [figure: single-cycle datapath with the instruction memory highlighted]

Instruction Decode (ID) The Instruction Decode (ID) step reads the source register from the register file. [figure: single-cycle datapath with the register file read ports highlighted]

Execute (EX) The third step, Execute (EX), computes the effective memory address from the source register and the instruction's constant field. [figure: single-cycle datapath with the ALU and sign-extend unit highlighted]

Memory (MEM) The Memory (MEM) step involves reading the data memory, from the address computed by the ALU. [figure: single-cycle datapath with the data memory highlighted]

Writeback (WB) Finally, in the Writeback (WB) step, the memory value is stored into the destination register. [figure: single-cycle datapath with the register file write port highlighted]

A bunch of lazy functional units Notice that each execution step uses a different functional unit. In other words, the main units are idle for most of the 8 ns cycle! The instruction RAM is used for just 2 ns at the start of the cycle. Registers are read once in ID (1 ns), and written once in WB (1 ns). The ALU is used for 2 ns near the middle of the cycle. Reading the data memory only takes 2 ns as well. That's a lot of hardware sitting around doing nothing.

Putting those slackers to work We shouldn't have to wait for the entire instruction to complete before we can re-use the functional units. For example, the instruction memory is free in the Instruction Decode step as shown below, so... [figure: the instruction memory sits idle while the ID step uses the register file]

Decoding and fetching together Why don't we go ahead and fetch the next instruction while we're decoding the first one? [figure: the IF hardware fetches the 2nd instruction while the ID hardware decodes the 1st instruction]

Executing, decoding and fetching Similarly, once the first instruction enters its Execute stage, we can go ahead and decode the second instruction. But now the instruction memory is free again, so we can fetch the third instruction! [figure: fetch the 3rd instruction, decode the 2nd, and execute the 1st, all in the same cycle]

Making Pipelining Work We'll make our pipeline 5 stages long, to handle load instructions as they were handled in the multi-cycle implementation. The stages are: IF, ID, EX, MEM, and WB. We want to support executing 5 instructions simultaneously: one in each stage.

Break the datapath into 5 stages Each stage has its own functional units. Each stage can execute in 2 ns, just like the multi-cycle implementation. [figure: the datapath divided into IF, ID, EXE, MEM, and WB sections; 2 ns for instruction fetch, 1 ns for register read, 2 ns for the ALU, 2 ns for data memory]

Pipelining Loads
Clock cycle       1  2  3  4   5   6   7   8   9
lw $t0, 4($sp)    IF ID EX MEM WB
lw $t1, 8($sp)       IF ID EX  MEM WB
lw $t2, 12($sp)         IF ID  EX  MEM WB
lw $t3, 16($sp)            IF  ID  EX  MEM WB
lw $t4, 20($sp)                IF  ID  EX  MEM WB
A pipeline diagram shows the execution of a series of instructions. The instruction sequence is shown vertically, from top to bottom. Clock cycles are shown horizontally, from left to right. Each instruction is divided into its component stages. (We show five stages for every instruction, which will make the control unit easier.) This clearly indicates the overlapping of instructions. For example, there are three instructions active in the third cycle above. The lw $t0 instruction is in its Execute stage. Simultaneously, the lw $t1 is in its Instruction Decode stage. Also, the lw $t2 instruction is just being fetched.

Pipelining terminology [pipeline diagram: the same five lw instructions as on the previous slide] The pipeline depth is the number of stages, in this case five. In the first four cycles here, the pipeline is filling, since there are unused functional units. In cycle 5, the pipeline is full. Five instructions are being executed simultaneously, so all hardware units are in use. In cycles 6-9, the pipeline is emptying.

Pipelining Performance [pipeline diagram: the same five lw instructions as before] Execution time on an ideal pipeline: time to fill the pipeline + one cycle per instruction. How long for N instructions? Compared to the single-cycle design, how much faster is pipelining for N = 1000?
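
One way to work out the comparison, using the timing figures from the earlier slides (2 ns per pipeline stage versus an 8 ns single-cycle clock; the N = 1000 figure is illustrative):

Single-cycle: N x 8 ns                   = 8N ns
Pipelined:    (N + 4 fill cycles) x 2 ns = (2N + 8) ns
For N = 1000: 8000 ns versus 2008 ns, a speedup of roughly 4x over the single-cycle design.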

Pipeline Datapath: Resource Requirements [pipeline diagram: the same five lw instructions as before] We need to perform several operations in the same cycle. Increment the PC and add registers at the same time. Fetch one instruction while another one reads or writes data. What does that mean for our hardware?

Pipelining other instruction types R-type instructions only require 4 stages: IF, ID, EX, and WB. We don't need the MEM stage. What happens if we try to pipeline loads with R-type instructions?
Clock cycle         1  2  3  4   5   6   7   8   9
add $sp, $sp, -4    IF ID EX WB
sub $v0, $a0, $a1      IF ID EX  WB
lw  $t0, 4($sp)           IF ID  EX  MEM WB
or  $s0, $s1, $s2            IF  ID  EX  WB
lw  $t1, 8($sp)                  IF  ID  EX  MEM WB
The load uses the register file's write port during its 5th stage (cycle 7). The R-type uses the register file's write port during its 4th stage (cycle 7).

A solution: Insert NOP stages Enforce uniformity. Make all instructions take 5 cycles. Make them have the same stages, in the same order. Some stages will do nothing for some instructions.
R-type: IF ID EX NOP WB
Clock cycle         1  2  3  4   5   6   7   8   9
add $sp, $sp, -4    IF ID EX NOP WB
sub $v0, $a0, $a1      IF ID EX  NOP WB
lw  $t0, 4($sp)           IF ID  EX  MEM WB
or  $s0, $s1, $s2            IF  ID  EX  NOP WB
lw  $t1, 8($sp)                  IF  ID  EX  MEM WB
Stores and branches have NOP stages, too:
store:  IF ID EX MEM NOP
branch: IF ID EX NOP NOP

What we have so far Pipelining attempts to maximize instruction throughput by overlapping the execution of multiple instructions. Pipelining offers amazing speedup. In the best case, one instruction finishes on every cycle, and the speedup is equal to the pipeline depth. The pipeline datapath is much like the single-cycle one, but with added pipeline registers. Each stage needs its own functional units. Next we'll see the datapath and control, and walk through an example execution.

Pipelined Datapath and Control Last time we introduced the main ideas of pipelining. Today we'll see a basic implementation of a pipelined processor. The datapath and control unit share similarities with both the single-cycle and multicycle implementations that we already saw. An example execution highlights important pipelining concepts. In future lectures, we'll discuss several complications of pipelining that we're hiding from you for now.

Pipelining Concepts A pipelined processor allows multiple instructions to execute at once, and each instruction uses a different functional unit in the datapath. This increases throughput, so programs can run faster. One instruction can finish executing on every clock cycle, and simpler stages also lead to shorter cycle times.
Clock cycle         1  2  3  4   5   6   7   8   9
lw  $t0, 4($sp)     IF ID EX MEM WB
sub $v0, $a0, $a1      IF ID EX  MEM WB
and $t1, $t2, $t3         IF ID  EX  MEM WB
or  $s0, $s1, $s2            IF  ID  EX  MEM WB
add $t5, $t6, $0                 IF  ID  EX  MEM WB

Pipelined Datapath The whole point of pipelining is to allow multiple instructions to execute at the same time. We may need to perform several operations in the same cycle: increment the PC and add registers at the same time; fetch one instruction while another one reads or writes data. [pipeline diagram: the same five instructions as on the previous slide] Thus, like the single-cycle datapath, a pipelined processor will need to duplicate hardware elements that are needed several times in the same clock cycle.

One register file is enough We need only one register file to support both the ID and WB stages. [figure: a register file with two read ports (Read register 1/2, Read data 1/2) and one write port (Write register, Write data)] Reads and writes go to separate ports on the register file. We already took advantage of this property in our single-cycle CPU.

Single-cycle datapath, slightly rearranged [figure: the single-cycle datapath redrawn so the stages line up from left to right: PC and instruction memory, register file, ALU, data memory, and the write-back path]

Multiple cycles In pipelining, we also divide instruction execution into multiple cycles. Information computed during one cycle may be needed in a later cycle. The instruction read in the IF stage determines which registers are fetched in the ID stage, what constant is used for the EX stage, and what the destination register is for WB. The registers read in ID are used in the EX and/or MEM stages. The ALU output produced in the EX stage is an effective address for the MEM stage or a result for the WB stage. We added several intermediate registers to the multicycle datapath to preserve information between stages, as highlighted on the next slide.

Registers added to the multi-cycle datapath [figure: the multicycle datapath, with the Instruction register, Memory data register, A, B, and ALUOut registers highlighted as the intermediate registers between steps]

Pipeline registers We'll add intermediate registers to our pipelined datapath too. There's a lot of information to save, however. We'll simplify our diagrams by drawing just one big pipeline register between each stage. The registers are named for the stages they connect: IF/ID, ID/EX, EX/MEM, MEM/WB. No register is needed after the WB stage, because after WB the instruction is done.

Pipelined datapath [figure: the rearranged datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers inserted between the stages]

Propagating values forward Any data values required in later stages must be propagated through the pipeline registers. The most extreme example is the destination register. The rd field of the instruction word, retrieved in the first stage (IF), determines the destination register. But that register isn't updated until the fifth stage (WB). Thus, the rd field must be passed through all of the pipeline stages, as shown in red on the next slide.

The destination register [figure: the pipelined datapath, with the destination register field carried through the ID/EX, EX/MEM, and MEM/WB pipeline registers back to the register file's write port]

What about control signals? The control signals are generated in the same way as in the single-cycle processor: after an instruction is fetched, the processor decodes it and produces the appropriate control values. But just like before, some of the control signals will not be needed until some later stage and clock cycle. These signals must be propagated through the pipeline until they reach the appropriate stage. We can just pass them in the pipeline registers, along with the other data. Control signals can be categorized by the pipeline stage that uses them.
Stage  Control signals needed
EX     ALUSrc, ALUOp, RegDst
MEM    MemRead, MemWrite, PCSrc
WB     RegWrite, MemToReg

Pipelined datapath and control [figure: the pipelined datapath with a control unit in the ID stage; the EX, MEM, and WB control signals travel down the ID/EX, EX/MEM, and MEM/WB pipeline registers along with the data]

Notes about the diagram The control signals are grouped together in the pipeline registers, just to make the diagram a little clearer. Not all of the registers have a write enable signal. Because the datapath fetches one instruction per cycle, the PC must also be updated on each clock cycle. Including a write enable for the PC would be redundant. Similarly, the pipeline registers are also written on every cycle, so no explicit write signals are needed.

An example execution sequence Here's a sample sequence of instructions to execute (addresses in decimal):
 0: lw  $8, 4($29)
 4: sub $2, $4, $5
 8: and $9, $10, $11
12: or  $16, $17, $18
16: add $13, $14, $0
We'll make some assumptions, just so we can show actual data values. Each register contains its number plus 100. For instance, register $8 contains 108, register $29 contains 129, and so forth. Every data memory location contains 99. Our pipeline diagrams will follow some conventions. An X indicates values that aren't important, like the constant field of an R-type instruction. Question marks ??? indicate values we don't know, usually resulting from instructions coming before and after the ones in our example.

Cycle 1 (filling). IF: lw $8, 4($29); ID: ???; EX: ???; MEM: ???; WB: ???. [Pipeline snapshot diagram; all values downstream of IF are still unknown.]

Cycle 2. IF: sub $2, $4, $5; ID: lw $8, 4($29); EX: ???; MEM: ???; WB: ???. [Pipeline snapshot diagram.]

Cycle 3. IF: and $9, $10, $11; ID: sub $2, $4, $5; EX: lw $8, 4($29); MEM: ???; WB: ???. [Pipeline snapshot diagram; the ALU computes the lw's effective address 129 + 4 = 133.]

Cycle 4. IF: or $16, $17, $18; ID: and $9, $10, $11; EX: sub $2, $4, $5; MEM: lw $8, 4($29); WB: ???. [Pipeline snapshot diagram; data memory returns 99 from address 133.]

Cycle 5 (full). IF: add $13, $14, $0; ID: or $16, $17, $18; EX: and $9, $10, $11; MEM: sub $2, $4, $5; WB: lw $8, 4($29). [Pipeline snapshot diagram; the loaded value 99 is written back to register $8.]

Cycle 6 (emptying). IF: ???; ID: add $13, $14, $0; EX: or $16, $17, $18; MEM: and $9, $10, $11; WB: sub $2, $4, $5. [Pipeline snapshot diagram.]

Cycle 7. IF: ???; ID: ???; EX: add $13, $14, $0; MEM: or $16, $17, $18; WB: and $9, $10, $11. [Pipeline snapshot diagram.]

Cycle 8. IF: ???; ID: ???; EX: ???; MEM: add $13, $14, $0; WB: or $16, $17, $18. [Pipeline snapshot diagram.]

Cycle 9. IF: ???; ID: ???; EX: ???; MEM: ???; WB: add $13, $14, $0. [Pipeline snapshot diagram.]

That's a lot of diagrams there.

Clock cycle            1    2    3    4    5    6    7    8    9
lw  $t0, 4($sp)        IF   ID   EX   MEM  WB
sub $v0, $a0, $a1           IF   ID   EX   MEM  WB
and $t1, $t2, $t3                IF   ID   EX   MEM  WB
or  $s0, $s1, $s2                     IF   ID   EX   MEM  WB
add $t5, $t6, $0                           IF   ID   EX   MEM  WB

Compare the last nine slides with the pipeline diagram above. You can see how instruction executions are overlapped. Each functional unit is used by a different instruction in each cycle. The pipeline registers save control and data values generated in previous clock cycles for later use. When the pipeline is full in clock cycle 5, all of the hardware units are in use. This is the ideal situation, and it is what makes pipelined processors so fast.
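The diagram above can also be generated mechanically. Here is a short Python sketch (an added illustration, assuming an ideal pipeline with no stalls) that prints which stage each instruction occupies in every cycle:

    STAGES = ["IF", "ID", "EX", "MEM", "WB"]
    instrs = ["lw  $t0, 4($sp)", "sub $v0, $a0, $a1", "and $t1, $t2, $t3",
              "or  $s0, $s1, $s2", "add $t5, $t6, $0"]

    cycles = len(instrs) + len(STAGES) - 1          # 5 + 5 - 1 = 9 cycles in total
    for i, instr in enumerate(instrs):
        row = []
        for c in range(cycles):
            s = c - i                               # instruction i enters IF in cycle i + 1
            row.append(STAGES[s] if 0 <= s < len(STAGES) else "")
        print(f"{instr:<22}" + "".join(f"{stage:<5}" for stage in row))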

Performance Revisited. Assume the following functional unit latencies: instruction memory 3 ns, register read 2 ns, ALU 2 ns, data memory 3 ns, register write 2 ns. What is the cycle time of a single-cycle implementation? What is its throughput? What is the cycle time of an ideal pipelined implementation? What is its steady-state throughput? How much faster is pipelining?
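One way to work out the answers (the arithmetic follows directly from the latencies listed above, and matches the 4x figure discussed on the next slide):

    latency = {"Inst mem": 3, "Reg read": 2, "ALU": 2, "Data mem": 3, "Reg write": 2}

    single_cycle = sum(latency.values())    # 12 ns: one instruction uses the whole path
    pipelined    = max(latency.values())    # 3 ns: the clock is limited by the slowest stage

    print(single_cycle)                     # 12 ns cycle time; throughput of 1 instruction per 12 ns
    print(pipelined)                        # 3 ns cycle time; steady-state throughput of 1 per 3 ns
    print(single_cycle / pipelined)         # 4.0, so pipelining is about 4 times faster here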

Ideal speedup

Clock cycle            1    2    3    4    5    6    7    8    9
lw  $t0, 4($sp)        IF   ID   EX   MEM  WB
sub $v0, $a0, $a1           IF   ID   EX   MEM  WB
and $t1, $t2, $t3                IF   ID   EX   MEM  WB
or  $s0, $s1, $s2                     IF   ID   EX   MEM  WB
add $sp, $sp, -4                           IF   ID   EX   MEM  WB

In our pipeline, we can execute up to five instructions simultaneously, which implies that the maximum speedup is 5 times. In general, the ideal speedup equals the pipeline depth. So why was our speedup on the previous slide only 4 times? The pipeline stages are imbalanced: register file and ALU operations can be done in 2 ns, but we must stretch that out to 3 ns to keep the ID, EX, and WB stages synchronized with IF and MEM. Balancing the stages is one of the many hard parts of designing a pipelined processor.

The pipelining paradox

Clock cycle            1    2    3    4    5    6    7    8    9
lw  $t0, 4($sp)        IF   ID   EX   MEM  WB
sub $v0, $a0, $a1           IF   ID   EX   MEM  WB
and $t1, $t2, $t3                IF   ID   EX   MEM  WB
or  $s0, $s1, $s2                     IF   ID   EX   MEM  WB
add $sp, $sp, -4                           IF   ID   EX   MEM  WB

Pipelining does not improve the execution time of any single instruction. Each instruction here actually takes longer to execute than in a single-cycle datapath (15 ns vs. 12 ns)! Instead, pipelining increases the throughput, or the amount of work done per unit time. Several instructions are executed together in each clock cycle. The result is improved execution time for a sequence of instructions, such as an entire program.
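A small back-of-the-envelope calculation (an added sketch using the 3 ns pipelined and 12 ns single-cycle clock from before) shows how throughput wins for longer instruction sequences even though single-instruction latency gets worse:

    def single_cycle_ns(n, cycle=12):
        return n * cycle

    def pipelined_ns(n, stages=5, cycle=3):
        # The first instruction needs all 5 stages; each later one finishes 1 cycle after it.
        return (stages + n - 1) * cycle

    for n in (1, 5, 1000):
        print(n, single_cycle_ns(n), pipelined_ns(n))
    # 1    -> 12 ns vs 15 ns      (a single instruction is actually slower)
    # 5    -> 60 ns vs 27 ns
    # 1000 -> 12000 ns vs 3012 ns (approaching the ideal 4x speedup)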

Instruction set architectures and pipelining. The MIPS instruction set was designed especially for easy pipelining. All instructions are 32 bits long, so the instruction fetch stage just needs to read one word on every clock cycle. Fields are in the same position in different instruction formats: the opcode is always the first six bits, rs is the next five bits, and so on. This makes things easy for the ID stage. MIPS is a register-to-register architecture, so arithmetic operations cannot contain memory references. This keeps the pipeline shorter and simpler. Pipelining is harder for older, more complex instruction sets. If different instructions had different lengths or formats, the fetch and decode stages would need extra time to determine the actual length of each instruction and the positions of its fields. With memory-to-memory instructions, additional pipeline stages might be needed to compute effective addresses and read memory before the EX stage.
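As a concrete illustration of the fixed field positions (a small added sketch, not from the slides), any 32-bit MIPS instruction word can be sliced the same way before the format is even known:

    def decode_fields(word):
        # Field positions are identical in the R-type and I-type formats.
        opcode = (word >> 26) & 0x3F    # bits 31-26
        rs     = (word >> 21) & 0x1F    # bits 25-21
        rt     = (word >> 16) & 0x1F    # bits 20-16
        rd     = (word >> 11) & 0x1F    # bits 15-11 (meaningful only for R-type)
        imm    = word & 0xFFFF          # bits 15-0  (meaningful only for I-type)
        return opcode, rs, rt, rd, imm

    # lw $8, 4($29) is an I-type instruction: opcode 0x23, rs = 29, rt = 8, imm = 4.
    word = (0x23 << 26) | (29 << 21) | (8 << 16) | 4
    print(decode_fields(word))          # (35, 29, 8, 0, 4)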

Summary so far. The pipelined datapath combines ideas from the single-cycle and multicycle processors that we saw earlier. It uses multiple memories and ALUs, and instruction execution is split into several stages. Pipeline registers propagate data and control values to later stages. The MIPS instruction set architecture supports pipelining with uniform instruction formats and simple addressing modes. Next, we'll start talking about hazards.

Welcome to Part 3: Memory Systems and I/O. We've already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? We will now focus on memory issues, which are frequently bottlenecks that limit the performance of a system. We'll start off by looking at memory systems in the remaining lectures. [Block diagram: Processor, Memory, Input/Output.]

Cache introduction. Today we'll answer the following questions. What are the challenges of building big, fast memory systems? What is a cache? Why do caches work? (Answer: locality.) How are caches organized? Where do we put things, and how do we find them?