CISC, RISC and post-RISC architectures

1 Microprocessor architecture and instruction execution CISC, RISC and Post-RISC architectures Instruction Set Architecture and microarchitecture Instruction encoding and machine instructions Pipelined and superscalar instruction execution Hazards and dependences Branch prediction Out-of-order instruction execution 32- and 64-bit architectures Multicore architectures and hyperthreading Processor architectures for embedded systems 1 CISC, RISC and post-RISC architectures CISC Complex Instruction Set Computer large instruction set instructions can perform very complex operations, powerful assembly language variable instruction formats large number of addressing modes few registers machine instructions implemented with microcode RISC Reduced Instruction Set Computer relatively few instructions simple addressing modes, only load/store instructions access memory uniform instruction length many registers no microcode pipelined instruction execution Modern processors have developed further from the basic ideas behind RISC architecture 2 1

2 Post-RISC architecture Modern processors have developed further from the basic ideas behind RISC architecture exploit more instruction level parallelism Characteristics: parallel instruction execution (superscalar) deep pipeline (superpipelined) advanced branch prediction out-of-order instruction execution register renaming extended instruction set 3 Instruction Set Architecture An abstract description of a processor as it is seen by an (assembly language) programmer or compiler writer abstract model of a processor defines the instructions, registers and mechanisms to access memory that the processor can use to operate on data Specifies the registers machine instructions and their encoding memory addresses addressing modes Examples: Intel IA-32, Intel 64, AMD-64 define a family of microprocessors, from the 8086 (1978) to the Intel Core i7 all binary compatible (within certain limits) Intel 64 and IA-32 Architectures Software Developer's Manuals, available at 4 2

3 Microarchitecture The microarchitecture of a processor defines how the ISA is implemented in hardware defines how the functionality of the ISA is implemented execution pipeline, functional units, memory organization,... Example (Intel processors) P6 microarchitecture from Intel Pentium Pro to Pentium III Netburst microarchitecture Pentium 4, Xeon Core microarchitecture Core 2, Xeon Nehalem microarchitecture Core i5, Core i7 The physical details (circuit layout, hardware construction, packaging, etc.) are an implementation of the microarchitecture two processors can have the same microarchitecture, but different hardware implementations for instance 90 nm transistor technology, 65 nm, 45 nm high-k metal gate technology or 32 nm technology 5 Instruction encoding Assembly language instructions are encoded into numerical machine instructions by the assembler Instruction formats can be of different types variable length supports a varying number of operands typically used in CISC architectures: PDP 11, VAX, Motorola fixed format always the same number of operands addressing mode is specified as part of the opcode easy to decode, all instructions have the same form typically used in RISC architectures: SPARC, PowerPC, MIPS hybrid format multiple formats, depending on the operation used in most Intel and AMD processors: IA-32, Intel 64, AMD-64 machine instructions are split into micro-operations before they are executed 6 3

4 Assembly language instructions C code The instruction set specifies the machine instructions that the processor can execute expressed as assembly language instructions Instructions can have 2 or 3 operands add a,b a a + b, result overwrites a add c,a,b c a + b, result placed in c Nr of memory references in an instruction can be 0 load/store (RISC) 1 Intel x86 2 or 3 CISC architectures Translated to binary machine code (opcodes) by an assembler machine instructions can be of different lengths c = a + b! Assembly language register memory load a, R1! add b, R1! sto R1, c! Assembly language load store load a, R1! load b, R2! add R2, R1! sto R1, c! 7 Micro-operations Machine instructions are decoded into micro-operations (µops) before they are executed Simple instructions generate only one µop Example: ADD %RBX, %RAX add the register RBX to the register RAX More complex instructions generate several µops Example: ADD %R8, (MEM) add register R8 to the contents in the memory position with address MEM may generate 3 µops: load the value at address MEM into a register add the value in register R8 to the register store the register to the memory location at address MEM For complex addressing modes the effective address also has to be computed Micro-operations can be efficiently executed out-of-order in a pipelined fashion 8 4

5 Instruction pipelining Instruction execution is divided into a number of stages instruction fetch instruction decode execute memory writeback Instructions in IF ID X M Results out W The time to move an instruction one step through the pipeline is called a machine cycle can complete one instruction every cycle without pipelining we could complete one instruction every 5 cycles CPI Clock Cycles Per Instruction the number of cycles needed to execute an instruction varies for different instructions 9 Pipelined execution Instruction fetch (IF) the next instruction is fetched from memory at the address pointed to by the Program Counter increment PC so that it points to the next instruction Instruction decode (ID) decode the instruction and identify which type it is immediate constant values are extended into 32/64 bits Execution (X) if it's an arithmetic operation, execute it in an ALU if it's a load/store the address of the operand is computed if it's a branch, set the PC to the destination address Memory access (M) if it's a load, fetch the content of the address from memory if it's a store, write the operand to the specified address in memory if it's neither a load nor store, do nothing Writeback (W) write the result of the operation to the destination register if it's a branch or a store, do nothing 10 5

6 Pipelined instruction execution All pipeline stages can execute in parallel Instruction Level Parallelism separate hardware units for each stage Successive instructions load a, R1! load b, R2! load c, R3! load d, R4! add R2, R1! add #1, R1! sub R3, R1! mul R4, R1! Clock cycles After 5 clock cycles, the pipeline is full finishes one instruction every clock period it takes 5 clock periods to complete one instruction Pipelining increases the CPU instruction throughput does not reduce the time to execute an individual instruction 11 Throughput and latency Throughput the number of instructions a pipeline can execute per time unit Latency the number of clock cycles it takes for the pipeline to complete the execution of an individual instruction Different instructions have different latency and throughput Pipeline examples: Pentium 3 has a 14-stage pipeline Pentium 4 has a 20-stage pipeline Core 2 (Nehalem) has a 14 stage pipeline AMD Opteron (Barcelona) has a 12 stage pipeline 12 6
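A small worked example of the throughput/latency distinction above, in C. The numbers (5 stages, one million instructions, one instruction completed per cycle once the pipeline is full) are illustrative only, not measurements of any of the processors listed:

#include <stdio.h>

/* Executing N instructions on a k-stage pipeline takes roughly
   k + (N - 1) cycles once the pipeline is full, versus k * N cycles
   without pipelining. */
int main(void)
{
    const long k = 5;        /* pipeline depth (stages) */
    const long N = 1000000;  /* number of instructions */

    long unpipelined = k * N;
    long pipelined   = k + (N - 1);

    printf("unpipelined: %ld cycles\n", unpipelined);
    printf("pipelined:   %ld cycles\n", pipelined);
    printf("speedup:     %.2f\n", (double)unpipelined / pipelined);
    /* The speedup approaches the pipeline depth k for large N, even
       though each individual instruction still needs k cycles (latency). */
    return 0;
}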

7 Superscalar architecture Increases the ability of the processor to use instruction level parallelism Multiple instructions are issued every cycle multiple pipelines or functional units operating in parallel Example: 3 parallel pipelines each with 5 stages 3-way superscalar processor 3-issue processor Successive instructions instr. 1! instr. 2! instr. 3! instr. 4! instr. 5! instr. 6! instr. 7! instr. 8! instr. 9!...! Clock cycles 13 Pipeline hazards Pipelined execution is efficient if the flow of instructions through the pipeline can proceed without being interrupted Situations that prevent an instruction in the stream from executing during its clock cycle are called pipeline hazards hazards may force the pipeline to stall may have to stop the instruction fetch for a number of cycles, until all the resources that are needed become available also called a pipeline bubble Structural hazards caused by resource conflicts two instructions need the same hardware unit in the same pipeline stage Data hazards arise when an instruction depends on the result of a previous instruction, which has not completed yet Control hazards caused by branches in the instruction stream 14 7

8 Structural hazards Caused by resource conflicts the hardware can not simultaneously execute two instructions that need access to the same (single) functional unit for instance, if instructions and data are fetched from the same memory port Can be avoided by duplicating functional units or access paths to memory pipelining functional units stalling the instruction execution for at least one cycle creates a pipeline bubble 15 Data hazards An instruction depends on the result of a previous instruction, which has not completed yet caused by dependences among the data different types of data hazards: read-after-write, write-after-write and write-after-read Example: the loads write the values into the registers in the write-back stage R1 will be ready in cycle 4 R2 will be ready in cycle 5 The add must stall until both R1 and R2 are ready Can be avoided by forwarding and register renaming load a, R1! load b, R2! add R1,R2! Clock cycle IF ID X M WB IF ID X M WB 7 8 IF ID X M WB 16 8

9 Control hazards Branch instructions transfer control in the program execution may assign a new value to the PC Conditional branches may be taken or not taken a taken branch assigns the target address to the PC a branch that is not taken (which falls through) continues at the next instruction The instruction is recognized as a branch in the instruction decode phase can decide whether the branch will be taken or not in the execute stage the next instruction has to stall Can be avoided by branch prediction...! jnz L1! add #1,R2! sub R4,R3! L1:! mov #0,R1! jnz L1! add #1,R2! IF ID X M WB sub R4,R3! IF ID X M Clock cycle WB IF ID X M WB 17 Dependence Pipeline hazards are caused by dependences in the code limit the amount of instruction level parallelism that can be used Can avoid hazards (and pipeline stalls) in the execution of a program by using more advanced instruction execution mechanisms forwarding register renaming instruction scheduling branch prediction dynamic instruction execution Can also eliminate some dependences by code transformations formulate the program in an alternative way, avoiding some dependences 18 9

10 Data and control dependence Data dependence data must be produced and consumed in the correct order in the program execution Definition: two statements s and t are data dependent if and only if both statements access the same memory location and at least one of them stores into it, and there is a feasible run-time execution path from s to t Control dependence determines the ordering of instructions with respect to branches Example: if p1! then s1! else s2;! s1 and s2 are control dependent on p1 we have to first execute p1 before we know which of s1 or s2 should be executed 19 Data dependence Three types of data dependences true dependence anti-dependence output dependence Anti-dependence and output dependence are called name dependences two instructions use the same register or memory locations, but there is no actual flow of data between the two instructions no real dependence between data, only between the names, i.e. the registers or memory locations (variables) that are used to hold the data 20 10

11 True dependence An instruction depends on data from a previous instruction the first statement stores into a location that is later read by the second statement can not execute the statements in the reverse order can not execute the statements simultaneously in the pipeline without causing a stall x = a*2;! y = x+1;! Corresponds to a Read After Write (RAW) data hazard between the two instructions 21 Antidependence Anti-dependence the first instruction reads from a location into which the second statement stores y = x+a;! x = b;! Corresponds to a Write After Read (WAR) hazard between the two instructions No value is transmitted between the two statements can be executed simultaneously if we choose another name for x in the assignment statement x=b; Can be avoided by using register renaming use different registers for the variable x in the two statements 22 11

12 Output dependence Output dependence two instructions write to the same register or memory location Corresponds to a Write After Write (WAW) hazard between the two instructions No value is transmitted between the two statements x = x+1;!...! x = b;! can be executed simultaneously if we choose another name for one of the references to x Can be avoided by using register renaming 23 Control dependence Control dependence determines the order of instructions with respect to branches (jumps) in the code if p evaluates to TRUE, the instruction s1 is executed if p evaluates to FALSE, the instruction s1 is not executed, but is branched over! s0;! if (p)! then s1;! s2;! Instructions that are control dependent on a branch can not be moved before the branch instructions from the then-part of an if-statement can not be executed before the branch Instructions that are not control dependent on a branch can not be moved into the then-part Can avoid hazards by using branch prediction speculative execution 24 12

13 Branch prediction To avoid stalling the pipeline when branch instructions are executed, branch prediction is used it is very important to have a good branch prediction mechanism, since branches are very common in most programs Example: 20% branch instructions in SPECint92 benchmark Branch prediction is only needed for conditional branches, unconditional branches are always taken subroutine calls and goto-statements are always taken returns from subroutines need to be predicted Two types of branch prediction mechanisms: static uses fixed rules for guessing how branches will go dynamic collects information about how the branches have behaved in the past, and use that to predict how the branch will go the next time 25 Mispredicted branches When a misprediction occurs, the processor has executed instructions that should not be executed it has to undo the effects of the falsely executed instructions It is not allowed to change the state of the processor until the branch outcome is known no writeback can be done before the outcome of the branch is ready The instructions that were executed because of a mispredicted branch have to be undone flush out the mispredicted instructions from the pipeline restart the instruction fetch from the correct branch target The performance penalty of a mispredicted branch is typically as many clock cycles as the length of the pipeline if (f(x)>n)! x = 0;! else! x = 1;!...! 26 13

14 Static branch prediction Fixed rules for predicting how a branch will behave the prediction is not based on the earlier behavior of the branch guess the outcome of the branch, and continue the execution with the predicted instruction Predict as taken / not taken the prediction is the same for all branches Direction-based prediction backward branches are taken forward branches are not taken success rate is about 65% for (i=0; i<n; i++)! {! X[i] = 0;! }! y=1;!...! 27 Dynamic branch prediction Static branch prediction does not consider the previous outcomes of branches branches often show regular and repetitive patterns In dynamic branch prediction the prediction is based on the outcome of previous executions of the branch collect branch history information, on which we base the prediction In practice it is not possible to store information about all branches in a program there is no upper limit on the number of branches in a program use a small and fast internal memory buffer to store information about branch outcomes 28 14

15 Branch history Branch history information is collected in a small cache memory called the branch history table indexed by a number of bits from the branch instruction address contains branch history information (taken/not taken) two branches may use the same table entry, which may lead to incorrect predictions Stores the outcome of the most recent branch executions need at least one bit in each table entry to store the outcome of the branch (taken / not taken) if no branch history exists, use static prediction The branch history information is used to fetch the next instruction from the predicted target done in the instruction fetch/decode phase in the pipeline 29 Branch target buffer The branch target buffer also stores the target address of the branch we can find the target address of the branch already in the instruction fetch phase if the branch is predicted as taken, we can immediately start fetching instructions from the branch target address Implemented by associative memory each entry holds the branch instruction address, the branch target address and the prediction (taken / not taken) Only need to store information about branches that are predicted to be taken if we find an entry in the table, it is predicted as taken otherwise, we predict it as not taken and continue execution with the following instruction used with one-bit branch history 30 15

16 One bit branch history Predict that the branch goes the same way as the last time it was executed if the prediction turns out to be wrong, invert the prediction bit Mispredicts both the first and the last iteration of a loop misprediction of the last iteration is inevitable, since the branch has been taken N-1 times (in a loop of length N) after executing the last, mispredicted, iteration of the loop the prediction bit is inverted causes a misprediction in the first iteration when we execute the loop the next time (state diagram: two states, Predict taken and Predict not taken, with a transition on each taken (T) / not taken (NT) outcome) 31 2-bit branch history Use two bits to store the branch history a prediction must miss twice before it is changed gives four different states 00 strongly not taken 01 weakly not taken 10 weakly taken 11 strongly taken can be implemented by a two-bit counter with saturation arithmetic (state diagram: Strong taken, Weak taken, Weak not taken, Strong not taken; a taken outcome moves one state towards Strong taken, a not taken outcome moves one state towards Strong not taken) Can generalize this scheme to N bits, but in practice 2 bits is enough for accurate branch prediction 32 16
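A minimal C sketch of a branch history table with 2-bit saturating counters, as described above. The table size and the way the branch address is hashed into an index are illustrative choices, not taken from any particular processor:

#include <stdint.h>

#define BHT_ENTRIES 4096                   /* must be a power of two */
static uint8_t bht[BHT_ENTRIES];           /* states 0..3:
                                              0 strongly not taken, 1 weakly not taken,
                                              2 weakly taken, 3 strongly taken */

static unsigned bht_index(uintptr_t branch_pc)
{
    return (branch_pc >> 2) & (BHT_ENTRIES - 1);   /* low bits of the branch address */
}

int predict_taken(uintptr_t branch_pc)
{
    return bht[bht_index(branch_pc)] >= 2;          /* weakly or strongly taken */
}

void update(uintptr_t branch_pc, int taken)
{
    uint8_t *c = &bht[bht_index(branch_pc)];
    if (taken && *c < 3) (*c)++;                    /* saturate at 3 */
    if (!taken && *c > 0) (*c)--;                   /* saturate at 0 */
}

Two different branches whose addresses hash to the same index share a counter, which is exactly the aliasing effect mentioned on the branch history slide.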

17 Two-level adaptive branch prediction The first level consists of a 4-bit shift register stores the history (taken / not taken) of the four last events of a branch instruction called a branch history register The second level consists of 16 2-bit counters 2-bit branch pattern history states as previously described incremented when branch taken, decremented when not taken The first level register is used as an index to select one of 16 pattern history states (m,n)-correlation based predictor use the m last branches to select one of 2^m predictors the predictors are n-bit Can predict very complex repetitive patterns... all repetitive patterns with a period of five or less Local and global branch prediction Local branch predictors store a separate history for each branch capture only the behavior of one branch Global branch predictors include information about several closely located branches capture the behavior of a group of (possibly) correlated branches Example: when the first if-statement is TRUE, the second will also be TRUE if (d==0)! d=1;! if (d==1)!...! 34 17
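A minimal sketch of the two-level scheme above, assuming a single 4-bit branch history register that indexes 16 two-bit pattern history counters; a real predictor keeps one such structure per branch (or per branch history table entry):

#include <stdint.h>

struct two_level {
    uint8_t history;        /* last 4 outcomes of this branch, newest in bit 0 */
    uint8_t pattern[16];    /* one 2-bit saturating counter per history pattern */
};

int tl_predict(const struct two_level *p)
{
    return p->pattern[p->history & 0xF] >= 2;       /* taken if weakly/strongly taken */
}

void tl_update(struct two_level *p, int taken)
{
    uint8_t *c = &p->pattern[p->history & 0xF];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    p->history = ((p->history << 1) | (taken ? 1 : 0)) & 0xF;  /* shift in the outcome */
}

Because the counter that makes the prediction is chosen by the last four outcomes, any repeating pattern whose period fits in the history register (five or less here, counting the selected counter itself) is eventually learned.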

18 Tournament branch predictor Multilevel branch predictor, which uses more than one branch prediction method chooses the method that previously has given the best result for each branch stores for each branch a selector to one of the methods, for instance using a 1-bit saturating counter Combines both local and global predictors, and uses the method that is best for each branch Requires more resources than simpler branch prediction methods more internal tables to maintain more complex decision mechanism 35 Predicting call/returns Procedure calls are unconditional branches always taken procedure returns need to be predicted Procedure calls and returns are paired one return for each procedure call can have nested procedure calls Use a return address stack (RAS) as a branch target buffer to predict the return address push the return address when the call instruction is executed pop it when the return instruction is executed Using advanced adaptive branch prediction mechanisms of this kind, it is possible to achieve up to 95% accuracy performance depends strongly on the code 36 18
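A minimal sketch of the return address stack mentioned above. The depth is an illustrative choice; real implementations typically wrap around silently when the stack overflows, which is what this sketch does as well:

#include <stdint.h>

#define RAS_DEPTH 16
static uintptr_t ras[RAS_DEPTH];
static int ras_top = 0;

void ras_on_call(uintptr_t return_address)     /* performed when a call is fetched */
{
    ras[ras_top] = return_address;
    ras_top = (ras_top + 1) % RAS_DEPTH;       /* on overflow the oldest entry is lost */
}

uintptr_t ras_predict_return(void)             /* performed when a return is fetched */
{
    ras_top = (ras_top - 1 + RAS_DEPTH) % RAS_DEPTH;
    return ras[ras_top];
}

Once the call depth exceeds RAS_DEPTH, the oldest return addresses have been overwritten and the corresponding returns will be mispredicted, which is why very deep call nesting is discouraged in the optimization guidelines later in these notes.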

19 Register renaming The instruction set architecture defines a set of logical registers visible to the (assembly language) programmer general-purpose registers (EAX, EBX, ECX, EDX,...) special registers (IP, SP, EFLAGS,...) The pipelined execution uses a much larger set of internal physical registers in the program execution called the register file register renaming dynamically associates logical registers to physical registers eliminates name dependences (figure: architectural registers R0 ... R5 mapped onto a larger set of physical registers) The dynamic instruction execution mechanism implements register renaming by the use of reservation stations 37 Forwarding Also called short-circuiting The result from an instruction is available in internal registers of the functional units after the execution stage no need to wait until the writeback stage to use the result in subsequent operations can use the results before the writeback stage has been completed load a,R0! add #1,R0! add R0,R1! Clock cycle IF ID X M WB IF ID X M WB IF ID X M WB 38 19
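A minimal sketch of the renaming idea, assuming a simple mapping table from logical to physical registers and a free list. The sizes are illustrative, and a real core combines this bookkeeping with the reservation stations and reorder buffer described below:

#define N_LOGICAL   16
#define N_PHYSICAL  64

static int map[N_LOGICAL];          /* logical register -> current physical register */
static int free_list[N_PHYSICAL];
static int n_free;

void rename_init(void)
{
    for (int i = 0; i < N_LOGICAL; i++) map[i] = i;
    n_free = 0;
    for (int p = N_LOGICAL; p < N_PHYSICAL; p++) free_list[n_free++] = p;
}

/* Rename one instruction "dst = src1 OP src2": the sources read the current
   mapping, the destination gets a fresh physical register, which removes
   WAR and WAW name dependences. Returns the physical destination, or -1 if
   no physical register is free (the instruction would have to stall).
   A fuller model returns the previous mapping of dst to the free list when
   the instruction retires. */
int rename_insn(int dst, int src1, int src2, int *psrc1, int *psrc2)
{
    *psrc1 = map[src1];
    *psrc2 = map[src2];
    if (n_free == 0)
        return -1;
    int pdst = free_list[--n_free];
    map[dst] = pdst;
    return pdst;
}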

20 In-order execution Instructions are executed in program order limits the opportunities for instruction level parallelism Dependences between instructions force them to be executed in program order the add instruction uses the value loaded into R1 the store instruction uses the value produced by the add in R1 the sub instruction modifies the value in R0 the branch condition depends on R0 L1:! load (R0),R1! add R2,R1! sto R1,(R0)! sub #4,R0! jnz L1! The chain of dependences can be as long as the entire program 39 Out-of-order instruction execution To be able to use more instruction level parallelism the processor may execute instructions out of order also called dynamic instruction execution or dynamic instruction scheduling Out-of-order program execution has to produce the same result as in-order execution instructions that are independent of each other can execute in any order (or in parallel) instructions that are dependent can not be reordered or executed in parallel The out-of-order execution mechanism has to be very conservative and guarantee that it does not change the result of the program 40 20

21 Out-of-order instruction execution (cont.) In dynamic instruction scheduling instructions are dynamically rearranged in order to keep the pipeline busy The instruction execution is allowed to rearrange the instructions to avoid hazards and keep the pipeline full executes the code in a more efficient way allows code optimized for one processor to execute efficiently on another processor Example: div R1,R0! add R0,R2! load a,R5! sub R6,R5! the add depends on the result of the division the load-instruction stalls until the div and add are ready No dependences between div/add and load/sub can execute the load/sub before the div/add, or both in parallel 41 Out-of-order execution The instruction fetch, execution and retirement are separated from each other in the pipeline instructions are fetched and decoded in order instructions are executed out of order results of the execution are retired (or written back) in order An instruction can be executed when all its operands are available, and a functional unit for the operation is available result of execution is stored in internal registers retired in program order, written back to registers or memory (diagram: Instruction fetch, Instruction decode and rename, Instruction window, Issue to reservation stations, Execution in parallel functional units, Retire / Write back) 42 21

22 Tomasulo's algorithm Method for dynamic instruction scheduling implements out-of-order instruction execution R.M. Tomasulo, An Efficient Algorithm for Exploiting Multiple Arithmetic Units, IBM J. of Res.&Dev. 11:1 (Jan 1967) developed for IBM 360/91 Similar out-of-order execution mechanisms are used in most current processors Avoids pipeline stalls due to dependences instructions whose operands are available can execute out of order Combines register renaming out-of-order instruction execution data forwarding (short-circuiting) 43 Reservation stations Buffers for operands of instructions that are waiting to be issued fetches and stores operands as soon as they are available replaces the registers in the instruction execution implements register renaming operands are identified by a unique tag if an operand has not yet been computed, the reservation station designates which other reservation station will produce the operand Eliminates the need to fetch/write operands from/to registers don't have to write results back to registers which will be immediately read by another instruction implements register renaming performs the same function as forwarding (short-circuiting) There are more reservation stations than registers 44 22

23 Reservation stations (cont.) Reservation stations, functional units and load/store buffers are connected by a Common Data Bus (CDB) memory accesses (load/store) are treated as functional units When the operands of an instruction are available, the instruction can be sent to a functional unit for execution Results of execution are broadcast on the CDB Reservation stations listen to the CDB for operand values if a matching tag is seen on the bus, the RS copies the value into its operand field all reservation stations waiting for the value are updated at the same time Implemented by associative memory can immediately identify a tag value on the CDB that it is waiting for 45 Organization of Tomasulo's algorithm (block diagram: instructions are fetched from memory into an instruction queue; the registers, the load and store buffers, the reorder buffer and the reservation stations feed the functional units (adder, multiplier), and their results are distributed over the Common Data Bus, with loads coming from memory and stores going to memory) 46 23

24 Data structures in Tomasulo's algorithm Have to store data describing the state of instructions in reservation stations, registers and load/store buffers Tags identify entries in reservation stations used as names for an extended set of registers points to the reservation station that holds or that will produce a result needed as an operand in an instruction also used to order the executed instructions in program order when they are retired Issued instructions refer to the operands by tag values not by the register names Registers need one additional field the tag of the reservation station that will produce the result to be stored in this register if zero, no currently active instruction is computing a result destined for this register 47 Stages in Tomasulo's algorithm Issue get the next instruction from the instruction queue get a free reservation station and assign the instruction and its operands to it, if they are available in registers if the operands are not available in registers, assign as operands the reservation stations that will produce them if there is no free reservation station, there is a structural hazard and the instruction must stall until one becomes available Execution if the operands are not ready, monitor the CDB and wait for the operands to be computed when the operands are ready, place them into the corresponding reservation stations dispatch the instruction to the functional unit for execution 48 24
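A compact C sketch of the per-reservation-station and per-register bookkeeping described above. The field names follow the common textbook presentation (busy, op, Vj/Vk for operand values, Qj/Qk and Qi for producing-station tags) rather than any specific hardware, and the broadcast routine corresponds to the write result stage described below:

enum op { OP_ADD, OP_MUL, OP_LOAD, OP_STORE };

struct rs {                 /* one reservation station */
    int     busy;
    enum op op;
    long    vj, vk;         /* operand values, if already available */
    int     qj, qk;         /* tags of the stations that will produce the operands;
                               0 means the value is already in vj/vk */
    long    result;
};

struct reg_status {
    long value;
    int  qi;                /* tag of the station that will write this register;
                               0 means the register holds the current value */
};

#define N_RS 8
static struct rs stations[N_RS + 1];    /* index 0 unused, so that tag 0 means "ready" */

/* Write result: broadcast (tag, value) on the common data bus. Every waiting
   reservation station and every register whose tag matches captures the value
   at the same time, and the producing station becomes free. */
void cdb_broadcast(int tag, long value, struct reg_status regs[], int nregs)
{
    for (int i = 1; i <= N_RS; i++) {
        if (stations[i].qj == tag) { stations[i].vj = value; stations[i].qj = 0; }
        if (stations[i].qk == tag) { stations[i].vk = value; stations[i].qk = 0; }
    }
    for (int r = 0; r < nregs; r++)
        if (regs[r].qi == tag) { regs[r].value = value; regs[r].qi = 0; }
    stations[tag].busy = 0;
}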

25 Stages in Tomasulo's algorithm Write result after an instruction is executed, broadcast the result on the CDB all reservation stations that are waiting for this as an operand will receive it mark the reservation station as free Instruction retirement is done in program order has to produce exactly the same result as in in-order execution Loads and stores may require additional steps, as they also have to compute the effective address 49 64-bit architectures Today, most server, desktop and even portable computers are equipped with a 64-bit processor can run both in 32- and 64-bit mode Intel 64 and AMD64 are 64-bit Instruction Set Architectures compatible with each other, some minor differences collectively referred to as x86-64 ISA backwards compatible with the 32-bit ISA IA-32 Other 64-bit architectures Intel IA-64, Alpha, Sun SPARC64, IBM Power, HP PA-RISC, MIPS Most modern operating systems run in 64-bit mode Most compilers can generate 64-bit code 50 25

26 Advantages of 64-bit architectures Long integers are 64 bits can represent integer values from -2^63 to 2^63-1 dynamic range is about 1.8 * 10^19 Pointers are 64 bits can theoretically address 16 Exabyte of main memory (that is about 1.8 * 10^19 bytes) current 64-bit systems use 48 address bits can support 256 TB of memory Floating-point values are represented the same way both in 32- and 64-bit architectures defined by the IEEE 754 standard float: 32 bits, double: 64 bits, long double: 80 bits All other numerical values are stored in the same way in 32- and 64-bit architectures 51 64-bit extended register sets In the x86-64 architecture the number of general-purpose registers is extended from 8 to 16 also extended in length from 32 to 64 bits similarly, the number of XMM registers is extended from 8 to 16 64-bit architectures have more registers much more of the program state can be kept in fast registers less need to access local variables on the stack more procedure arguments (up to 6) can be passed in registers More efficient procedure call mechanism less register pressure compilers can generate more compact code 52 26
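A quick way to check the 64-bit ranges and pointer size quoted above on an actual system (assumes a C99 compiler):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    printf("int64_t  : %" PRId64 " .. %" PRId64 "\n", INT64_MIN, INT64_MAX);
    printf("uint64_t : 0 .. %" PRIu64 "\n", UINT64_MAX);
    printf("sizeof(void *) = %zu bytes\n", sizeof(void *));
    /* On a typical LP64 system this prints
       -9223372036854775808 .. 9223372036854775807,
       18446744073709551615, and 8-byte pointers. */
    return 0;
}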

27 Other advantages The 64-bit architectures also support new instructions that may not be included in 32-bit architectures compilers for 64-bit architectures have better support for new instructions conditional move instruction floating-point operations with the SSE unit does not always need to keep a stack frame for procedure calls Disadvantage pointers need twice the amount of memory, 8 bytes instead of 4 bytes 53 Multi-core architectures Multi-core processors Dual-core: IBM POWER 5, AMD Opteron, Intel Core 2 Duo Quad-core: Intel Core 2 Quad, AMD Opteron (Barcelona) Six-core: AMD Opteron (Istanbul), Intel Xeon (Westmere) Twelve-core: AMD Opteron (Magny-Cours) Future high-end processors will have a multi-core design need parallel programming to take advantage of multi-core and hyperthreading architectures threaded programming or message-passing (MPI) 54 27
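A minimal POSIX-threads sketch of the threaded-programming style mentioned above; the thread count, array size and the per-element work are all illustrative:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N        (1 << 20)

static double data[N];

static void *worker(void *arg)
{
    long id = (long)arg;
    long chunk = N / NTHREADS;
    /* Each thread works on its own chunk: no sharing, no synchronization needed. */
    for (long i = id * chunk; i < (id + 1) * chunk; i++)
        data[i] = data[i] * 2.0 + 1.0;
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("done\n");
    return 0;
}

On a multi-core or hyperthreaded processor the operating system can schedule these threads onto different cores or logical processors, so the chunks are processed in parallel.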

28 Hyperthreading Hyperthreading or simultaneous multithreading two (or more) software threads can execute simultaneously on one processor core the processor architecture contains hardware support for efficient execution of threads Improves pipeline efficiency if one thread has to stall, another thread can use the processor cycles that otherwise would be idle Makes more efficient use of the physical execution resources uses task-level parallelism to increase the utilization of the execution resources Introduced in the Intel Pentium 4 processor in 2002 re-introduced in the Core i7 and Atom processors 55 Implementation of hyperthreading A single CPU appears as two logical processors the architectural state is duplicated (program counter, registers, status flags,... ) both logical processors share the same physical execution resources (figure: two architectural state blocks sharing one set of processor execution resources) The processor fetches and decodes instructions alternating between the two threads if one thread is blocked, the other gets full instruction fetch bandwidth The processor can simultaneously execute instructions from 2 threads instructions from both threads flow simultaneously through the pipeline the out-of-order execution mechanism does not need to know from which thread an instruction is 56 28

29 Processor architectures for embedded systems Processors for embedded systems are intended to be used in consumer products, wireless and networking devices, handheld and mobile devices, automotive industry, etc. Puts very special requirements on the processor architecture broad range of performance requirements from very small 8- or 16-bit processors to 32- or 64-bit processors with advanced signal processing capabilities must be available in a large number of different configurations low power consumption and heat dissipation low cost, high flexibility available also as a processor core which can be integrated with other hardware on the same chip small physical footprint (nr. of transistors) 57 Embedded systems processors Example of an ISA for embedded processors: ARMv6 backwards compatible with previous generations of ARM processors Characteristics of the microarchitecture short pipeline low clock frequency scalar processor design issues one instruction in each clock cycle, in-order execution a few separate functional units (e.g. ALU, Multiply/Add and Load/Store) simple branch prediction mechanism (static and dynamic) small cache size can choose between a number of alternative cache sizes SIMD instructions for media processing 58 29

30 Code optimization for out-of-order execution and branches High-level code optimization techniques that may improve pipelined execution and execution of branches arrange the code in large blocks of linear code break long dependency chains in the code avoid code with too many branches Create better possibilities for the out-of-order execution to proceed without interruption The compiler can automatically do low-level optimizations more advanced optimizations have to be done by the programmer on source code level Branches are unavoidable in programs branches are not inefficient, provided that they can be correctly predicted if the code has too many branches, they may exceed the capacity of the branch history buffers and lead to mispredictions 59 Avoiding branches Make sure that the compiler can generate conditional move instructions older compiler versions may not use the cmove instructions newer versions of GCC generate conditional moves The conditional expression in C/C++ is likely to be compiled with a cmove Avoid random branches, if possible Avoid indirect jumps and calls jump tables, function pointers Avoid very deep nesting of subroutines otherwise the Return Address Stack may overflow use iterative functions instead of recursive, if possible int max(int a, int b)! {! if (a>b) return a;! else return b;! }! int max(int a, int b)! {! return (a>b)? a : b;! }! 60 30

31 Branch density If possible, avoid code that contains too many branches avoid complex logical expressions that generate dense conditional branches, especially if the branch bodies are small in the AMD Opteron, more than three branches in a 16-byte code block leads to resource conflicts in the branch target buffer causes unnecessary branch mispredictions Branches can be eliminated by using conditional move or conditional set instructions it may also be possible to rewrite complex branches with assembly language code that uses conditional moves 61 Order of evaluation in Boolean expressions C and C++ use short-circuit evaluation for compound Boolean expressions in a Boolean expression (a OP b), the second argument is not evaluated if the first argument alone determines the value of the expression if a evaluates to TRUE in an expression if (a||b), then b is not evaluated if a evaluates to FALSE in an expression if (a&&b), then b is not evaluated If one of the expressions is known to be true more often than the other, arrange the expressions so that the evaluation is shortened if a is known to be TRUE 60% of the time and b is TRUE 10% of the time then you should arrange them as (b&&a) and (a||b) If one expression is more predictable, place that first If one expression is much faster to calculate, place that first If the Boolean expressions have side effects or are dependent, they can not necessarily be rearranged 62 31
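A small compilable sketch of the reordering rule above; cond_a() and cond_b() are hypothetical side-effect-free predicates with the probabilities from the example (a TRUE about 60% of the time, b about 10% of the time), modelled here with rand() purely for illustration:

#include <stdbool.h>
#include <stdlib.h>

static bool cond_a(void) { return rand() % 100 < 60; }   /* true ~60% of calls */
static bool cond_b(void) { return rand() % 100 < 10; }   /* true ~10% of calls */

bool both(void)
{
    /* && : evaluate the operand that is most often false first,
       so the second operand is usually not evaluated at all. */
    return cond_b() && cond_a();
}

bool either(void)
{
    /* || : evaluate the operand that is most often true first. */
    return cond_a() || cond_b();
}

This rearrangement is only legal because the two predicates are independent and have no side effects, as the last bullet above points out.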

32 Avoid unnecessary branches Use if-else-if constructs to avoid branches if the cases are mutually exclusive no need to evaluate all if-statements A switch statement is even better A table lookup can sometimes also be used no branch instructions in the generated assembly code double select(int a) {! double result;! if (a==0) result = 1.13;! if (a==1) result = 2.56;! if (a==2) result = 3.67;! if (a==3) result = 4.16;! if (a==4) result = 8.12;! return result;! }! double select(int a) {! double result;! if (a==0) result = 1.13;! else if (a==1) result = 2.56;! else if (a==2) result = 3.67;! else if (a==3) result = 4.16;! else if (a==4) result = 8.12;! return result;! }! double select(int a) {! double result[5] = {1.13, 2.56, 3.67, 4.16, 8.12};! return result[a];! }! 63 Order of branches Order branches in if- and switch-statements so that the most likely case comes first the other cases do not have to be evaluated if the first one is TRUE Use contiguously numbered case expressions, if possible the compiler can translate the switch-statement into a jump table if they are non-contiguous, use a series of if-else statements instead switch (value) {/* Most likely case first */! case 0: handle_0(); break;! case 1: handle_1(); break;! case 2: handle_2(); break;! case 3: handle_3(); break;! }! if (a==0) {! /* Handle case for a==0 */! }! else if (a==8) {! /* Handle case for a==8 */! }! else {! /* Handle default case */! }! 64 32

33 Loop unswitching Move loop-invariant conditional constructs out of the loop if- or switch-statements which are independent of the loop index can be moved outside of the loop the loop is instead repeated in the different branches of the if- or switch-statement removes branch instructions from within the loop Removes branch instructions, increases instruction level parallelism, improves possibilities to parallelize the loop but increases the amount of code for (i=0; i<n; i++)! { if (a>0)! X[i] = a;! else! X[i] = 0;! }! if (a>0)! { for (i=0; i<n; i++)! X[i] = a;! }! else! { for (i=0; i<n; i++)! X[i] = 0;! }! 65 Breaking up dependences Long dependency chains may slow down pipelined execution each instruction is dependent on data produced in the previous instruction Try to break up dependencies by computing partial results that are independent of each other compute parts of a complex expression in separate temporary variables Floating-point expressions are evaluated in the order specified in the source code b+c+d+e is evaluated as ((b+c)+d)+e Can explicitly specify the associative order of the computation may affect the precision for floating-point expressions (but not for integer) double a, b, c, d, e;! a = b+c+d+e;! double a, b, c, d, e;! a = (b+c)+(d+e);! 66 33

34 Loop unrolling Repeat the body of a loop k times and go through the loop with a step length k k is called the unrolling factor reduces branches, enables better instruction scheduling In general, we can not assume that the number of iterations (N) is divisible by the unrolling factor (k) have to modify the loop limit and take care of remaining elements for (i=0; i<length; i++)! sum += X[i];! const int k=5; /* Unrolling factor */! int limit = length-(k-1);! sum=0.0;! for (i=0; i<limit; i+=k) {! sum += X[i];! sum += X[i+1]; /* Unroll by 5 */! sum += X[i+2];! sum += X[i+3];! sum += X[i+4];! }! /* Finish remaining elements */! for (; i<length; i++)! sum += X[i];! 67 Loop unrolling, breaking dependences Can also break up dependences by computing partial results in the unrolled loop use 5 separate partial sums the computations of sum0 sum4 are independent and can be done at the same time Increases register use if some of the variables have to be kept in memory instead of in registers, the code may become slower const int k=5;! int limit = length-(k-1);! sum0=sum1=sum2=sum3=sum4=0.0;! for (i=0; i<limit; i+=k) {! sum0 += X[i];! sum1 += X[i+1];! sum2 += X[i+2];! sum3 += X[i+3];! sum4 += X[i+4];! }! for (; i<length; i++)! sum0 += X[i];! sum0 += sum1+sum2+sum3+sum4;! 68 34

35 Unrolling small loops Small loops, with a fixed loop count and a small loop body can be completely unrolled Example: perspective projection in 3D computer graphics /* 3D transform! Multiply the vector v with a 4x4 transformation matrix m */! for (i=0; i<4; i++) {! r[i] = 0;! for (j=0; j<4; j++) {! r[i] += m[j][i] * v[j];! }! }! r[0] = m[0][0]*v[0] + m[1][0]*v[1] + m[2][0]*v[2] + m[3][0]*v[3];! r[1] = m[0][1]*v[0] + m[1][1]*v[1] + m[2][1]*v[2] + m[3][1]*v[3];! r[2] = m[0][2]*v[0] + m[1][2]*v[1] + m[2][2]*v[2] + m[3][2]*v[3];! r[3] = m[0][3]*v[0] + m[1][3]*v[1] + m[2][3]*v[2] + m[3][3]*v[3];! 69 35


Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen Uniprocessors HPC Fall 2012 Prof. Robert van Engelen Overview PART I: Uniprocessors and Compiler Optimizations PART II: Multiprocessors and Parallel Programming Models Uniprocessors Processor architectures

More information

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University Advanced d Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by: Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by: Result forwarding (register bypassing) to reduce or eliminate stalls needed

More information

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) 1 EEC 581 Computer Architecture Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University Overview

More information

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2. Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)

More information

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , ) Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14) 1 1-Bit Prediction For each branch, keep track of what happened last time and use

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency Superscalar Processors Ch 13 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction 1 New dependency for superscalar case? (8) Name dependency (nimiriippuvuus) two use the same

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

ILP: Instruction Level Parallelism

ILP: Instruction Level Parallelism ILP: Instruction Level Parallelism Tassadaq Hussain Riphah International University Barcelona Supercomputing Center Universitat Politècnica de Catalunya Introduction Introduction Pipelining become universal

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU 1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high

More information

EECC551 Exam Review 4 questions out of 6 questions

EECC551 Exam Review 4 questions out of 6 questions EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving

More information

Super Scalar. Kalyan Basu March 21,

Super Scalar. Kalyan Basu March 21, Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

Pipelining, Branch Prediction, Trends

Pipelining, Branch Prediction, Trends Pipelining, Branch Prediction, Trends 10.1-10.4 Topics 10.1 Quantitative Analyses of Program Execution 10.2 From CISC to RISC 10.3 Pipelining the Datapath Branch Prediction, Delay Slots 10.4 Overlapping

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) 1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into

More information

Advanced Processor Architecture

Advanced Processor Architecture Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1 Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural

More information

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,

More information

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal pipeline CPI + stalls due to hazards invisible to programmer (unlike process level parallelism) ILP: overlap execution

More information

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard. COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped

More information

Hardware Speculation Support

Hardware Speculation Support Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification

More information

Instruction-Level Parallelism and Its Exploitation

Instruction-Level Parallelism and Its Exploitation Chapter 2 Instruction-Level Parallelism and Its Exploitation 1 Overview Instruction level parallelism Dynamic Scheduling Techniques es Scoreboarding Tomasulo s s Algorithm Reducing Branch Cost with Dynamic

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Processing Unit CS206T

Processing Unit CS206T Processing Unit CS206T Microprocessors The density of elements on processor chips continued to rise More and more elements were placed on each chip so that fewer and fewer chips were needed to construct

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated

More information

Structure of Computer Systems

Structure of Computer Systems 288 between this new matrix and the initial collision matrix M A, because the original forbidden latencies for functional unit A still have to be considered in later initiations. Figure 5.37. State diagram

More information

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines 6.823, L15--1 Branch Prediction & Speculative Execution Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 6.823, L15--2 Branch Penalties in Modern Pipelines UltraSPARC-III

More information

Pentium IV-XEON. Computer architectures M

Pentium IV-XEON. Computer architectures M Pentium IV-XEON Computer architectures M 1 Pentium IV block scheme 4 32 bytes parallel Four access ports to the EU 2 Pentium IV block scheme Address Generation Unit BTB Branch Target Buffer I-TLB Instruction

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple

More information

EEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW)

EEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW) EEC 581 Computer Architecture Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University

More information