CISC, RISC and Post-RISC architectures

Similar documents
Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Handout 2 ILP: Part B

Superscalar Processors Ch 14

Copyright 2012, Elsevier Inc. All rights reserved.

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Instruction Level Parallelism

Hardware-based Speculation

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Control Hazards. Prediction

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

Static vs. Dynamic Scheduling

ECE 571 Advanced Microprocessor-Based Design Lecture 4

5008: Computer Architecture

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

Metodologie di Progettazione Hardware-Software

COMPUTER ORGANIZATION AND DESI

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Chapter 4 The Processor 1. Chapter 4D. The Processor

Superscalar Processors

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

UNIT- 5. Chapter 12 Processor Structure and Function

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Complex Pipelines and Branch Prediction

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

November 7, 2014 Prediction

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

Control Hazards. Branch Prediction

Chapter 4. The Processor

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

Advanced Computer Architecture

The Processor: Instruction-Level Parallelism

Hardware-Based Speculation

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

Processor: Superscalars Dynamic Scheduling

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Advanced Computer Architecture

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

Adapted from David Patterson s slides on graduate computer architecture

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 12 Processor Structure and Function

Instruction Level Parallelism

Processor (IV) - advanced ILP. Hwansoo Han

Keywords and Review Questions

Lecture-13 (ROB and Multi-threading) CS422-Spring

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Advanced processor designs

CS146 Computer Architecture. Fall Midterm Exam

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

LECTURE 3: THE PROCESSOR

CS425 Computer Systems Architecture

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

Advanced issues in pipelining

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

ILP: Instruction Level Parallelism

Multi-cycle Instructions in the Pipeline (Floating Point)

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

EECC551 Exam Review 4 questions out of 6 questions

Super Scalar. Kalyan Basu March 21,

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Pipelining, Branch Prediction, Trends

Instruction Level Parallelism (ILP)

Advanced Processor Architecture

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

E0-243: Computer Architecture

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

Hardware Speculation Support

Instruction-Level Parallelism and Its Exploitation

Exploitation of instruction level parallelism

Processing Unit CS206T

Dynamic Control Hazard Avoidance

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Structure of Computer Systems

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines

Pentium IV-XEON. Computer architectures M

Tutorial 11. Final Exam Review

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

EEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW)

Transcription:

Microprocessor architecture and instruction execution
- CISC, RISC and Post-RISC architectures
- Instruction Set Architecture and microarchitecture
- Instruction encoding and machine instructions
- Pipelined and superscalar instruction execution
- Hazards and dependences
- Branch prediction
- Out-of-order instruction execution
- 32- and 64-bit architectures
- Multicore architectures and hyperthreading
- Processor architectures for embedded systems

CISC, RISC and Post-RISC architectures
CISC, Complex Instruction Set Computer:
- large instruction set; instructions can perform very complex operations, giving a powerful assembly language
- variable instruction formats
- large number of addressing modes
- few registers
- machine instructions implemented with microcode
RISC, Reduced Instruction Set Computer:
- relatively few instructions
- simple addressing modes; only load/store instructions access memory
- uniform instruction length
- many registers
- no microcode
- pipelined instruction execution
Modern processors have developed further from the basic ideas behind the RISC architecture.

Post-RISC architecture
Modern processors have developed further from the basic ideas behind the RISC architecture to exploit more instruction level parallelism.
Characteristics:
- parallel instruction execution (superscalar)
- deep pipeline (superpipelined)
- advanced branch prediction
- out-of-order instruction execution
- register renaming
- extended instruction set

Instruction Set Architecture
An abstract description of a processor as it is seen by an (assembly language) programmer or compiler writer:
- an abstract model of a processor
- defines the instructions, registers and mechanisms to access memory that the processor can use to operate on data
Specifies:
- the registers
- machine instructions and their encoding
- memory addresses
- addressing modes
Examples: Intel IA-32, Intel 64, AMD64
- defines a family of microprocessors, from the 8086 (1978) to the Intel Core i7
- all binary compatible (within certain limits)
- Intel 64 and IA-32 Architectures Software Developer's Manuals, available at http://www.intel.com/products/processor/manuals

Microarchitecture
The microarchitecture of a processor defines how the ISA is implemented in hardware:
- defines how the functionality of the ISA is implemented
- execution pipeline, functional units, memory organization, ...
Example (Intel processors):
- P6 microarchitecture: from the Intel Pentium Pro to the Pentium III
- NetBurst microarchitecture: Pentium 4, Xeon
- Core microarchitecture: Core 2, Xeon
- Nehalem microarchitecture: Core i5, Core i7
The physical details (circuit layout, hardware construction, packaging, etc.) are an implementation of the microarchitecture:
- two processors can have the same microarchitecture, but different hardware implementations
- for instance 90 nm transistor technology, 65 nm, 45 nm high-k metal gate technology, or 32 nm technology

Instruction encoding
Assembly language instructions are encoded into numerical machine instructions by the assembler.
Instruction formats can be of different types:
- variable length
  - supports a varying number of operands
  - typically used in CISC architectures: PDP-11, VAX, Motorola 68000
- fixed format
  - always the same number of operands
  - the addressing mode is specified as part of the opcode
  - easy to decode, all instructions have the same form
  - typically used in RISC architectures: SPARC, PowerPC, MIPS
- hybrid format
  - multiple formats, depending on the operation
  - used in most Intel and AMD processors: IA-32, Intel 64, AMD64
  - machine instructions are split into micro-operations before they are executed

Assembly language instructions
The instruction set specifies the machine instructions that the processor can execute, expressed as assembly language instructions.
Instructions can have 2 or 3 operands:
- add a,b      a = a + b, result overwrites a
- add c,a,b    c = a + b, result placed in c
The number of memory references in an instruction can be:
- 0: load/store (RISC)
- 1: Intel x86
- 2 or 3: CISC architectures
Translated to binary machine code (opcodes) by an assembler; machine instructions can be of different lengths.
Example, C code: c = a + b;
Assembly language, register-memory:
  load a, R1
  add  b, R1
  sto  R1, c
Assembly language, load-store:
  load a, R1
  load b, R2
  add  R2, R1
  sto  R1, c

Micro-operations
Machine instructions are decoded into micro-operations (µops) before they are executed.
Simple instructions generate only one µop:
- Example: ADD %RBX, %RAX adds the register RBX to the register RAX
More complex instructions generate several µops:
- Example: ADD %R8, (MEM) adds register R8 to the contents of the memory position with address MEM
- may generate 3 µops:
  - load the value at address MEM into a register
  - add the value in register R8 to the register
  - store the register to the memory location at address MEM
For complex addressing modes the effective address also has to be computed.
Micro-operations can be efficiently executed out-of-order in a pipelined fashion.
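As an added illustration (not part of the original slides), a C-level analogy of how a memory-destination add such as ADD %R8, (MEM) can be split into the three simpler steps listed above; the function and variable names are hypothetical.

  #include <stdint.h>

  /* Sketch: the single instruction "add R8 to the memory word at MEM"
     behaves like three separate micro-operations. */
  void memory_add(int64_t *mem, int64_t r8) {
      int64_t tmp = *mem;   /* uop 1: load the value at address MEM       */
      tmp = tmp + r8;       /* uop 2: add the value in register R8        */
      *mem = tmp;           /* uop 3: store the result back to MEM        */
  }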

Instruction pipelining
Instruction execution is divided into a number of stages:
- instruction fetch (IF)
- instruction decode (ID)
- execute (X)
- memory (M)
- writeback (W)
Instructions enter the pipeline at IF and results come out after W.
The time to move an instruction one step through the pipeline is called a machine cycle:
- can complete one instruction every cycle
- without pipelining we could complete one instruction every 5 cycles
CPI, Clock Cycles Per Instruction:
- the number of cycles needed to execute an instruction varies for different instructions

Pipelined execution
Instruction fetch (IF):
- the next instruction is fetched from memory at the address pointed to by the Program Counter
- increment the PC so that it points to the next instruction
Instruction decode (ID):
- decode the instruction and identify which type it is
- immediate constant values are extended into 32/64 bits
Execution (X):
- if it is an arithmetic operation, execute it in an ALU
- if it is a load/store, the address of the operand is computed
- if it is a branch, set the PC to the destination address
Memory access (M):
- if it is a load, fetch the content of the address from memory
- if it is a store, write the operand to the specified address in memory
- if it is neither a load nor a store, do nothing
Writeback (W):
- write the result of the operation to the destination register
- if it is a branch or a store, do nothing

Pipelined instruction execution
All pipeline stages can execute in parallel (Instruction Level Parallelism):
- separate hardware units for each stage
(Figure: eight successive instructions, load a,R1; load b,R2; load c,R3; load d,R4; add R2,R1; add #1,R1; sub R3,R1; mul R4,R1, flowing through the pipeline over clock cycles 1 to 11.)
After 5 clock cycles the pipeline is full:
- finishes one instruction every clock period
- it takes 5 clock periods to complete one instruction
Pipelining increases the CPU instruction throughput; it does not reduce the time to execute an individual instruction.

Throughput and latency
Throughput: the number of instructions a pipeline can execute per time unit.
Latency: the number of clock cycles it takes for the pipeline to complete the execution of an individual instruction.
Different instructions have different latency and throughput.
Pipeline examples:
- Pentium 3 has a 14-stage pipeline
- Pentium 4 has a 20-stage pipeline
- Core 2 (Nehalem) has a 14-stage pipeline
- AMD Opteron (Barcelona) has a 12-stage pipeline
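As a rough, added illustration of the throughput gain (not from the slides), a minimal C sketch that computes cycle counts under the idealized assumption of no stalls; the 5-stage depth and the instruction count of 8 are example values.

  #include <stdio.h>

  int main(void) {
      const int stages = 5;   /* pipeline depth (assumed) */
      const int n = 8;        /* number of instructions (assumed) */
      /* Idealized, stall-free cycle counts:
         pipelined:   stages + (n - 1)  (one instruction completes per cycle once the pipeline is full)
         unpipelined: stages * n        (each instruction occupies all stages alone) */
      int pipelined   = stages + (n - 1);
      int unpipelined = stages * n;
      printf("pipelined: %d cycles, unpipelined: %d cycles, speedup: %.2f\n",
             pipelined, unpipelined, (double)unpipelined / pipelined);
      return 0;
  }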

Superscalar architecture
Increases the ability of the processor to use instruction level parallelism.
Multiple instructions are issued every cycle:
- multiple pipelines or functional units operating in parallel
Example: 3 parallel pipelines, each with 5 stages:
- a 3-way superscalar processor, also called a 3-issue processor
(Figure: instructions 1 to 9 issued three at a time, flowing through the pipeline over clock cycles 1 to 7.)

Pipeline hazards
Pipelined execution is efficient if the flow of instructions through the pipeline can proceed without being interrupted.
Situations that prevent an instruction in the stream from executing during its clock cycle are called pipeline hazards:
- hazards may force the pipeline to stall
- may have to stop the instruction fetch for a number of cycles, until all the resources that are needed become available
- also called a pipeline bubble
Structural hazards:
- caused by resource conflicts: two instructions need the same hardware unit in the same pipeline stage
Data hazards:
- arise when an instruction depends on the result of a previous instruction which has not completed yet
Control hazards:
- caused by branches in the instruction stream

Structural hazards
Caused by resource conflicts:
- the hardware cannot simultaneously execute two instructions that need access to the same (single) functional unit
- for instance, if instructions and data are fetched from the same memory port
Can be avoided by:
- duplicating functional units or access paths to memory
- pipelining functional units
- stalling the instruction execution for at least one cycle, which creates a pipeline bubble

Data hazards
An instruction depends on the result of a previous instruction which has not completed yet:
- caused by dependences among the data
- different types of data hazards: read-after-write, write-after-write and write-after-read
Example:
  load a, R1
  load b, R2
  add  R1, R2
- the loads write the values into the registers in the write-back stage
- R1 will be ready in cycle 4, R2 will be ready in cycle 5
- the add must stall until both R1 and R2 are ready
Can be avoided by forwarding and register renaming.
(Figure: pipeline diagram of the three instructions over clock cycles 0 to 8, each going through IF ID X M WB.)

Control hazards
Branch instructions transfer control in the program execution:
- may assign a new value to the PC
Conditional branches may be taken or not taken:
- a taken branch assigns the target address to the PC
- a branch that is not taken (which falls through) continues at the next instruction
The instruction is recognized as a branch in the instruction decode phase:
- can decide whether the branch will be taken or not in the execute stage
- the next instruction has to stall
Can be avoided by branch prediction.
Example:
  ...
  jnz L1
  add #1, R2
  sub R4, R3
  L1: mov #0, R1
(Figure: pipeline diagram showing jnz L1, add #1,R2 and sub R4,R3 over clock cycles 0 to 6, each going through IF ID X M WB.)

Dependence
Pipeline hazards are caused by dependences in the code:
- they limit the amount of instruction level parallelism that can be used
Can avoid hazards (and pipeline stalls) in the execution of a program by using more advanced instruction execution mechanisms:
- forwarding
- register renaming
- instruction scheduling
- branch prediction
- dynamic instruction execution
Can also eliminate some dependences by code transformations:
- formulate the program in an alternative way, avoiding some dependences

Data and control dependence
Data dependence:
- data must be produced and consumed in the correct order in the program execution
- Definition: two statements s and t are data dependent if and only if both statements access the same memory location and at least one of them stores into it, and there is a feasible run-time execution path from s to t
Control dependence:
- determines the ordering of instructions with respect to branches
Example:
  if p1 then s1 else s2;
- s1 and s2 are control dependent on p1
- we have to first execute p1 before we know which of s1 or s2 should be executed

Data dependence
Three types of data dependences:
- true dependence
- anti-dependence
- output dependence
Anti-dependence and output dependence are called name dependences:
- two instructions use the same register or memory location, but there is no actual flow of data between the two instructions
- no real dependence between data, only between the names, i.e. the registers or memory locations (variables) that are used to hold the data

True dependence
An instruction depends on data from a previous instruction:
- the first statement stores into a location that is later read by the second statement
- cannot execute the statements in the reverse order
- cannot execute the statements simultaneously in the pipeline without causing a stall
Example:
  x = a*2;
  y = x+1;
Corresponds to a Read After Write (RAW) data hazard between the two instructions.

Anti-dependence
The first instruction reads from a location into which the second statement stores.
Example:
  y = x+a;
  x = b;
Corresponds to a Write After Read (WAR) hazard between the two instructions.
No value is transmitted between the two statements:
- they can be executed simultaneously if we choose another name for x in the assignment statement x = b;
Can be avoided by using register renaming:
- use different registers for the variable x in the two statements

Output dependence
Two instructions write to the same register or memory location.
Corresponds to a Write After Write (WAW) hazard between the two instructions.
Example:
  x = x+1;
  ...
  x = b;
No value is transmitted between the two statements:
- they can be executed simultaneously if we choose another name for one of the references to x
Can be avoided by using register renaming.

Control dependence
Determines the order of instructions with respect to branches (jumps) in the code:
- if p evaluates to TRUE, the instruction s1 is executed
- if p evaluates to FALSE, the instruction s1 is not executed, but is branched over
Example:
  s0;
  if (p) then s1;
  s2;
Instructions that are control dependent on a branch cannot be moved before the branch:
- instructions from the then-part of an if-statement cannot be executed before the branch
Instructions that are not control dependent on a branch cannot be moved into the then-part.
Can avoid hazards by using branch prediction and speculative execution.

Branch prediction
To avoid stalling the pipeline when branch instructions are executed, branch prediction is used:
- it is very important to have a good branch prediction mechanism, since branches are very common in most programs
- example: 20% branch instructions in the SPECint92 benchmark
Branch prediction is only needed for conditional branches; unconditional branches are always taken:
- subroutine calls and goto-statements are always taken
- returns from subroutines need to be predicted
Two types of branch prediction mechanisms:
- static: uses fixed rules for guessing how branches will go
- dynamic: collects information about how the branches have behaved in the past, and uses that to predict how the branch will go the next time

Mispredicted branches
When a misprediction occurs, the processor has executed instructions that should not be executed:
- it has to undo the effects of the falsely executed instructions
It is not allowed to change the state of the processor until the branch outcome is known:
- no writeback can be done before the outcome of the branch is ready
The instructions that were executed because of a mispredicted branch have to be undone:
- flush the mispredicted instructions out of the pipeline
- restart the instruction fetch from the correct branch target
The performance penalty of a mispredicted branch is typically as many clock cycles as the length of the pipeline.
Example:
  if (f(x)>n)
    x = 0;
  else
    x = 1;
  ...

Static branch prediction
Fixed rules for predicting how a branch will behave:
- the prediction is not based on the earlier behavior of the branch
- guess the outcome of the branch, and continue the execution with the predicted instruction
Predict as taken / not taken:
- the prediction is the same for all branches
Direction-based prediction:
- backward branches are taken, forward branches are not taken
- success rate is about 65%
Example:
  for (i=0; i<n; i++)
  {
    X[i] = 0;
  }
  y=1;
  ...

Dynamic branch prediction
Static branch prediction does not consider the previous outcomes of branches:
- branches often show regular and repetitive patterns
In dynamic branch prediction the prediction is based on the outcome of previous executions of the branch:
- collect branch history information, on which we base the prediction
In practice it is not possible to store information about all branches in a program:
- there is no upper limit on the number of branches in a program
- use a small and fast internal memory buffer to store information about branch outcomes

Branch history
Branch history information is collected in a small cache memory called the branch history table:
- indexed by a number of bits from the branch instruction address
- contains branch history information (taken / not taken)
- two branches may use the same table entry, which may lead to incorrect predictions
Stores the outcome of the most recent branch executions:
- need at least one bit in each table entry to store the outcome of the branch (taken / not taken)
- if no branch history exists, use static prediction
The branch history information is used to fetch the next instruction from the predicted target:
- done in the instruction fetch/decode phase in the pipeline

Branch target buffer
The branch target buffer also stores the target address of the branch:
- we can find the target address of the branch already in the instruction fetch phase
- if the branch is predicted as taken, we can immediately start fetching instructions from the branch target address
Implemented by associative memory.
(Figure: table indexed by the branch instruction address, each entry holding the branch prediction (taken / not taken) and the branch target address.)
Only need to store information about branches that are predicted to be taken:
- if we find an entry in the table, it is predicted as taken
- otherwise, we predict it as not taken and continue execution with the following instruction
- used with one-bit branch history
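To make the indexing concrete, here is a minimal C sketch (added, not from the slides) of a direct-mapped branch target buffer with one-bit history; the table size, field layout and function names are assumptions.

  #include <stdbool.h>
  #include <stdint.h>

  #define BTB_ENTRIES 1024   /* assumed table size (a power of two) */

  typedef struct {
      uint64_t tag;      /* address of the branch instruction         */
      uint64_t target;   /* predicted branch target address           */
      bool     valid;    /* entry present: branch was taken last time */
  } btb_entry_t;

  static btb_entry_t btb[BTB_ENTRIES];

  /* Index with low-order bits of the branch address; two branches that map
     to the same index alias to one entry, which can cause mispredictions. */
  static unsigned btb_index(uint64_t pc) {
      return (unsigned)(pc >> 2) & (BTB_ENTRIES - 1);
  }

  /* Lookup in the fetch stage: a hit means "predict taken, fetch from target";
     a miss means "predict not taken, continue with the next instruction". */
  bool btb_predict(uint64_t pc, uint64_t *target) {
      btb_entry_t *e = &btb[btb_index(pc)];
      if (e->valid && e->tag == pc) { *target = e->target; return true; }
      return false;
  }

  /* Update once the real outcome is known (one-bit history: remember only
     the most recent outcome). */
  void btb_update(uint64_t pc, bool taken, uint64_t target) {
      btb_entry_t *e = &btb[btb_index(pc)];
      if (taken) { e->tag = pc; e->target = target; e->valid = true; }
      else if (e->tag == pc) { e->valid = false; }
  }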

One-bit branch history
Predict that the branch goes the same way as the last time it was executed:
- if the prediction turns out to be wrong, invert the prediction bit
Mispredicts both the first and the last iteration of a loop:
- misprediction of the last iteration is inevitable, since the branch has been taken N-1 times (in a loop of length N)
- after executing the last, mispredicted, iteration of the loop the prediction bit is inverted
- this causes a misprediction in the first iteration when we execute the loop the next time
(Figure: two-state diagram with states "Predict taken" and "Predict not taken", switching state on a mispredicted outcome, T / NT.)

2-bit branch history
Use two bits to store the branch history:
- a prediction must miss twice before it is changed
- gives four different states: 00 strongly not taken, 01 weakly not taken, 10 weakly taken, 11 strongly taken
- can be implemented by a two-bit counter with saturation arithmetic
(Figure: four-state diagram, Strong taken (11), Weak taken (10), Weak not taken (01), Strong not taken (00), moving toward the taken states on taken outcomes and toward the not-taken states on not-taken outcomes.)
Can generalize this scheme to N bits, but in practice 2 bits is enough for accurate branch prediction.
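A minimal C sketch (added, not from the slides) of the 2-bit saturating counter described above; the encoding 0 to 3 follows the state numbering on the slide.

  #include <stdbool.h>
  #include <stdint.h>

  /* 2-bit saturating counter: 0 = strongly not taken, 1 = weakly not taken,
     2 = weakly taken, 3 = strongly taken. */
  typedef uint8_t counter2_t;

  /* Predict taken when the counter is in one of the two "taken" states. */
  bool predict_taken(counter2_t c) {
      return c >= 2;
  }

  /* Update with saturation: move toward 3 on taken, toward 0 on not taken,
     so a single misprediction does not flip a "strong" prediction. */
  counter2_t update_counter(counter2_t c, bool taken) {
      if (taken) return (counter2_t)(c < 3 ? c + 1 : 3);
      else       return (counter2_t)(c > 0 ? c - 1 : 0);
  }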

Two-level adaptive branch prediction
The first level consists of a 4-bit shift register:
- stores the history (taken / not taken) of the four last events of a branch instruction
- called a branch history register
The second level consists of 16 2-bit counters:
- 2-bit branch pattern history states as previously described
- incremented when the branch is taken, decremented when not taken
The first-level register is used as an index to select one of the 16 pattern history states.
(m,n)-correlation based predictor:
- use the m last branches to select one of 2^m predictors
- the predictors are n-bit
Can predict very complex repetitive patterns:
- all repetitive patterns with a period of five or less
(Figure: a 4-bit branch history register indexing a table of 16 two-bit counters, entries numbered 0 to 15.)
A small C sketch of this two-level scheme is shown below, after the next slide.

Local and global branch prediction
Local branch predictors:
- store a separate history for each branch
- capture only the behavior of one branch
Global branch predictors:
- include information about several closely located branches
- capture the behavior of a group of (possibly) correlated branches
Example: when the first if-statement is TRUE, the second will also be TRUE:
  if (d==0)
    d=1;
  if (d==1)
    ...
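As an added illustration (not from the slides), a minimal C sketch of the two-level scheme above: a 4-bit history register selects one of 16 2-bit saturating counters. The single-branch scope and the helper names are simplifying assumptions.

  #include <stdbool.h>
  #include <stdint.h>

  #define HISTORY_BITS 4
  #define PATTERNS (1 << HISTORY_BITS)   /* 16 pattern history entries */

  typedef struct {
      uint8_t history;              /* 4-bit branch history register      */
      uint8_t counters[PATTERNS];   /* 2-bit saturating counters, 0..3    */
  } two_level_predictor_t;

  /* First level: the last 4 outcomes select a counter; second level:
     the selected 2-bit counter gives the prediction. */
  bool tl_predict(const two_level_predictor_t *p) {
      return p->counters[p->history & (PATTERNS - 1)] >= 2;
  }

  /* Update the selected counter with saturation, then shift the new
     outcome into the history register. */
  void tl_update(two_level_predictor_t *p, bool taken) {
      uint8_t *c = &p->counters[p->history & (PATTERNS - 1)];
      if (taken && *c < 3) (*c)++;
      if (!taken && *c > 0) (*c)--;
      p->history = (uint8_t)(((p->history << 1) | (taken ? 1 : 0)) & (PATTERNS - 1));
  }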

Tournament branch predictor
A multilevel branch predictor which uses more than one branch prediction method:
- chooses the method that has previously given the best result for each branch
- stores for each branch a selector to one of the methods, for instance using a 1-bit saturating counter
Combines both local and global predictors, and uses the method that is best for each branch.
Requires more resources than simpler branch prediction methods:
- more internal tables to maintain
- more complex decision mechanism

Predicting call/returns
Procedure calls are unconditional branches:
- always taken
- procedure returns need to be predicted
Procedure calls and returns are paired:
- one return for each procedure call
- can have nested procedure calls
Use a return address stack (RAS) as a branch target buffer to predict the return address:
- push the return address when the call instruction is executed
- pop it when the return instruction is executed
Using advanced adaptive branch prediction mechanisms of this kind, it is possible to achieve up to 95% accuracy:
- performance depends strongly on the code
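A minimal C sketch (added, not from the slides) of such a return address stack; the fixed depth and the wrap-around overflow behavior are assumptions.

  #include <stdint.h>

  #define RAS_DEPTH 16                 /* assumed stack depth */

  static uint64_t ras[RAS_DEPTH];
  static int ras_top = 0;

  /* On a call instruction: push the address of the instruction after the call. */
  void ras_push(uint64_t return_address) {
      ras[ras_top % RAS_DEPTH] = return_address;   /* oldest entry is overwritten on overflow */
      ras_top++;
  }

  /* On a return instruction: pop the predicted return address. */
  uint64_t ras_pop(void) {
      if (ras_top > 0) ras_top--;
      return ras[ras_top % RAS_DEPTH];
  }

Deeply nested or recursive calls overflow the fixed-depth stack and overwrite the oldest entries, which is one reason the later slides recommend avoiding very deep nesting of subroutines.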

Register renaming
The instruction set architecture defines a set of logical registers visible to the (assembly language) programmer:
- general-purpose registers (EAX, EBX, ECX, EDX, ...)
- special registers (IP, SP, EFLAGS, ...)
The pipelined execution uses a much larger set of internal physical registers in the program execution:
- called the register file
- register renaming dynamically associates logical registers with physical registers
- eliminates name dependences
The dynamic instruction execution mechanism implements register renaming by the use of reservation stations.
(Figure: architectural registers R0 to R5 mapped onto a larger set of physical registers.)
A small C sketch of a rename map is shown below, after the next slide.

Forwarding
Also named short-circuiting.
The result from an instruction is available in internal registers of the functional units after the execution stage:
- no need to wait until the writeback stage to use the result in subsequent operations
- can use the results before the writeback stage has been completed
Example:
  load a, R0
  add #1, R0
  add R0, R1
(Figure: pipeline diagram of the three instructions over clock cycles 0 to 6, each going through IF ID X M WB, with results forwarded from the execution stage.)
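To illustrate the idea of renaming (added, not from the slides), a minimal C sketch of a rename map that hands out a fresh physical register for every new value written to a logical register; the sizes and the naive free-list policy are simplifying assumptions, and recycling of physical registers is ignored.

  #define LOGICAL_REGS  8
  #define PHYSICAL_REGS 32

  static int rename_map[LOGICAL_REGS];   /* logical register -> current physical register */
  static int next_free = LOGICAL_REGS;   /* naive free list: hand out physical registers in order */

  void rename_init(void) {
      for (int i = 0; i < LOGICAL_REGS; i++) rename_map[i] = i;
      next_free = LOGICAL_REGS;
  }

  /* A write to a logical register gets a new physical register, so a later
     write (WAW) or an earlier read (WAR) of the same logical register no
     longer conflicts with it: the name dependence disappears. */
  int rename_dest(int logical) {
      rename_map[logical] = next_free;
      next_free = (next_free + 1) % PHYSICAL_REGS;   /* sketch only: ignores recycling */
      return rename_map[logical];
  }

  /* A read uses whatever physical register currently holds the logical value. */
  int rename_src(int logical) {
      return rename_map[logical];
  }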

In-order execution
Instructions are executed in program order:
- limits the opportunities for instruction level parallelism
Dependences between instructions force them to be executed in program order.
Example:
  L1: load (R0),R1
      add R2,R1
      sto R1,(R0)
      sub #4,R0
      jnz L1
- the add instruction uses the value loaded into R1
- the sto instruction uses the value produced by the add
- the sub instruction modifies the value in R0
- the branch condition depends on R0
The chain of dependences can be as long as the entire program.

Out-of-order instruction execution
To be able to use more instruction level parallelism the processor may execute instructions out of order:
- also called dynamic instruction execution or dynamic instruction scheduling
Out-of-order program execution has to produce the same result as in-order execution:
- instructions that are independent of each other can execute in any order (or in parallel)
- instructions that are dependent cannot be reordered or executed in parallel
The out-of-order execution mechanism has to be very conservative and guarantee that it does not change the result of the program.

Out-of-order instruction execution (cont.)
In dynamic instruction scheduling, instructions are dynamically rearranged in order to keep the pipeline busy.
The instruction execution is allowed to rearrange the instructions to avoid hazards and keep the pipeline full:
- executes the code in a more efficient way
- allows code optimized for one processor to execute efficiently on another processor
Example:
  div R1,R0      the add depends on the result of the division
  add R0,R2      the load instruction stalls until the div and add are ready
  load a,R5
  sub R6,R5
- no dependences between div/add and load/sub
- can execute the load/sub before the div/add, or both in parallel

Out-of-order execution
The instruction fetch, execution and retirement are separated from each other in the pipeline:
- instructions are fetched and decoded in order
- instructions are executed out of order
- results of the execution are retired (or written back) in order
An instruction can be executed when all its operands are available and a functional unit for the operation is available:
- the result of execution is stored in internal registers
- retired in program order, written back to registers or memory
(Figure: instruction fetch, instruction decode and rename, instruction window, issue to reservation stations (RS) and execution units, retire and write back.)

Tomasulo's algorithm
Method for dynamic instruction scheduling:
- implements out-of-order instruction execution
- R.M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units", IBM Journal of Research and Development 11:1 (Jan 1967)
- developed for the IBM 360/91
Similar out-of-order execution mechanisms are used in most current processors.
Avoids pipeline stalls due to dependences:
- instructions whose operands are available can execute out of order
Combines:
- register renaming
- out-of-order instruction execution
- data forwarding (short-circuiting)

Reservation stations
Buffers for operands of instructions that are waiting to be issued:
- fetch and store operands as soon as they are available
- replace the registers in the instruction execution, implementing register renaming
- operands are identified by a unique tag
- if an operand has not yet been computed, the reservation station designates which other reservation station will produce the operand
Eliminates the need to fetch/write operands from/to registers:
- don't have to write results back to registers which will be immediately read by another instruction
- implements register renaming
- performs the same function as forwarding (short-circuiting)
There are more reservation stations than registers.
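As an added illustration of the bookkeeping (not from the slides), a minimal C sketch of one reservation-station entry with tagged operands; the field names and types are assumptions.

  #include <stdbool.h>
  #include <stdint.h>

  /* One reservation station entry. An operand is either a value that is
     already available, or the tag of the reservation station that will
     produce it (broadcast later on the Common Data Bus). */
  typedef struct {
      bool    busy;             /* entry holds an instruction waiting to issue */
      int     op;               /* operation to perform (e.g. add, mul)        */
      bool    ready1, ready2;   /* operand available?                          */
      int64_t value1, value2;   /* operand values, valid when ready            */
      int     tag1, tag2;       /* producing RS tags, valid when not ready     */
  } reservation_station_t;

  /* The instruction can be dispatched to a functional unit only when both
     operands are present. */
  bool can_execute(const reservation_station_t *rs) {
      return rs->busy && rs->ready1 && rs->ready2;
  }

  /* Called when a result with tag 'tag' is broadcast on the CDB: every
     waiting station that matches the tag captures the value. */
  void cdb_broadcast(reservation_station_t *rs, int tag, int64_t value) {
      if (rs->busy && !rs->ready1 && rs->tag1 == tag) { rs->value1 = value; rs->ready1 = true; }
      if (rs->busy && !rs->ready2 && rs->tag2 == tag) { rs->value2 = value; rs->ready2 = true; }
  }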

Reservation stations (cont.)
Reservation stations, functional units and load/store buffers are connected by a Common Data Bus (CDB):
- memory accesses (load/store) are treated as functional units
When the operands of an instruction are available, the instruction can be sent to a functional unit for execution.
Results of execution are broadcast on the CDB.
Reservation stations listen to the CDB for operand values:
- if a matching tag is seen on the bus, the RS copies the value into its operand field
- all reservation stations waiting for the value are updated at the same time
Implemented by associative memory:
- a reservation station can immediately identify a tag value on the CDB that it is waiting for
(Figure: functional units and reservation stations connected by the Common Data Bus.)

Organization of Tomasulo's algorithm
(Figure: instruction fetch feeding instructions to the reservation stations and the reorder buffer; registers, a load buffer from memory and a store buffer to memory; adder and multiplier functional units; all connected by the Common Data Bus.)

Data structures in Tomasulo's algorithm
Have to store data describing the state of instructions in reservation stations, registers and load/store buffers.
Tags identify entries in reservation stations:
- used as names for an extended set of registers
- a tag points to the reservation station that holds, or will produce, a result needed as an operand in an instruction
- also used to order the executed instructions in program order when they are retired
Issued instructions refer to the operands by tag values, not by the register names.
Registers need one additional field:
- the tag of the reservation station that will produce the result to be stored in this register
- if zero, no currently active instruction is computing a result destined for this register

Stages in Tomasulo's algorithm
Issue:
- get the next instruction from the instruction queue
- get a free reservation station and assign the instruction and its operands to it, if they are available in registers
- if the operands are not available in registers, assign as operands the reservation stations that will produce them
- if there is no free reservation station, there is a structural hazard and the instruction must stall until one becomes available
Execution:
- if the operands are not ready, monitor the CDB and wait for the operands to be computed
- when the operands are ready, place them into the corresponding reservation stations
- dispatch the instruction to the functional unit for execution

Stages in Tomasulo's algorithm (cont.)
Write result:
- after an instruction is executed, broadcast the result on the CDB
- all reservation stations that are waiting for this as an operand will receive it
- mark the reservation station as free
Instruction retirement is done in program order:
- has to produce exactly the same result as in-order execution
Loads and stores may require additional steps, as they also have to compute the effective address.

64-bit architectures
Today, most server, desktop and even portable computers are equipped with a 64-bit processor:
- can run both in 32- and 64-bit mode
Intel 64 and AMD64 are 64-bit Instruction Set Architectures:
- compatible with each other, with some minor differences
- collectively referred to as the x86-64 ISA
- backwards compatible with the 32-bit ISA IA-32
Other 64-bit architectures: Intel IA-64, Alpha, Sun SPARC64, IBM Power, HP PA-RISC, MIPS.
Most modern operating systems run in 64-bit mode.
Most compilers can generate 64-bit code.

Advantages of 64-bit architectures
Long integers are 64 bits:
- can represent integer values from -2^63 to +2^63 - 1
- the dynamic range is about 1.8 * 10^19
Pointers are 64 bits:
- can theoretically address 16 exabytes of main memory (that is about 18.4 * 10^18 bytes)
- current 64-bit systems use 48 address bits and can support 256 TB of memory
Floating-point values are represented the same way in both 32- and 64-bit architectures:
- defined by the IEEE 754 standard
- float: 32 bits, double: 64 bits, long double: 80 bits
All other numerical values are stored in the same way in 32- and 64-bit architectures.

64-bit extended register sets
In the x86-64 architecture the number of general-purpose registers is extended from 8 to 16:
- also extended in length from 32 to 64 bits
- similarly, the number of XMM registers is extended from 8 to 16
64-bit architectures have more registers:
- much more of the program state can be kept in fast registers
- less need to access local variables on the stack
- more procedure arguments (up to 6) can be passed in registers
More efficient procedure call mechanism:
- less register pressure
- compilers can generate more compact code
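A small, added C check (not from the slides) that prints the integer range and pointer size mentioned above when compiled in 64-bit mode.

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      /* 64-bit signed integers cover -2^63 .. 2^63 - 1 */
      printf("int64_t range: %" PRId64 " .. %" PRId64 "\n", INT64_MIN, INT64_MAX);
      /* In 64-bit mode a pointer occupies 8 bytes instead of 4 */
      printf("sizeof(void *) = %zu bytes\n", sizeof(void *));
      return 0;
  }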

Other advantages
The 64-bit architectures also support new instructions that may not be included in 32-bit architectures:
- compilers for 64-bit architectures have better support for new instructions
- conditional move instruction
- floating-point operations with the SSE unit
- does not always need to keep a stack frame for procedure calls
Disadvantage:
- pointers need twice the amount of memory, 8 bytes instead of 4 bytes

Multi-core architectures
Multi-core processors:
- dual-core: IBM POWER5, AMD Opteron, Intel Core 2 Duo
- quad-core: Intel Core 2 Quad, AMD Opteron (Barcelona)
- six-core: AMD Opteron (Istanbul), Intel Xeon (Westmere)
- twelve-core: AMD Opteron (Magny-Cours)
Future high-end processors will have a multi-core design:
- need parallel programming to take advantage of multi-core and hyperthreading architectures
- threaded programming or message passing (MPI)

Hyperthreading
Hyperthreading, or simultaneous multithreading:
- two (or more) software threads can execute simultaneously on one processor core
- the processor architecture contains hardware support for efficient execution of threads
Improves pipeline efficiency:
- if one thread has to stall, another thread can use the processor cycles that otherwise would be idle
Makes more efficient use of the physical execution resources:
- uses task-level parallelism to increase the utilization of the execution resources
Introduced in the Intel Pentium 4 processor in 2002:
- re-introduced in the Core i7 and Atom processors

Implementation of hyperthreading
A single CPU appears as two logical processors:
- the architectural state is duplicated (program counter, registers, status flags, ...)
- both logical processors share the same physical execution resources
The processor fetches and decodes instructions alternating between the two threads:
- if one thread is blocked, the other gets the full instruction fetch bandwidth
The processor can simultaneously execute instructions from 2 threads:
- instructions from both threads flow simultaneously through the pipeline
- the out-of-order execution mechanism does not need to know from which thread an instruction is
(Figure: two copies of the architectural state sharing one set of processor execution resources.)

Processor architectures for embedded systems
Processors for embedded systems are intended to be used in consumer products, wireless and networking devices, handheld and mobile devices, the automotive industry, etc.
This puts very special requirements on the processor architecture:
- broad range of performance requirements, from very small 8- or 16-bit processors to 32- or 64-bit processors with advanced signal processing capabilities
- must be available in a large number of different configurations
- low power consumption and heat dissipation
- low cost, high flexibility
- available also as a processor core which can be integrated with other hardware on the same chip
- small physical footprint (number of transistors)

Embedded systems processors
Example of an ISA for embedded processors: ARMv6
- backwards compatible with previous generations of ARM processors
Characteristics of the microarchitecture:
- short pipeline
- low clock frequency
- scalar processor design: issues one instruction in each clock cycle, in-order execution
- a few separate functional units (e.g. ALU, Multiply/Add and Load/Store)
- simple branch prediction mechanism (static and dynamic)
- small cache size; can choose between a number of alternative cache sizes
- SIMD instructions for media processing

Code optimization for out-of-order execution and branches
High-level code optimization techniques that may improve pipelined execution and the execution of branches:
- arrange the code in large blocks of linear code
- break long dependency chains in the code
- avoid code with too many branches
These create better possibilities for the out-of-order execution to proceed without interruption.
The compiler can automatically do low-level optimizations:
- more advanced optimizations have to be done by the programmer at the source code level
Branches are unavoidable in programs:
- branches are not inefficient, provided that they can be correctly predicted
- if the code has too many branches, they may exceed the capacity of the branch history buffers and lead to mispredictions

Avoiding branches
Make sure that the compiler can generate conditional move instructions:
- older compiler versions may not use the cmove instructions
- newer versions of GCC generate conditional moves
- the conditional expression in C/C++ is likely to be compiled with a cmove
Avoid random branches, if possible.
Avoid indirect jumps and calls:
- jump tables, function pointers
Avoid very deep nesting of subroutines:
- otherwise the Return Address Stack may overflow
- use iterative functions instead of recursive ones, if possible
Example, branchy and branch-free versions of max:
  int max(int a, int b)
  {
    if (a>b) return a;
    else return b;
  }

  int max(int a, int b)
  {
    return (a>b) ? a : b;
  }

Branch density
If possible, avoid code that contains too many branches:
- avoid complex logical expressions that generate dense conditional branches, especially if the branch bodies are small
- in the AMD Opteron, more than three branches in a 16-byte code block lead to resource conflicts in the branch target buffer
- this causes unnecessary branch mispredictions
Branches can be eliminated by using conditional move or conditional set instructions:
- it may also be possible to rewrite complex branches with assembly language code that uses conditional moves

Order of evaluation in Boolean expressions
C and C++ use short-circuit evaluation for compound Boolean expressions:
- in a Boolean expression (a OP b), the second argument is not evaluated if the first argument alone determines the value of the expression
- if a evaluates to TRUE in an expression if (a || b), then b is not evaluated
- if a evaluates to FALSE in an expression if (a && b), then b is not evaluated
If one of the expressions is known to be true more often than the other, arrange the expressions so that the evaluation is shortened:
- if a is known to be TRUE 60% of the time and b is TRUE 10% of the time, then you should arrange them as (b && a) and (a || b)
If one expression is more predictable, place that first.
If one expression is much faster to calculate, place that first.
If the Boolean expressions have side effects or are dependent, they cannot necessarily be rearranged.
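A small, added C illustration of the reordering rule above; the predicates are hypothetical and their hit rates in the comments correspond to the 60%/10% example.

  #include <stdbool.h>

  /* Hypothetical predicates: is_rare() is TRUE about 10% of the time,
     is_common() about 60% of the time. */
  static bool is_rare(int x)   { return (x % 10) == 0; }
  static bool is_common(int x) { return (x % 10) < 6;  }

  int count_matches(const int *v, int n) {
      int hits = 0;
      for (int i = 0; i < n; i++) {
          /* && : put the test that is usually FALSE first, so the second
             test is skipped about 90% of the time. */
          if (is_rare(v[i]) && is_common(v[i]))
              hits++;
          /* || : put the test that is usually TRUE first, so the second
             test is skipped about 60% of the time. */
          if (is_common(v[i]) || is_rare(v[i]))
              hits++;
      }
      return hits;
  }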

Avoid unnecessary branches
Use if-else-if constructs to avoid branches if the cases are mutually exclusive:
- no need to evaluate all if-statements
A switch statement is even better.
A table lookup can sometimes also be used:
- no branch instructions in the generated assembly code
A chain of independent if-statements:
  double select(int a) {
    double result;
    if (a==0) result = 1.13;
    if (a==1) result = 2.56;
    if (a==2) result = 3.67;
    if (a==3) result = 4.16;
    if (a==4) result = 8.12;
    return result;
  }
With if-else-if, the remaining tests are skipped once a case matches:
  double select(int a) {
    double result;
    if (a==0) result = 1.13;
    else if (a==1) result = 2.56;
    else if (a==2) result = 3.67;
    else if (a==3) result = 4.16;
    else if (a==4) result = 8.12;
    return result;
  }
With a table lookup, no branches at all:
  double select(int a) {
    double result[5] = {1.13, 2.56, 3.67, 4.16, 8.12};
    return result[a];
  }

Order of branches
Order branches in if- and switch-statements so that the most likely case comes first:
- the other cases do not have to be evaluated if the first one is TRUE
Use contiguously numbered case expressions, if possible:
- the compiler can translate the switch-statement into a jump table
- if they are non-contiguous, use a series of if-else statements instead
  switch (value) { /* Most likely case first */
    case 0: handle_0(); break;
    case 1: handle_1(); break;
    case 2: handle_2(); break;
    case 3: handle_3(); break;
  }

  if (a==0) {
    /* Handle case for a==0 */
  }
  else if (a==8) {
    /* Handle case for a==8 */
  }
  else {
    /* Handle default case */
  }

Loop unswitching
Move loop-invariant conditional constructs out of the loop:
- if- or switch-statements which are independent of the loop index can be moved outside of the loop
- the loop is instead repeated in the different branches of the if- or switch-statement
- removes branch instructions from within the loop
Removes branch instructions, increases instruction level parallelism, improves the possibilities to parallelize the loop:
- but increases the amount of code
Before:
  for (i=0; i<n; i++)
  {
    if (a>0)
      X[i] = a;
    else
      X[i] = 0;
  }
After unswitching:
  if (a>0)
  {
    for (i=0; i<n; i++)
      X[i] = a;
  }
  else
  {
    for (i=0; i<n; i++)
      X[i] = 0;
  }

Breaking up dependences
Long dependency chains may slow down pipelined execution:
- each instruction is dependent on data produced by the previous instruction
Try to break up dependences by computing partial results that are independent of each other:
- compute parts of a complex expression in separate temporary variables
Floating-point expressions are evaluated in the order specified in the source code:
- b+c+d+e is evaluated as ((b+c)+d)+e
Can explicitly specify the associative order of the computation:
- may affect the precision of floating-point expressions (but not integer expressions)
Sequential evaluation:
  double a, b, c, d, e;
  a = b+c+d+e;
Explicitly paired evaluation, the two partial sums are independent:
  double a, b, c, d, e;
  a = (b+c)+(d+e);

Loop unrolling
Repeat the body of a loop k times and go through the loop with a step length of k:
- k is called the unrolling factor
- reduces branches, enables better instruction scheduling
In general, we cannot assume that the number of iterations (N) is divisible by the unrolling factor (k):
- have to modify the loop limit and take care of remaining elements
Original loop:
  for (i=0; i<length; i++)
    sum += X[i];
Unrolled by 5:
  const int k=5; /* Unrolling factor */
  int limit = length-(k-1);
  sum=0.0;
  for (i=0; i<limit; i+=k) {
    sum += X[i];
    sum += X[i+1];
    sum += X[i+2];
    sum += X[i+3];
    sum += X[i+4];
  }
  /* Finish remaining elements */
  for (; i<length; i++)
    sum += X[i];

Loop unrolling, breaking dependences
Can also break up dependences by computing partial results in the unrolled loop:
- use 5 separate partial sums
- the computations of sum0 to sum4 are independent and can be done at the same time
Increases register use:
- if some of the variables have to be kept in memory instead of in registers, the code may become slower
  const int k=5;
  int limit = length-(k-1);
  sum0=sum1=sum2=sum3=sum4=0.0;
  for (i=0; i<limit; i+=k) {
    sum0 += X[i];
    sum1 += X[i+1];
    sum2 += X[i+2];
    sum3 += X[i+3];
    sum4 += X[i+4];
  }
  for (; i<length; i++)
    sum0 += X[i];
  sum0 += sum1+sum2+sum3+sum4;

Unrolling small loops
Small loops with a fixed loop count and a small loop body can be completely unrolled.
Example: perspective projection in 3D computer graphics.
  /* 3D transform:
     multiply the vector v with a 4x4 transformation matrix m */
  for (i=0; i<4; i++) {
    r[i] = 0;
    for (j=0; j<4; j++) {
      r[i] += m[j][i] * v[j];
    }
  }
Completely unrolled:
  r[0] = m[0][0]*v[0] + m[1][0]*v[1] + m[2][0]*v[2] + m[3][0]*v[3];
  r[1] = m[0][1]*v[0] + m[1][1]*v[1] + m[2][1]*v[2] + m[3][1]*v[3];
  r[2] = m[0][2]*v[0] + m[1][2]*v[1] + m[2][2]*v[2] + m[3][2]*v[3];
  r[3] = m[0][3]*v[0] + m[1][3]*v[1] + m[2][3]*v[2] + m[3][3]*v[3];