Homework 5
Start date: March 24
Due date: 11:59 PM Monday night, April 10
Problems: 4.1.1, 4.1.2, 4.3, 4.8.1, 4.8.2, 4.9.1-4.9.4, 4.13.1, 4.16.1, 4.16.2

CSCI 402: Computer Architectures
The Processor (4)
Fengguang Song
Department of Computer & Information Science, IUPUI
Sections 4.9-4.15: Exceptions

Today's Contents
- How to improve Instruction-Level Parallelism (ILP)
  - Multiple issue
  - Dynamic pipeline scheduling (out-of-order execution)
  - Speculation
- Software technique: loop unrolling
- Real stuff: ARM Cortex-A8, Intel Core i7

Unexpected Events
- Unexpected events require a change in the flow of CPU control
  - E.g., arithmetic overflow, integer division by 0, invalid memory access, unaligned memory address, hardware failure, etc.
  - But they are not the same as beq, jump, etc., which are planned changes of control flow
- Most often categorized into two types:
  1) Exception
     - Arises within the CPU
     - Such as an undefined opcode, arithmetic overflow, or a system call
     - Refers to internal events
  2) Interrupt
     - Comes from an external I/O device
     - Such as a network card, HDD (hard disk drive) request, keyboard, etc.
     - Refers to external events
- On MIPS, both are called exceptions (MIPS doesn't distinguish them)
- In pipelined processors, how do we deal with these unexpected events?
How Does MIPS Handle Exceptions?
- Exceptions are managed by the System Coprocessor 0 (CP0)
  - CP1 is for floating-point arithmetic
- Any exception triggers an (unscheduled) procedure call
- 4 steps to handle an exception:
  1) Save the PC of the offending instruction to the Exception Program Counter (EPC)
  2) Save an indication of the problem to the Cause register (4 bits for the exception code)
     - Here we use 1 bit because of our simple implementation: 0 means undefined opcode, 1 means overflow
  3) Jump to the handler located at 0x8000 0180 (i.e., transfer control to the OS)
     - A single entry point into the OS

Handling Exceptions in MIPS
4) After jumping to the handler:
   - The handler (within the OS) reads the Cause register
   - Calls the relevant procedure
   - Determines the required action:
     1. If restartable: correct the error, then use the EPC to return to the program
     2. Otherwise: terminate the program and report the error using the EPC, Cause register, etc.
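The four steps above can be sketched as a toy model in C. This is only an illustration of the control flow, not real CP0 hardware; the struct, type, and function names are made up for this sketch:

```c
#include <stdint.h>

/* Simplified 1-bit cause scheme from the slides:
   0 = undefined opcode, 1 = arithmetic overflow. */
enum { CAUSE_UNDEF_OPCODE = 0, CAUSE_OVERFLOW = 1 };

typedef struct {
    uint32_t epc;    /* Exception Program Counter */
    uint32_t cause;  /* Cause register            */
} cp0_t;

#define HANDLER_ADDR 0x80000180u /* fixed MIPS exception entry point */

/* Steps 1-3: save the offending PC, record the cause,
   and transfer control to the single OS entry point. */
uint32_t raise_exception(cp0_t *cp0, uint32_t pc, uint32_t cause) {
    cp0->epc = pc;       /* 1) save offending PC into EPC     */
    cp0->cause = cause;  /* 2) record why the exception arose */
    return HANDLER_ADDR; /* 3) next PC = handler entry        */
}

/* Step 4: the handler inspects Cause and decides what to do. */
uint32_t handle(const cp0_t *cp0) {
    if (cp0->cause == CAUSE_OVERFLOW)
        return cp0->epc; /* restartable: fix up, return via EPC */
    return 0;            /* otherwise: terminate, report via EPC/Cause */
}
```

Note how the EPC does double duty: it is the return address when the program is restartable, and the error-report location when it is not.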
Exceptions in a Pipeline
- How do we stop the pipeline and handle exceptions?
- For MIPS: just another form of control hazard
  - Similar to handling a mispredicted branch; uses much of the same hardware
- E.g., an overflow occurs in the EX stage of add $1, $2, $1
  - Complete the previous instructions in the pipeline
  - Flush add and its subsequent instructions from the pipeline
    - Set EX.Flush = 1, ID.Flush = 1, IF.Flush = 1
  - Set the EPC and Cause registers
  - Transfer control to the exception handler
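The flush itself can be sketched as a few lines of C over a toy five-entry pipeline model (stage names follow the slides; the array representation is an assumption of this sketch):

```c
/* Toy pipeline model: pipe[s] holds the instruction currently in
   stage s (0 means a bubble/nop). On an exception detected in EX,
   the flush signals EX.Flush, ID.Flush and IF.Flush squash the
   offending instruction and everything younger than it, while the
   older instructions in MEM and WB complete normally. */
enum { IF, ID, EX, MEM, WB, NSTAGES };
#define BUBBLE 0u

void flush_on_exception(unsigned pipe[NSTAGES]) {
    pipe[IF] = pipe[ID] = pipe[EX] = BUBBLE; /* squash younger instrs */
    /* pipe[MEM] and pipe[WB] are untouched: they finish as usual */
}
```

This is exactly the mispredicted-branch mechanism reused: turn the wrong-path instructions into bubbles and redirect the PC (here, to 0x8000 0180 instead of the branch target).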
Datapath + Controls to Handle Exceptions

An Exception Example
- Exception on the add instruction in:

    40  sub $11, $2, $4
    44  and $12, $2, $5
    48  or  $13, $2, $6
    4C  add $1,  $2, $1
    50  slt $15, $6, $7
    54  lw  $16, 50($7)

- Handler procedure:

    80000180  sw $25, 1000($0)
    80000184  sw $26, 1004($0)
[Pipeline diagrams: Exception Example — the offending add and the instructions behind it are flushed into bubbles, and control transfers to the handler at 80000180]
Exploit Instruction-Level Parallelism (ILP)
- Pipelining: can execute multiple instructions in parallel
  - This type of parallelism is called ILP
- How to further increase ILP?
- 1st solution: deeper pipeline
  - More instructions being overlapped
  - Less work per stage => the clock cycle can become shorter
- 2nd solution: multiple issue
  - Replicate datapath components => multiple instructions per stage
  - Can start multiple instructions per clock cycle
  - This technique is called multiple issue
  - CPI < 1, so we use Instructions Per Cycle (IPC)
    - E.g., 4 GHz, 4-way multiple issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
  - However, dependencies may reduce IPC in practice

Implementations of Multiple Issue
- Static multiple issue
  - The compiler groups instructions (issues one group at a time)
  - The compiler is responsible for detecting and avoiding hazards
- Dynamic multiple issue
  - The CPU hardware examines the instruction stream and chooses instructions to issue each cycle
  - The CPU resolves hazards using advanced techniques at runtime
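The peak-throughput numbers above (4 GHz, 4-way issue -> 16 BIPS, peak CPI = 0.25) are just clock rate times issue width; two one-line helpers make the arithmetic explicit (helper names are illustrative):

```c
/* Peak throughput of a multiple-issue processor:
   peak instructions/second = clock rate x issue width. */
double peak_bips(double clock_ghz, int issue_width) {
    return clock_ghz * issue_width; /* billions of instructions/second */
}

/* Peak CPI is the reciprocal of the issue width. */
double peak_cpi(int issue_width) {
    return 1.0 / issue_width; /* e.g., 4-way issue -> peak CPI = 0.25 */
}
```

These are peaks only: dependencies, cache misses, and branch mispredictions keep the achieved IPC below the issue width in practice, as the later examples show.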
The Speculation Technique
- Used in both static and dynamic multiple issue
- An important method to enable more ILP
- The compiler or the processor guesses the outcome of an instruction
  - In order to enable and start execution of other instructions
- Next, need to check whether the guess was right
  - If right, complete the operation
  - If not, roll back and do the right thing
- E.g., speculate on the branch outcome
  - Roll back if the path taken is different
- E.g., speculate on store instructions (e.g., store [M] -> load [M])
  - Assume the addresses are different; roll back if the location is updated

Static Multiple Issue
- The compiler groups instructions into issue packets
  - A group of instructions that can be issued in a single cycle
  - Determined by the pipeline resources required
- Think of an issue packet as a very long instruction
  - Can specify multiple concurrent operations
- The compiler must remove some/all hazards
  - Reorder instructions into issue packets
  - No dependencies within a packet
  - OK to have dependencies between packets
  - Pad with nop if necessary
MIPS with Static Dual Issue
- Two-issue packets:
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned: ALU/branch + load/store
  - Pad an unused slot with nop

    Address   Instruction type   Pipeline stages
    n         ALU/branch         IF ID EX MEM WB
    n + 4     Load/store         IF ID EX MEM WB
    n + 8     ALU/branch            IF ID EX MEM WB
    n + 12    Load/store            IF ID EX MEM WB
    n + 16    ALU/branch               IF ID EX MEM WB
    n + 20    Load/store               IF ID EX MEM WB

Datapath with Static Dual Issue
Loop Unrolling
- Replicate the loop body to expose more parallelism
- Can reduce loop-control overhead (branch hazards)
- Can expose a bigger instruction pool to schedule/reorder

A Code Scheduling Example
- Schedule this for the dual-issue MIPS:

    Loop: lw   $t0, 0($s1)       # $t0 = array element
          addu $t0, $t0, $s2     # add scalar in $s2
          sw   $t0, 0($s1)       # store result
          addi $s1, $s1, -4      # decrement pointer
          bne  $s1, $zero, Loop  # branch if $s1 != 0

    ALU/branch                   Load/store           Cycle
    Loop: nop                    lw  $t0, 0($s1)      1
          addi $s1, $s1, -4      nop                  2
          addu $t0, $t0, $s2     nop                  3
          bne  $s1, $zero, Loop  sw  $t0, 4($s1)      4

- IPC = 5/4 = 1.25 (cf. peak IPC = 2)
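In C, the loop being scheduled above is simply "add a scalar to every element, walking backwards". A plain version and a hand-unrolled-by-4 version (a sketch with illustrative function names, assuming n is a multiple of 4 for the unrolled one) look like:

```c
/* Plain loop: the C analogue of the MIPS loop above. */
void add_scalar(int *a, int n, int s) {
    for (int i = n - 1; i >= 0; i--)
        a[i] += s;               /* lw / addu / sw of one element */
}

/* Unrolled by 4 (assumes n % 4 == 0): one branch and one pointer
   update per four elements, and four independent adds that a
   dual-issue scheduler can pair with the loads and stores. */
void add_scalar_unrolled(int *a, int n, int s) {
    for (int i = n - 4; i >= 0; i -= 4) {
        a[i + 3] += s;
        a[i + 2] += s;
        a[i + 1] += s;
        a[i]     += s;
    }
}
```

The unrolled body is what the next slide schedules: four independent load/add/store chains give the compiler enough work to keep both issue slots busy most cycles.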
Loop Unrolling Example
- Unroll the loop body 4 times:

    ALU/branch                     Load/store           Cycle
    Loop: addi $s1, $s1, -16       lw  $t0, 0($s1)      1
          nop                      lw  $t1, 12($s1)     2
          addu $t0, $t0, $s2       lw  $t2, 8($s1)      3
          addu $t1, $t1, $s2       lw  $t3, 4($s1)      4
          addu $t2, $t2, $s2       sw  $t0, 16($s1)     5
          addu $t3, $t3, $s2       sw  $t1, 12($s1)     6
          nop                      sw  $t2, 8($s1)      7
          bne  $s1, $zero, Loop    sw  $t3, 4($s1)      8

- IPC = 14/8 = 1.75
  - Closer to 2 (at the cost of more registers and larger code size)
- Notice 12 of the 14 instructions are issued in pairs!

Matrix Multiply
- Unrolled C code:

    #include <x86intrin.h>
    #define UNROLL (4)

    void dgemm (int n, double* A, double* B, double* C)
    {
      for ( int i = 0; i < n; i += UNROLL*4 )
        for ( int j = 0; j < n; j++ ) {
          __m256d c[4];
          for ( int x = 0; x < UNROLL; x++ )  // can do it manually!
            c[x] = _mm256_load_pd(C+i+x*4+j*n);

          for ( int k = 0; k < n; k++ )
          {
            __m256d b = _mm256_broadcast_sd(B+k+j*n);
            for ( int x = 0; x < UNROLL; x++ )
              c[x] = _mm256_add_pd(c[x],
                       _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
          }

          for ( int x = 0; x < UNROLL; x++ )
            _mm256_store_pd(C+i+x*4+j*n, c[x]);
        }
    }
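For readers without AVX hardware, the same unroll-by-4 structure can be written in scalar C. This is a sketch, not the textbook's code: it keeps the column-major layout of the intrinsics version (C(i,j) = C[i + j*n], A(i,k) = A[i + k*n], B(k,j) = B[k + j*n]) and assumes n is a multiple of 4:

```c
/* Scalar dgemm, unrolled by 4 over i: C += A * B, column-major.
   The four accumulators c0..c3 play the role of the four __m256d
   registers in the AVX version -- independent chains the hardware
   can overlap. Assumes n % 4 == 0. */
void dgemm_unrolled(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < n; j++) {
            double c0 = C[i + 0 + j*n], c1 = C[i + 1 + j*n],
                   c2 = C[i + 2 + j*n], c3 = C[i + 3 + j*n];
            for (int k = 0; k < n; k++) {
                double b = B[k + j*n];       /* "broadcast" of B(k,j) */
                c0 += A[i + 0 + k*n] * b;    /* four independent      */
                c1 += A[i + 1 + k*n] * b;    /* multiply-adds per     */
                c2 += A[i + 2 + k*n] * b;    /* iteration of k        */
                c3 += A[i + 3 + k*n] * b;
            }
            C[i + 0 + j*n] = c0;  C[i + 1 + j*n] = c1;
            C[i + 2 + j*n] = c2;  C[i + 3 + j*n] = c3;
        }
}
```

The AVX version goes further: each of its four accumulators is itself a 4-wide vector, so one iteration of k advances 16 elements of C instead of 4.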
Performance Impact

Dynamic Multiple Issue
- Also called superscalar processors, or superscalars
- The CPU decides whether to issue 0, 1, 2, 3, or 4 instructions in each cycle
  - While avoiding structural and data hazards
- Avoids the need for compiler-based scheduling
  - Though the compiler may still help
  - Correct code semantics is ensured by the CPU
Dynamic Pipeline Scheduling
- Allow the CPU to execute instructions out of order to avoid stalls, but commit results to registers in order
- Example:

    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    sub  $s4, $s4, $t3
    slti $t5, $s4, 20

  - Can start sub while addu is waiting for lw
  - Note: lw could take hundreds of cycles (if the data is not in the cache)

Dynamically Scheduled CPU
- Preserves dependencies
- Reservation stations hold pending operands
- Results are also sent to any waiting reservation stations
- A reorder buffer handles register writes
  - Can also supply operands for issued instructions
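A dynamically scheduled CPU issues an instruction as soon as its operands are ready, not in program order. A toy readiness check (the register numbering follows the MIPS convention; the cycle numbers are made up for illustration) shows why sub can start before addu in the example above:

```c
#define NREGS 32

/* ready[r] = first cycle at which register r's value is available.
   An instruction may issue once both source registers are ready --
   the same check a reservation station performs on its operands. */
int earliest_issue(const int ready[NREGS], int src1, int src2) {
    return ready[src1] > ready[src2] ? ready[src1] : ready[src2];
}
```

addu waits on $t0, the result of the lw; sub's operands ($s4, $t3) are already in registers, so its earliest issue cycle is immediate even though it comes later in program order.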
Why Do We Need Dynamic Scheduling?
- Why not just let the compiler schedule code?
- Not all stalls are predictable
  - E.g., cache misses
- Can't always schedule around branches
  - The branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards

Does Multiple Issue Work?
- The BIG Picture: yes, but not as much as we'd like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - E.g., pointer aliasing
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well
Problem: Power Inefficiency
- Vendors used to turn transistors into performance (< 2004)
- However, the complexity of multiple issue, deeper pipelines, dynamic scheduling, and speculation requires power -> hitting the Power Wall
- Multiple simpler, slower cores may be better

ARM Cortex-A8 vs. Intel Core i7

    Processor                      ARM A8                  Intel Core i7 920
    Market                         Personal mobile device  Server, cloud
    Thermal design power           2 Watts                 130 Watts
    Clock rate                     1 GHz                   2.66 GHz
    Cores/chip                     1                       4
    Floating point?                No                      Yes
    Multiple issue?                Dynamic                 Dynamic
    Peak instructions/clock cycle  2                       4
    Pipeline stages                14                      14
    Pipeline schedule              Static in-order         Dynamic out-of-order
                                                           with speculation
    Branch prediction              2-level                 2-level
    1st-level caches/core          32 KiB I, 32 KiB D      32 KiB I, 32 KiB D
    2nd-level caches/core          128-1024 KiB            256 KiB
    3rd-level caches (shared)      -                       2-8 MB
ARM Cortex-A8 Pipeline

Core i7 Pipeline
- x86 instructions are translated into micro-operations
- Branch mispredict penalty: ~15 cycles
Fallacies
- Pipelining is easy!
  - The basic idea is easy; the devil is in the details
  - E.g., detecting data hazards
- Pipelining is independent of technology?
  - More transistors make more advanced techniques feasible => multiple issue, dynamic pipeline scheduling
  - Today, concern about power leads to less aggressive pipeline designs
  - Hence, pipeline-related ISA design needs to take account of technology trends

Pitfalls
- Poor ISA design can make pipelining harder to implement
  - E.g., complex instruction sets (VAX, IA-32)
    - Variable instruction lengths and running times lead to imbalanced pipeline stages
    - The workaround: the IA-32 micro-op approach
  - E.g., complex addressing modes
    - May involve multiple memory accesses
    - Complicate the pipeline control
Concluding Remarks
- The ISA influences the design of both the datapath and the control
- The datapath and control influence the design of the ISA
- Pipelining improves instruction throughput using parallelism
  - More instructions are completed per second
  - The latency of each instruction is not reduced
- Hazards: structural, data, control
- Multiple issue and dynamic scheduling
  - Data dependencies limit the achievable parallelism
  - Meanwhile, complexity (sophisticated design and control) leads to the power wall