Homework 5. Start date: March 24. Due date: 11:59 PM on Monday night, April 10. CSCI 402: Computer Architectures

Homework 5
Start date: March 24. Due date: 11:59 PM on Monday night, April 10.
Problems: 4.1.1, 4.1.2; 4.3; 4.8.1, 4.8.2; 4.9.1-4.9.4; 4.13.1; 4.16.1, 4.16.2

CSCI 402: Computer Architectures
The Processor (4)
Fengguang Song
Department of Computer & Information Science, IUPUI

Today's Contents
- Exceptions (Sections 4.9-4.15)
- How to improve instruction-level parallelism (ILP):
  - Multiple issue
  - Dynamic pipeline scheduling (out-of-order execution)
  - Speculation
- Software technique: loop unrolling
- Real stuff: ARM Cortex-A8, Intel Core i7

Unexpected Events
Unexpected events require a change in the flow of CPU control, e.g., arithmetic overflow, integer division by 0, invalid memory access, unaligned memory address, hardware failure, etc. They are not the same as beq or jump. They are most often categorized into two types:
1) Exception: arises within the CPU (e.g., undefined opcode, arithmetic overflow, system call); refers to internal events.
2) Interrupt: comes from an external I/O device (e.g., network card, hard disk drive request, keyboard); refers to external events.
MIPS calls both "exceptions" and does not distinguish them. In pipelined processors, how do we deal with these unexpected events?

How Does MIPS Handle Exceptions?
Exceptions are managed by a system coprocessor, Coprocessor 0 (CP0); CP1 is for floating-point arithmetic. Any exception triggers an (unscheduled) procedure call. There are four steps to handle an exception:
1) Save the PC of the offending instruction to the Exception Program Counter (EPC).
2) Save an indication of the problem to the Cause register (a few bits hold the exception code). Here we use 1 bit because of our simple implementation: 0 means undefined opcode, 1 means overflow.
3) Jump to the handler located at 0x8000 0180 (i.e., transfer control to the OS): a single entry point into the OS.
4) After jumping to the handler, the handler (within the OS) reads the Cause register, calls the relevant procedure, and determines the required action:
   1. If re-startable: correct the error, then use the EPC to return to the program.
   2. Otherwise: terminate the program and report the error using the EPC, Cause, etc.
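A minimal C sketch of step 4's dispatch logic (illustrative only: read_cause() and read_epc() stand in for the mfc0 instructions that read CP0's Cause and EPC registers on real MIPS, and the printed actions are placeholders):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Exception codes matching the slide's 1-bit scheme. */
    enum { EXC_UNDEFINED_OPCODE = 0, EXC_OVERFLOW = 1 };

    /* Hypothetical stand-ins for mfc0 reads of CP0 registers. */
    static uint32_t read_cause(void) { return EXC_OVERFLOW; }
    static uint32_t read_epc(void)   { return 0x0000004Cu; }

    /* Single OS entry point (at 0x8000 0180 on real MIPS):
       decode the cause, then either restart or terminate. */
    void exception_entry(void) {
        uint32_t cause = read_cause();
        uint32_t epc   = read_epc();

        switch (cause) {
        case EXC_OVERFLOW:
            /* Re-startable: correct the error, resume via EPC. */
            printf("overflow at 0x%08x: fix up, return via EPC\n", epc);
            break;
        case EXC_UNDEFINED_OPCODE:
        default:
            /* Not re-startable: terminate, report EPC and cause. */
            printf("fatal exception %u at 0x%08x\n", cause, epc);
            exit(1);
        }
    }

    int main(void) { exception_entry(); return 0; }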

Exceptions in a Pipeline
How do we stop the pipeline and handle exceptions? For MIPS, an exception is just another form of control hazard: it is handled much like a mispredicted branch, using much of the same hardware. For example, suppose an overflow occurs in the EX stage of add $1, $2, $1:
- Complete the previous instructions in the pipeline.
- Flush add and its subsequent instructions from the pipeline: set EX.flush = 1, ID.flush = 1, IF.flush = 1.
- Set the EPC and Cause registers.
- Transfer control to the exception handler.
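The control actions above can be summarized in a small C sketch (the struct and function packaging are hypothetical; only the flush/EPC/Cause/PC actions come from the slide):

    #include <stdint.h>

    enum { CAUSE_UNDEF_OPCODE = 0, CAUSE_OVERFLOW = 1 };

    /* Hypothetical bundle of architectural and flush state. */
    struct cpu_state {
        uint32_t pc, epc, cause;
        int if_flush, id_flush, ex_flush;
    };

    /* Overflow detected in EX: bubble the offending add and all
       younger instructions, record EPC/Cause, redirect fetch. */
    void raise_overflow(struct cpu_state *s, uint32_t offending_pc) {
        s->ex_flush = s->id_flush = s->if_flush = 1;
        s->epc   = offending_pc;      /* PC of the offending add */
        s->cause = CAUSE_OVERFLOW;    /* 1 = overflow (1-bit scheme) */
        s->pc    = 0x80000180u;       /* next fetch: exception handler */
    }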

Datapath + Controls to Handle Exceptions
(Figure: the pipelined datapath extended with flush signals and the EPC and Cause registers.)

An Exception Example
An exception occurs on the add instruction at address 4C:

    40 sub $11, $2, $4
    44 and $12, $2, $5
    48 or  $13, $2, $6
    4C add $1,  $2, $1
    50 slt $15, $6, $7
    54 lw  $16, 50($7)

Handler procedure:

    80000180 sw $25, 1000($0)
    80000184 sw $26, 1004($0)

(Figures: pipeline diagrams for the exception example. The add and the instructions after it are flushed, marked X, and the first instruction of the handler is then fetched from address 8000 0180.)

Exploit Instruction-Level Parallelism (ILP)
Pipelining executes multiple instructions in parallel; this type of parallelism is called ILP. How can we further increase ILP?
1st solution: a deeper pipeline. More instructions are overlapped, and less work per stage means the clock cycle can become shorter.
2nd solution: multiple issue. Replicate datapath components so there are multiple instructions per stage, and start multiple instructions per clock cycle. With multiple issue, CPI < 1, so we use Instructions Per Cycle (IPC) instead. E.g., a 4 GHz, 4-way multiple-issue processor peaks at 16 BIPS, with peak IPC = 4 and peak CPI = 0.25 (these peak figures are worked out below). However, dependencies reduce the achieved IPC in practice.

Implementations of Multiple Issue
- Static multiple issue: the compiler groups instructions and the CPU issues one group at a time; the compiler is responsible for detecting and avoiding hazards.
- Dynamic multiple issue: the CPU hardware examines the instruction stream and chooses which instructions to issue each cycle; the CPU resolves hazards using advanced techniques at runtime.
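For concreteness, the peak figures quoted above follow directly from the definitions (peak instruction rate = clock rate x issue width; CPI is the reciprocal of IPC):

    \[
      4\,\mathrm{GHz} \times 4\,\tfrac{\mathrm{instructions}}{\mathrm{cycle}}
        = 16 \times 10^{9}\ \tfrac{\mathrm{instructions}}{\mathrm{s}}\ (16\ \mathrm{BIPS}),
      \qquad
      \mathrm{peak\ CPI} = \frac{1}{\mathrm{peak\ IPC}} = \frac{1}{4} = 0.25 .
    \]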

The Speculation Technique
Speculation is used in both static and dynamic multiple issue, and is an important method for enabling more ILP. The compiler or the processor guesses the outcome of an instruction in order to start executing other instructions early, then checks whether the guess was right: if right, complete the operation; if not, roll back and do the right thing. Examples:
- Speculate on a branch outcome; roll back if the path taken is different.
- Speculate past store instructions (e.g., store [M] -> load [M]): assume the addresses differ, and roll back if the stored location is the one loaded.

Static Multiple Issue
The compiler groups instructions into issue packets: groups of instructions that can be issued in a single cycle, determined by the pipeline resources required. Think of an issue packet as a very long instruction that can specify multiple concurrent operations. The compiler must remove some or all hazards:
- Reorder instructions into issue packets.
- No dependencies within a packet (dependencies between packets are OK).
- Pad with nop if necessary.

MIPS with Static Dual Issue
Two-issue packets: one ALU/branch instruction plus one load/store instruction, 64-bit aligned (ALU/branch first, then load/store); pad an unused slot with nop.

    Address   Instruction type   Pipeline stages
    n         ALU/branch         IF ID EX MEM WB
    n + 4     Load/store         IF ID EX MEM WB
    n + 8     ALU/branch            IF ID EX MEM WB
    n + 12    Load/store            IF ID EX MEM WB
    n + 16    ALU/branch               IF ID EX MEM WB
    n + 20    Load/store               IF ID EX MEM WB

Datapath with Static Dual Issue
(Figure: the dual-issue datapath.)

Loop Unrolling
Replicate the loop body to expose more parallelism. This can reduce loop-control overhead (branch hazards) and exposes a bigger pool of instructions to schedule and reorder.

A Code Scheduling Example
Schedule this loop for the dual-issue MIPS:

    Loop: lw   $t0, 0($s1)       # $t0 = array element
          addu $t0, $t0, $s2     # add scalar in $s2
          sw   $t0, 0($s1)       # store result
          addi $s1, $s1, -4      # decrement pointer
          bne  $s1, $zero, Loop  # branch if $s1 != 0

    ALU/branch                Load/store        cycle
    Loop: nop                 lw $t0, 0($s1)    1
    addi $s1, $s1, -4         nop               2
    addu $t0, $t0, $s2        nop               3
    bne $s1, $zero, Loop      sw $t0, 4($s1)    4

IPC = 5/4 = 1.25 (cf. peak IPC = 2).
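In C terms, the loop above and its unrolled form look roughly like this (a sketch; the function and variable names are illustrative, not from the slides, and the unrolled version assumes n is a multiple of 4):

    /* Original loop: one add plus loop-control overhead per element. */
    void add_scalar(int *a, int n, int s) {
        for (int i = 0; i < n; i++)
            a[i] += s;
    }

    /* Unrolled by 4 (assumes n is a multiple of 4): one branch per
       four elements, and four independent adds the scheduler can
       freely reorder into issue slots. */
    void add_scalar_unrolled(int *a, int n, int s) {
        for (int i = 0; i < n; i += 4) {
            a[i]     += s;
            a[i + 1] += s;
            a[i + 2] += s;
            a[i + 3] += s;
        }
    }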

Loop Unrolling Example
Unroll the loop body four times:

    ALU/branch                 Load/store        cycle
    Loop: addi $s1, $s1, -16   lw $t0, 0($s1)    1
    nop                        lw $t1, 12($s1)   2
    addu $t0, $t0, $s2         lw $t2, 8($s1)    3
    addu $t1, $t1, $s2         lw $t3, 4($s1)    4
    addu $t2, $t2, $s2         sw $t0, 16($s1)   5
    addu $t3, $t3, $s2         sw $t1, 12($s1)   6
    nop                        sw $t2, 8($s1)    7
    bne $s1, $zero, Loop       sw $t3, 4($s1)    8

IPC = 14/8 = 1.75, closer to 2 (at the cost of more registers and larger code size). Notice that 12 of the 14 instructions issue in pairs!

Matrix Multiply
Unrolled C code:

    #include <x86intrin.h>
    #define UNROLL (4)

    void dgemm (int n, double* A, double* B, double* C)
    {
      for ( int i = 0; i < n; i += UNROLL*4 )
        for ( int j = 0; j < n; j++ ) {
          __m256d c[4];
          for ( int x = 0; x < UNROLL; x++ )   // can do it manually!
            c[x] = _mm256_load_pd(C+i+x*4+j*n);

          for ( int k = 0; k < n; k++ )
          {
            __m256d b = _mm256_broadcast_sd(B+k+j*n);
            for ( int x = 0; x < UNROLL; x++ )
              c[x] = _mm256_add_pd(c[x],
                       _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
          }

          for ( int x = 0; x < UNROLL; x++ )
            _mm256_store_pd(C+i+x*4+j*n, c[x]);
        }
    }
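A minimal driver sketch (not from the slides): the matrices are column-major, n must be a multiple of UNROLL*4 = 16, and the arrays must be 32-byte aligned because _mm256_load_pd requires aligned addresses. Compile with something like gcc -O2 -mavx.

    #include <stdio.h>
    #include <stdlib.h>

    void dgemm(int n, double* A, double* B, double* C);  /* from above */

    int main(void) {
        int n = 32;
        double *A, *B, *C;
        /* 32-byte alignment for the AVX loads/stores. */
        posix_memalign((void**)&A, 32, n*n*sizeof(double));
        posix_memalign((void**)&B, 32, n*n*sizeof(double));
        posix_memalign((void**)&C, 32, n*n*sizeof(double));
        for (int i = 0; i < n*n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        dgemm(n, A, B, C);

        /* Every element of C should be n * 1.0 * 2.0 = 64.0. */
        printf("C[0] = %.1f\n", C[0]);
        free(A); free(B); free(C);
        return 0;
    }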

Performance Impact
(Figure: performance impact of the unrolled matrix multiply.)

Dynamic Multiple Issue
Dynamically multiple-issue processors are also called superscalar processors, or simply superscalars. The CPU decides whether to issue 0, 1, 2, 3, or 4 instructions in each cycle, while avoiding structural and data hazards. This avoids the need for compiler-based scheduling, though the compiler may still help; correct code semantics are ensured by the CPU.

Dynamic Pipeline Scheduling
Allow the CPU to execute instructions out of order to avoid stalls, but commit results to registers in order. Example:

    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    sub  $s4, $s4, $t3
    slti $t5, $s4, 20

The CPU can start sub while addu is waiting for lw. Note that lw could take hundreds of cycles if the data is not in the cache.

Dynamically Scheduled CPU
(Figure: the issue logic preserves dependencies; reservation stations hold pending operands, and results are also sent to any waiting reservation stations; a reorder buffer holds results for in-order register writes and can supply operands for issued instructions.)

Why Do We Need Dynamic Scheduling?
Why not just let the compiler schedule code?
- Not all stalls are predictable, e.g., cache misses.
- The compiler can't always schedule around branches: branch outcomes are determined dynamically.
- Different implementations of an ISA have different latencies and hazards.

Does Multiple Issue Work? The BIG Picture
Yes, but not as much as we'd like:
- Programs have real dependencies that limit ILP, and some dependencies are hard to eliminate, e.g., pointer aliasing (see the sketch below).
- Some parallelism is hard to expose: the window size during instruction issue is limited.
- Memory delays and limited bandwidth make it hard to keep pipelines full.
Speculation can help if done well.
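A small C sketch of why pointer aliasing blocks reordering (hypothetical function, not from the slides): unless the compiler can prove p and q never point to the same location, the load of *q cannot be moved above the store to *p.

    /* If p == q, the load must observe the freshly stored value, so
       neither the compiler nor the hardware may reorder the two
       accesses without checking (or speculating on) the addresses. */
    int update(int *p, int *q) {
        *p = 1;         /* store */
        return *q + 1;  /* load that may alias *p */
    }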

Problem: Power Inefficiency
Vendors used to turn transistors into performance (before ~2004). However, the complexity of multiple issue, deeper pipelines, dynamic scheduling, and speculation requires power -> hitting the power wall. Multiple simpler, slower cores may be better.

ARM Cortex-A8 vs. Intel Core i7

    Processor                   ARM A8                  Intel Core i7 920
    Market                      Personal mobile device  Server, cloud
    Thermal design power        2 Watts                 130 Watts
    Clock rate                  1 GHz                   2.66 GHz
    Cores/chip                  1                       4
    Floating point?             No                      Yes
    Multiple issue?             Dynamic                 Dynamic
    Peak instructions/clock     2                       4
    Pipeline stages             14                      14
    Pipeline schedule           Static in-order         Dynamic out-of-order with speculation
    Branch prediction           2-level                 2-level
    1st-level caches/core       32 KiB I, 32 KiB D      32 KiB I, 32 KiB D
    2nd-level caches/core       128-1024 KiB            256 KiB
    3rd-level caches (shared)   -                       2-8 MB

ARM Cortex-A8 Pipeline
(Figure: the Cortex-A8 pipeline.)

Core i7 Pipeline
(Figure: the Core i7 pipeline; x86 instructions are translated into micro-operations. Branch mispredict penalty: ~15 cycles.)

Fallacies
- Pipelining is easy! The basic idea is easy, but the devil is in the details, e.g., detecting data hazards.
- Pipelining is independent of technology? No: more transistors make more advanced techniques feasible (multiple issue, dynamic pipeline scheduling), while today's concern about power leads to less aggressive pipeline designs. Hence, pipeline-related ISA design needs to account for technology trends.

Pitfalls
Poor ISA design can make pipelining harder to implement. For example:
- Complex instruction sets (VAX, IA-32): variable instruction lengths and running times lead to imbalanced pipeline stages. The IA-32 micro-op approach is a workaround.
- Complex addressing modes: they may involve multiple memory accesses and complicate the pipeline control.

Concluding Remarks
- The ISA influences the design of both the datapath and control, and the datapath and control in turn influence the design of the ISA.
- Pipelining improves instruction throughput using parallelism: more instructions complete per second, but the latency of each instruction is not reduced.
- Hazards: structural, data, and control.
- Multiple issue and dynamic scheduling: data dependencies limit the achievable parallelism, and the added complexity (sophisticated design and control) leads to the power wall.