Exploitation of instruction level parallelism


Computer Architecture
J. Daniel García Sánchez (coordinator), David Expósito Singh, Francisco Javier García Blas
ARCOS Group, Computer Science and Engineering Department
University Carlos III of Madrid
http://www.arcos.inf.uc3m.es

Outline
1 Compilation techniques and ILP
2 Advanced branch prediction techniques
3 Introduction to dynamic scheduling
4 Speculation
5 Multiple issue techniques
6 ILP limits
7 Thread level parallelism
8 Conclusion

Compilation techniques and ILP

Taking advantage of ILP
ILP is directly applicable within basic blocks.
- Basic block: a sequence of instructions with no branches in or out.
- In a typical MIPS program the average basic block is only 3 to 6 instructions long, so little ILP can be exploited within a single block.
- We therefore need to exploit ILP across basic blocks.
Example:
    for (i = 0; i < 1000; i++) {
        x[i] = x[i] + y[i];
    }
- Loop-level parallelism: independent iterations can be transformed into ILP, either by the compiler or by hardware.
- Alternative: vector (SIMD) instructions in the processor.

Scheduling and loop unrolling
Exploiting parallelism means:
- Interleaving the execution of unrelated instructions.
- Filling stall slots with useful instructions.
- Not altering the effects of the original program.
The compiler can do this with detailed knowledge of the architecture.

ILP exploitation
Example:
    for (i = 999; i >= 0; i--) {
        x[i] = x[i] + s;
    }
Each iteration body is independent of the others.
Latencies between instructions:
    Instruction producing result   Instruction using result   Latency (cycles)
    FP ALU operation               FP ALU operation            3
    FP ALU operation               Store double                2
    Load double                    FP ALU operation            1
    Load double                    Store double                0

Compiled code
- R1: points to the last array element.
- F2: holds the scalar s.
- R2: precomputed so that 8(R2) is the address of the first element in the array.
Assembly code:
    Loop: L.D    F0, 0(R1)     ; F0 <- x[i]
          ADD.D  F4, F0, F2    ; F4 <- F0 + s
          S.D    F4, 0(R1)     ; x[i] <- F4
          DADDUI R1, R1, #-8   ; i--
          BNE    R1, R2, Loop  ; branch if R1 != R2

Stalls in execution
Original:
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
With stalls:
    Loop: L.D    F0, 0(R1)
          <stall>
          ADD.D  F4, F0, F2
          <stall>
          <stall>
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          <stall>
          BNE    R1, R2, Loop

Loop scheduling
Original, 9 cycles per iteration:
    Loop: L.D    F0, 0(R1)
          <stall>
          ADD.D  F4, F0, F2
          <stall>
          <stall>
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          <stall>
          BNE    R1, R2, Loop
Scheduled, 7 cycles per iteration:
    Loop: L.D    F0, 0(R1)
          DADDUI R1, R1, #-8
          ADD.D  F4, F0, F2
          <stall>
          <stall>
          S.D    F4, 8(R1)
          BNE    R1, R2, Loop

Loop unrolling
Idea:
- Replicate the loop body several times.
- Adjust the termination code.
- Use different registers for each replica of an iteration to reduce dependencies.
Effect:
- Longer basic blocks.
- More of the available ILP can be used.
A C-level sketch of the idea follows.
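
As a minimal C sketch of the transformation (the helper name and signature are illustrative, not from the slides; like the assembly example two slides ahead, it assumes the element count is a multiple of 4):

    #include <stddef.h>

    /* Illustrative 4x-unrolled version of: for (i) x[i] += s;
       Assumes n is a multiple of 4. */
    void add_scalar_unrolled(double *x, size_t n, double s) {
        for (size_t i = 0; i < n; i += 4) {
            /* Four copies of the body form one larger basic block,
               with one index update and one branch per four elements. */
            x[i]     += s;
            x[i + 1] += s;
            x[i + 2] += s;
            x[i + 3] += s;
        }
    }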

Unrolling
Unrolled (x4):
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    F16, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop
Unrolling four iterations requires more registers.
This example assumes the array size is a multiple of 4.

Stalls and unrolling
Unrolled (x4) with stalls:
    Loop: L.D    F0, 0(R1)
          <stall>
          ADD.D  F4, F0, F2
          <stall>
          <stall>
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          <stall>
          ADD.D  F8, F6, F2
          <stall>
          <stall>
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          <stall>
          ADD.D  F12, F10, F2
          <stall>
          <stall>
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          <stall>
          ADD.D  F16, F14, F2
          <stall>
          <stall>
          S.D    F16, -24(R1)
          DADDUI R1, R1, #-32
          <stall>
          BNE    R1, R2, Loop
27 cycles for every 4 iterations: 6.75 cycles per iteration.

Scheduling and unrolling
Unrolled (x4) and scheduled:
    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          DADDUI R1, R1, #-32
          S.D    F16, 8(R1)
          BNE    R1, R2, Loop
- Code reorganization that preserves dependencies: semantically equivalent.
- Goal: fill the stall slots with useful work.
- The update of R1 is placed far enough from the BNE that uses it.
14 cycles for every 4 iterations: 3.5 cycles per iteration.

Limits of loop unrolling
- The improvement shrinks with each additional unrolling: the gain is limited to removing stalls, and the loop overhead is amortized over more iterations each time.
- Code size grows, which may hurt the instruction cache miss rate.
- Pressure on the register file: unrolling may run out of registers, and the advantages are lost if not enough registers are available.

Advanced branch prediction techniques

Branch prediction
Branches have a high impact on program performance. To reduce that impact:
- Loop unrolling.
- Static branch prediction at compile time, with each branch handled in isolation.
- Advanced branch prediction: correlated predictors and tournament predictors.

Dynamic scheduling
- Hardware reorders instruction execution to reduce stalls while preserving data flow and exception behavior.
- Handles cases unknown at compile time, such as cache misses and hits.
- Makes code less dependent on a concrete pipeline and simplifies the compiler.
- Enables hardware speculation.

Correlated prediction
Example:
    if (a == 2) { a = 0; }
    if (b == 2) { b = 0; }
    if (a != b) { f(); }
If the first and second branches are taken, the third is NOT taken.
- A correlated predictor maintains a history of the last branches to select among several predictors.
- An (m, n) predictor uses the outcome of the last m branches to select among 2^m predictors, each an n-bit predictor.
- A (1, 2) predictor: the outcome of the last branch selects between 2 two-bit predictors.
A C sketch of such a predictor follows.
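
A minimal C sketch of a (2, 2) correlating predictor, assuming a global history register that selects one of four 2-bit saturating counters per table entry (table size and all names are illustrative):

    #include <stdint.h>

    #define ENTRIES 1024  /* prediction entries, indexed by low PC bits */
    #define M 2           /* length of the global branch history */

    static uint8_t counters[ENTRIES][1u << M]; /* one 2-bit counter per history pattern */
    static unsigned history;                   /* outcomes of the last M branches */

    /* Counter values 2 and 3 predict taken. */
    int predict(uint32_t pc) {
        return counters[pc % ENTRIES][history] >= 2;
    }

    /* Saturating update of the selected counter, then shift the outcome
       into the global history. */
    void update(uint32_t pc, int taken) {
        uint8_t *c = &counters[pc % ENTRIES][history];
        if (taken && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1);
    }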

Size of predictor
An (m, n) predictor has several entries for each branch address.
Total size in bits:
    S = 2^m * n * (number of entries)
Examples:
- (0, 2) with 4K entries: 8 Kbits.
- (2, 2) with 4K entries: 32 Kbits.
- (2, 2) with 1K entries: 8 Kbits.

Miss rate
- A correlated predictor has fewer misses than a simple predictor of the same size.
- It also has fewer misses than a simple predictor with an unlimited number of entries.

Tournament prediction
Combines two predictors:
- A predictor based on global information.
- A predictor based on local information.
A selector chooses between the two predictors; switching between them is governed by a 2-bit saturating counter.
Advantage: allows different behavior for integer and FP code. On SPEC, the global predictor is selected about 40% of the time for integer benchmarks and about 15% for FP benchmarks.
Used in: Alpha and AMD Opteron.
A sketch of the selection mechanism follows.
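
A C sketch of the chooser, assuming the two component predictors exist elsewhere and one 2-bit saturating counter per entry decides which to trust (all names are illustrative):

    #include <stdint.h>

    /* Assumed component predictors (not shown): return 1 for taken. */
    extern int predict_global(uint32_t pc);
    extern int predict_local(uint32_t pc);

    #define CHOOSERS 1024
    static uint8_t chooser[CHOOSERS]; /* 2-bit counter: >= 2 means "trust global" */

    int predict_tournament(uint32_t pc) {
        if (chooser[pc % CHOOSERS] >= 2)
            return predict_global(pc);
        return predict_local(pc);
    }

    /* After the branch resolves: move the chooser toward whichever
       component predictor was correct (no change if both or neither). */
    void update_chooser(uint32_t pc, int g_correct, int l_correct) {
        uint8_t *c = &chooser[pc % CHOOSERS];
        if (g_correct && !l_correct && *c < 3) (*c)++;
        if (l_correct && !g_correct && *c > 0) (*c)--;
    }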

Intel Core i7
A two-level predictor:
- A smaller first-level predictor.
- A larger second-level predictor as backup.
Each level combines three predictors:
- A simple 2-bit predictor.
- A global history predictor.
- A loop-exit predictor (an iteration counter).
In addition: an indirect jump predictor and a return address predictor.

Introduction to dynamic scheduling

Dynamic scheduling
Idea: hardware reorders instruction execution to reduce stalls.
Advantages:
- Code compiled for one pipeline runs efficiently on another.
- Correctly manages dependencies that are unknown at compile time.
- Tolerates unpredictable delays (e.g., cache misses).
Drawback: more complex hardware.

Dynamic scheduling
Effects:
- Out-of-order (OoO) execution and out-of-order instruction completion.
- May introduce WAR and WAW hazards.
The ID stage is split into two stages:
- Issue: decodes the instruction and checks for structural hazards.
- Operand fetch: waits until there is no data hazard, then reads the operands.
Instruction fetch (IF) places instructions into an instruction register or instruction queue.

Dynamic scheduling techniques
- Scoreboard: stalls issued instructions until enough resources are available and there is no data hazard. Examples: CDC 6600, ARM Cortex-A8.
- Tomasulo's algorithm: removes WAR and WAW hazards through register renaming. Examples: IBM 360/91, Intel Core i7.
A sketch of register renaming follows.
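
To see why renaming removes WAR and WAW hazards, here is a minimal C sketch of a rename table mapping architectural to physical registers (a simplification of what Tomasulo achieves with reservation stations; all names are illustrative):

    #define ARCH_REGS 32

    static int map[ARCH_REGS];        /* architectural -> physical register */
    static int next_phys = ARCH_REGS; /* next free physical register */

    void rename_init(void) {
        for (int i = 0; i < ARCH_REGS; i++)
            map[i] = i;               /* identity mapping at start */
    }

    /* A write allocates a fresh physical register, so a later write to
       the same architectural register (WAW) or a pending earlier read
       (WAR) can no longer conflict with it. */
    int rename_dest(int r) { return map[r] = next_phys++; }

    /* A read uses the latest mapping; true RAW dependencies remain. */
    int rename_src(int r) { return map[r]; }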

Speculation

Branches and parallelism limits
- As parallelism increases, control dependencies become a harder problem; branch prediction alone is not enough.
- The next step is to speculate on the branch outcome and execute as if the speculation were right: instructions are fetched, issued, and executed speculatively.
- This requires a mechanism to handle wrong speculations.

Components
Three ideas combined:
- Dynamic branch prediction: selects the instructions to execute.
- Speculation: executes instructions before control dependencies are resolved; the effects may eventually have to be undone.
- Dynamic scheduling.
To achieve this, two things must be separated:
- Passing an instruction's result to the instructions that use it.
- Finalizing (committing) the instruction.
IMPORTANT: the processor state (register file and memory) is not updated until the changes are confirmed.

Solution: Reorder Buffer (ROB)
- When an instruction finishes executing, its result is written to the ROB.
- When its execution is confirmed, the real target is written.
- Instructions read modified data from the ROB.
ROB entry fields:
- Instruction type: branch, store, or register operation.
- Target: register id or memory address.
- Value: the instruction's result.
- Ready: indicates the instruction has completed.
A sketch of this entry layout follows.
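
A C sketch of the entry layout and of in-order commit as described above (buffer size and all names are illustrative):

    #include <stdint.h>

    enum rob_type { ROB_BRANCH, ROB_STORE, ROB_REG_OP };

    struct rob_entry {
        enum rob_type type; /* branch, store, or register operation */
        uint32_t target;    /* register id or memory address */
        uint64_t value;     /* result, held here until commit */
        int ready;          /* set when the instruction has completed */
    };

    #define ROB_SIZE 64
    static struct rob_entry rob[ROB_SIZE];
    static int head, count; /* oldest entry and number of valid entries */

    /* Commit in program order: only the oldest instruction may update
       the register file or memory, and only once it is ready. */
    void commit_ready(void) {
        while (count > 0 && rob[head].ready) {
            /* ...write rob[head].value to rob[head].target here... */
            head = (head + 1) % ROB_SIZE;
            count--;
        }
    }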

Multiple issue techniques

CPI < 1
With CPI >= 1, at most one instruction issues per cycle.
Multiple issue processors (CPI < 1, i.e., IPC > 1):
- Statically scheduled superscalar processors: in-order execution, variable number of instructions per cycle.
- Dynamically scheduled superscalar processors: out-of-order execution, variable number of instructions per cycle.
- VLIW (Very Long Instruction Word) processors: several operations packed into one instruction, static scheduling, ILP made explicit by the compiler.

Approaches to multiple issue
Several approaches:
- Static superscalar.
- Dynamic superscalar.
- Speculative superscalar.
- VLIW/LIW.
- EPIC.

Static superscalar
- Issue: dynamic.
- Hazard detection: hardware.
- Scheduling: static.
- Distinguishing feature: in-order execution.
- Examples: MIPS, ARM Cortex-A7.

Dynamic superscalar
- Issue: dynamic.
- Hazard detection: hardware.
- Scheduling: dynamic.
- Distinguishing feature: out-of-order execution with no speculation.
- Examples: none.

Speculative superscalar
- Issue: dynamic.
- Hazard detection: hardware.
- Scheduling: dynamic with speculation.
- Distinguishing feature: out-of-order execution with speculation.
- Examples: Intel Core i3/i5/i7, AMD Phenom, IBM POWER7.

VLIW
Packs several operations into a single instruction.
Example of one instruction in a VLIW ISA:
- One integer operation or a branch.
- Two independent floating point operations.
- Two independent memory references.
IMPORTANT: the code must exhibit enough parallelism to fill the slots.
A hypothetical layout of such a bundle follows.
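
The packet can be pictured as a fixed-width bundle of slots; a hypothetical C layout for the example above (encoding and names are assumptions, not a real ISA):

    #include <stdint.h>

    /* One very long instruction word: five operations issued together. */
    struct vliw_bundle {
        uint32_t int_or_branch; /* one integer operation or a branch */
        uint32_t fp[2];         /* two independent FP operations */
        uint32_t mem[2];        /* two independent memory references */
    };
    /* Slots the compiler cannot fill are padded with NOPs, one source of
       the code-size growth discussed on the next slide. */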

VLIW / LIW
- Issue: static.
- Hazard detection: mostly software.
- Scheduling: static.
- Distinguishing feature: all hazards determined and specified by the compiler.
- Examples: DSPs (e.g., TI C6x).

Problems with VLIW
Drawbacks of the original VLIW model:
- Complexity of statically finding enough parallelism.
- Large generated code size.
- No hazard detection hardware.
- More binary compatibility problems than in regular superscalar designs.
EPIC tries to solve most of these problems.

EPIC
- Issue: mostly static.
- Hazard detection: mostly software.
- Scheduling: mostly static.
- Distinguishing feature: all hazards determined and specified by the compiler.
- Examples: Itanium.

ILP limits

ILP limits
To study the maximum ILP we model an ideal processor:
- Infinite register renaming: all WAR and WAW hazards can be avoided.
- Perfect branch prediction: all conditional branch predictions are hits.
- Perfect jump prediction: all jumps (including returns) are correctly predicted.
- Perfect memory address alias analysis: a load can be safely moved before a store if the addresses are not identical.
- Perfect caches: all cache accesses take one clock cycle (always hit).

Available ILP
ILP available to the ideal processor (instructions per cycle):
    tomcatv    150.1
    doduc      118.7
    fpppp       75.2
    li          17.9
    espresso    62.6
    gcc         54.8

However...
More ILP implies more control logic, which brings:
- Smaller caches.
- A longer clock cycle.
- Higher energy consumption.
Practical limit: issue between 3 and 6 instructions per cycle.

Thread level parallelism

Why TLP?
Some applications exhibit more natural parallelism than can be extracted with ILP: servers, scientific applications, ...
Two models emerge:
- Thread level parallelism (TLP):
  - Thread: a process with its own instructions and data; it may be part of a program or an independent program.
  - Each thread has an associated state (instructions, data, PC, registers, ...).
- Data level parallelism (DLP): identical operations on different data items.

TLP
- ILP exploits implicit parallelism within a basic block or a loop.
- TLP uses multiple threads of execution that are inherently parallel.
- Goal: use multiple instruction streams to improve:
  - Throughput of computers running many programs.
  - Execution time of multi-threaded programs.

Multi-threaded execution
- Multiple threads share the processor's functional units, overlapping their use.
- State must be replicated n times: register file, PC, and page table (when the threads do not belong to the same program).
- Memory is shared through the virtual memory mechanisms.
- Hardware must support fast thread context switches.
Kinds:
- Fine grain: thread switch on every instruction.
- Coarse grain: thread switch on stalls (e.g., a cache miss).
- Simultaneous: fine grain combined with multiple issue and dynamic scheduling.

Fine-grain multithreading
- Switches between threads on each instruction, interleaving thread execution.
- Usually round-robin, skipping threads that are stalled; the processor must be able to switch every clock cycle (a sketch follows).
- Advantage: can hide both short and long stalls.
- Drawback: delays individual thread execution, because even a ready thread shares the pipeline.
- Example: Sun Niagara.
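
A C sketch of the per-cycle round-robin selection, skipping stalled threads (thread count and names are illustrative):

    #define NTHREADS 4

    static int ready[NTHREADS]; /* 0 while a thread waits on a stall */
    static int last;            /* thread that issued in the previous cycle */

    /* Called once per cycle: pick the next ready thread after the last
       one, in round-robin order. */
    int next_thread(void) {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (ready[t]) {
                last = t;
                return t;
            }
        }
        return -1; /* every thread is stalled this cycle */
    }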

Coarse-grain multithreading
- Switches threads only on long stalls (e.g., an L2 cache miss).
- Advantages: no need for a very fast thread switch; does not delay individual threads.
- Drawbacks: the pipeline must be flushed or frozen, then refilled with instructions from the new thread (start-up latency).
- Appropriate when refilling the pipeline takes much less time than the stall.
- Example: IBM AS/400.

SMT: Simultaneous multithreading
Idea: dynamically scheduled processors already have many of the mechanisms needed for multithreading:
- A large set of virtual registers, enough to hold the register sets of multiple threads.
- Register renaming, which avoids conflicts between threads accessing registers.
- Out-of-order completion.
Modifications needed:
- A per-thread renaming table.
- Separate PC registers.
- A separate ROB per thread.
Examples: Intel Core i7, IBM POWER7.

TLP: Summary
[Figure: issue-slot occupation over time for superscalar, fine-grain, coarse-grain, SMT, and multiprocessor execution, distinguishing threads 1 to 5 and stall slots.]

Conclusion

Summary
- Loop unrolling hides stall latencies, but offers limited improvement.
- Dynamic scheduling manages stalls that are unknown at compile time.
- Speculative techniques build on branch prediction and dynamic scheduling.
- Multiple issue is limited in practice to between 3 and 6 instructions per cycle.
- SMT is an approach to TLP within a single core.

References
Computer Architecture: A Quantitative Approach, 5th Ed. Hennessy and Patterson.
- Sections: 3.1, 3.2, 3.3, 3.4, 3.6, 3.7, 3.10, 3.12.
- Recommended exercises: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.11, 3.14, 3.17.
