INSTRUCTION LEVEL PARALLELISM


INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011 ADVANCED COMPUTER ARCHITECTURES ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)

Outline 2

Increasing processing speedup:
- Speedup from pipelining
- Speedup from instruction scheduling and loop unrolling
Instruction Level Parallelism:
- Static (compiler) vs dynamic (processor) instruction scheduling approaches
- VLIW processors

Review of pipelining architectures 3

5-stage pipeline: IF, ID, EX, MEM, WB
Data hazards solved by:
- The WB stage using half a cycle (writes on the falling edge)
- Forwarding from the MEM and WB stages to EX
Control hazards solved by:
- Stalling until the dependency is resolved (i.e., the next PC is known)
- Delayed branches
- Static branch prediction
What is the speedup from pipelining?

4 Evaluating pipelining architectures Speedups

Speedup = T_NoPipeline / T_Pipeline
        = (#Cycles x T_CLK)_NoPipeline / (#Cycles x T_CLK)_Pipeline
        = (CPI_NoPipeline / CPI_Pipeline) x (T_CLK_NoPipeline / T_CLK_Pipeline)

where:
CPI_NoPipeline = CPI_Ideal = 1
CPI_Pipeline = CPI_Ideal + #Stalls / #Instructions = 1 + SPI

Considering that pipelining evenly balances the critical path across all stages:
#Stages = T_CLK_NoPipeline / T_CLK_Pipeline

so:
Speedup = #Stages / (1 + SPI)

Objective: maximize the number of pipeline stages while minimizing the average number of stalls per instruction (SPI). Notice: keep the critical path of all pipeline stages balanced.
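The final relation can be checked numerically; a minimal sketch in Python (the stage count and stall rates below are illustrative, not values from the slides):

```python
def pipeline_speedup(n_stages, spi):
    """Speedup = #Stages / (1 + SPI), assuming an ideal CPI of 1
    and evenly balanced critical paths across all stages."""
    return n_stages / (1 + spi)

print(pipeline_speedup(5, 0.0))            # no stalls: the ideal 5x
print(round(pipeline_speedup(5, 0.3), 2))  # 0.3 stalls/instruction: 3.85
```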

Evaluating pipelining architectures Speedups 5

Energy gain:
E_NoPipeline / E_Pipeline = (T_NoPipeline x P_NoPipeline) / (T_Pipeline x P_Pipeline)
                          = Speedup x (P_dyn + P_static)_NoPipeline / (P_dyn + P_static)_Pipeline
                          = [N / (1 + SPI)] x (alpha C V^2 f + V^2/R) / (alpha C V^2 N f + V^2/R)

Dynamic power consumption: P_dyn = alpha C V^2 f, where alpha is a switching factor and C is the effective capacitance of the IC
Static power consumption: P_static = V x I = V^2/R, where R is the effective resistance for the leakage currents
N — number of pipeline stages
SPI — average number of stalls per instruction
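The energy ratio can likewise be written as a small function; all electrical parameters below are illustrative placeholders, not measured values:

```python
def energy_gain(n_stages, spi, alpha, C, V, f, R):
    """E_NoPipeline / E_Pipeline = Speedup * P_NoPipeline / P_Pipeline.
    P_dyn = alpha*C*V^2*f and P_static = V^2/R; the pipelined clock runs
    n_stages times faster, so its dynamic power scales by n_stages."""
    speedup = n_stages / (1 + spi)
    p_no_pipeline = alpha * C * V**2 * f + V**2 / R
    p_pipeline = alpha * C * V**2 * n_stages * f + V**2 / R
    return speedup * p_no_pipeline / p_pipeline

# With negligible static power (huge R), the gain tends to 1/(1 + SPI):
print(round(energy_gain(5, 0.0, 1.0, 1.0, 1.0, 1.0, 1e12), 3))
```

When dynamic power dominates, pipelining speeds execution up but does not reduce energy; it is the static (leakage) term, paid for as long as the program runs, that shifts the balance in favour of the faster machine.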

Evaluating pipelining architectures Influence of control instructions 6

Consider a typical 5-stage pipelining architecture where:
- Around 15% of instructions are conditional or unconditional jumps
- In 60% of cases, the instruction sequence is broken (the branch is taken)
- All data dependencies are resolved by forwarding

Resolution of control hazards          | Penalty cycles | Stalls per instruction | Speedup
Pipeline Stall (BR CTRL @ EX stage)    | 2              | 0.30                   | 3.85

2 penalty cycles times 15% of instructions results in an average of 0.30 stalls per instruction.

Speedup = Number of Stages / (1 + stalls per instruction)

Evaluating pipelining architectures Influence of control instructions 7

Consider a typical 5-stage pipelining architecture where:
- Around 15% of instructions are conditional or unconditional jumps
- In 60% of cases, the instruction sequence is broken (the branch is taken)
- All data dependencies are resolved by forwarding

Resolution of control hazards          | Penalty cycles | Stalls per instruction | Speedup
Pipeline Stall (BR CTRL @ EX stage)    | 2              | 0.30                   | 3.85
Branch Predict Taken                   | 1              | 0.15                   | 4.35

A static predict-taken strategy brings no benefit on its own, since the jump address is only known in a later pipeline stage; 1 penalty cycle times 15% of instructions results in an average of 0.15 stalls per instruction.

Speedup = Number of Stages / (1 + stalls per instruction)

Evaluating pipelining architectures Influence of control instructions 8

Consider a typical 5-stage pipelining architecture where:
- Around 15% of instructions are conditional or unconditional jumps
- In 60% of cases, the instruction sequence is broken (the branch is taken)
- All data dependencies are resolved by forwarding

Resolution of control hazards          | Penalty cycles | Stalls per instruction | Speedup
Pipeline Stall (BR CTRL @ EX stage)    | 2              | 0.30                   | 3.85
Branch Predict Taken                   | 1              | 0.15                   | 4.35
Branch Predict Not Taken               | 1              | 0.09                   | 4.59

A static predict-not-taken strategy loses nothing when the jump is not taken, since the following instructions are already in the pipeline; there is only a penalty when the jump is taken. Thus, 1 penalty cycle times 15% of instructions times the 60% of cases where the jump is taken results in an average of 0.09 stalls per instruction.

Speedup = Number of Stages / (1 + stalls per instruction)

Evaluating pipelining architectures Influence of control instructions 9

Consider a typical 5-stage pipelining architecture where:
- Around 15% of instructions are conditional or unconditional jumps
- In 60% of cases, the instruction sequence is broken (the branch is taken)
- All data dependencies are resolved by forwarding

Resolution of control hazards          | Penalty cycles | Stalls per instruction | Speedup
Pipeline Stall (BR CTRL @ EX stage)    | 2              | 0.30                   | 3.85
Branch Predict Taken                   | 1              | 0.15                   | 4.35
Branch Predict Not Taken               | 1              | 0.09                   | 4.59
Delayed Branch                         | 0.5 (50%)      | 0.08                   | 4.63

A delayed branch strategy loses nothing if the delay slot can be filled with a useful instruction; however, in 50% of cases the delay slot remains empty (i.e., a NOP fills the slot). Thus, 1 penalty cycle times 50% of cases times 15% of instructions results in an average of 0.08 stalls per instruction.

Speedup = Number of Stages / (1 + stalls per instruction)

Evaluating pipelining architectures Influence of control instructions 10

Consider a typical 5-stage pipelining architecture where:
- Around 15% of instructions are conditional or unconditional jumps
- In 60% of cases, the instruction sequence is broken (the branch is taken)
- All data dependencies are resolved by forwarding

Resolution of control hazards           | Penalty cycles | Stalls per instruction | Speedup
Pipeline Stall (BR CTRL @ EX stage)     | 2              | 0.300                  | 3.85
Branch Predict Taken                    | 1              | 0.150                  | 4.35
Branch Predict Not Taken                | 1              | 0.090                  | 4.59
Delayed Branch                          | 0.5 (50%)      | 0.075                  | 4.65
Delayed Branch + Branch Predict Not Taken | 0.5 (50%)    | 0.045                  | 4.78

Speedup = Number of Stages / (1 + stalls per instruction)
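All four strategies reduce to one formula — stalls per instruction = penalty x branch fraction x fraction of branches that actually pay the penalty. A quick numeric check (the helper name is ours, not the slides'):

```python
def branch_speedup(n_stages, penalty, branch_frac, pay_frac=1.0):
    """Return (stalls per instruction, speedup) for a control-hazard strategy."""
    spi = penalty * branch_frac * pay_frac
    return spi, n_stages / (1 + spi)

print(branch_speedup(5, 2, 0.15))              # stall on every branch
print(branch_speedup(5, 1, 0.15))              # predict taken
print(branch_speedup(5, 1, 0.15, 0.60))        # predict not taken: pay only if taken
print(branch_speedup(5, 1, 0.15, 0.50))        # delayed branch: slot empty half the time
print(branch_speedup(5, 1, 0.15, 0.60 * 0.50)) # delayed branch + predict not taken
```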

11 Static instruction scheduling Can we improve processor performance just by wisely re-ordering the instructions?

Static instruction scheduling Example architecture 12

MIPS architecture (IF ID EX MEM WB pipeline):
- 1 delayed branch slot
- Forwarding

Latency: number of wait cycles between generating a result and consuming it.

Producing instruction | Consuming instruction
                      | INT ALU | Store | FP ALU
INT ALU               |   0     |   0   |   0
Load                  |   1     |   0   |   1
FP ALU                |   -     |   2   |   3

E.g.: the INT ALU -> INT ALU latency is zero, which means the instructions can be consecutive; the Load -> INT ALU latency is one, so if the instructions appear consecutively, one stall cycle is generated.

Static instruction scheduling Example assembly code 13

double X[100];
for (i = 99; i >= 0; i--)
    X[i] = X[i] + k;

; R2 contains the address of X[99]
; F1 contains the value of k
Cont: L.D    F0,0(R2)     ; F0 <- M[0+R2] (load double)
      ADD.D  F2,F0,F1     ; F2 <- F0 + F1 (add double)
      S.D    0(R2),F2     ; M[0+R2] <- F2 (store double)
      DSUBI  R2,R2,#8     ; R2 <- R2 - 8
      BNE    R2,R1,Cont   ; PC <- Cont if R2 != R1

Note: registers F0, F1, F2, ... are used to store double-precision floating point numbers

Static instruction scheduling Program execution 14

Applying the latency table, each iteration executes as:

Cycle 1: Cont: L.D    F0,0(R2)
Cycle 2:       (stall)              ; Load -> FP ALU latency
Cycle 3:       ADD.D  F2,F0,F1
Cycle 4:       (stall)              ; FP ALU -> Store latency
Cycle 5:       (stall)
Cycle 6:       S.D    0(R2),F2
Cycle 7:       DSUBI  R2,R2,#8
Cycle 8:       BNE    R2,R1,Cont
Cycle 9:       (stall)              ; empty branch delay slot

9 cycles per iteration

Static instruction scheduling Instruction re-ordering 15

Original:                        Re-ordered:
Cont: L.D    F0,0(R2)            Cont: L.D    F0,0(R2)
      ADD.D  F2,F0,F1                  ADD.D  F2,F0,F1
      S.D    0(R2),F2                  DSUBI  R2,R2,#8
      DSUBI  R2,R2,#8                  BNED   R2,R1,Cont
      BNE    R2,R1,Cont                S.D    8(R2),F2   ; fills the delay slot

Since DSUBI now executes before the store, the store offset becomes 8(R2) (the old 0(R2)).

Static instruction scheduling Speedup from instruction re-ordering 16

Cycle 1: Cont: L.D    F0,0(R2)
Cycle 2:       (stall)            ; Load -> FP ALU latency
Cycle 3:       ADD.D  F2,F0,F1
Cycle 4:       DSUBI  R2,R2,#8
Cycle 5:       BNED   R2,R1,Cont
Cycle 6:       S.D    8(R2),F2    ; branch delay slot

6 cycles per iteration

Re-scheduling speedup = 9 / 6 = 1.5 (50% faster)
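Both schedules can be replayed with a small issue-cycle simulator that applies the latency table; the representation below (instruction type, destination, sources) is our own modelling choice, and the unfilled delay slot in the first schedule is charged as an explicit NOP:

```python
# Extra wait cycles between a producer type and a consumer type (from the slides)
LAT = {("INT", "INT"): 0, ("INT", "ST"): 0, ("INT", "FP"): 0,
       ("LD",  "INT"): 1, ("LD",  "ST"): 0, ("LD",  "FP"): 1,
       ("FP",  "ST"): 2,  ("FP",  "FP"): 3}

def cycles(program):
    """Each entry: (type, dest_reg_or_None, [source_regs]).
    Single in-order issue: an instruction waits until every source
    is past its producer's latency."""
    writer = {}   # reg -> (producer_type, issue_cycle)
    cycle = 0
    for typ, dest, srcs in program:
        cycle += 1                      # next issue slot
        for r in srcs:
            if r in writer:
                ptyp, pcyc = writer[r]
                cycle = max(cycle, pcyc + 1 + LAT.get((ptyp, typ), 0))
        if dest is not None:
            writer[dest] = (typ, cycle)
    return cycle

original = [("LD", "F0", ["R2"]), ("FP", "F2", ["F0", "F1"]),
            ("ST", None, ["F2", "R2"]), ("INT", "R2", ["R2"]),
            ("INT", None, ["R2", "R1"]),   # BNE
            ("INT", None, [])]             # empty delay slot (NOP)
reordered = [("LD", "F0", ["R2"]), ("FP", "F2", ["F0", "F1"]),
             ("INT", "R2", ["R2"]), ("INT", None, ["R2", "R1"]),  # BNED
             ("ST", None, ["F2", "R2"])]   # S.D fills the delay slot
print(cycles(original), cycles(reordered))  # 9 6
```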

Static instruction scheduling Speedup from loop unrolling 17

Unrolling the loop 4 times and re-ordering:

Cont: L.D    F0,0(R2)
      L.D    F2,-8(R2)
      L.D    F3,-16(R2)
      L.D    F4,-24(R2)
      ADD.D  F0,F0,F1
      ADD.D  F2,F2,F1
      ADD.D  F3,F3,F1
      ADD.D  F4,F4,F1
      S.D    0(R2),F0
      S.D    -8(R2),F2
      S.D    -16(R2),F3
      DSUBI  R2,R2,#32
      BNED   R2,R1,Cont
      S.D    8(R2),F4    ; fills the delay slot (the old -24(R2))

14 cycles per 4 iterations = 3.5 cycles per iteration

Re-scheduling speedup = 9 / 3.5 = 2.57
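The unrolling arithmetic is easy to verify:

```python
# 4x unrolled body: 4 loads + 4 adds + 4 stores + DSUBI + branch = 14 instructions,
# and the re-ordered schedule executes them with no stalls.
cycles_per_iteration = 14 / 4
print(cycles_per_iteration)                # 3.5
print(round(9 / cycles_per_iteration, 2))  # 2.57 vs the original 9-cycle loop
print(round(6 / cycles_per_iteration, 2))  # 1.71 vs the re-ordered 6-cycle loop
```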

18 Instruction Level Parallelism

Additional speedups can be obtained by issuing multiple instructions in a single clock cycle:
- Exploits instruction-level parallelism (ILP)
- The simultaneous execution of up to N instructions per clock cycle allows decreasing the ideal CPI from 1 to 1/N

Exploring Instruction Level Parallelism (ILP) Static vs Dynamic Approaches 19

Very Long Instruction Word (VLIW) processors (e.g., Itanium): static instruction re-scheduling
- HAZARDS: solved by the compiler, which statically analyses the dependencies
- Allows more complex analysis

Superscalar processors (e.g., Intel, ARM): dynamic instruction re-scheduling
- HAZARDS: solved in real time using dedicated hardware structures
- Simpler analysis, but can take into account conflicts that are only visible during execution

Both approaches require:
- Multiple functional units
- Simultaneous fetch and decode of multiple instructions
which leads to additional conflicts.

Exploring Instruction Level Parallelism (ILP) Static vs Dynamic Approaches 20

Very Long Instruction Word (VLIW) processors (e.g., Itanium): static instruction re-scheduling
- INSTRUCTION SCHEDULING: the compiler analyses the dependencies and identifies instruction level parallelism (ILP)
- Using the collected information, the compiler generates groups of instructions (packets) that can be executed in parallel
- Allows reducing the hardware resources for controlling the processor, but cannot extract parallelism that is only visible during execution

Superscalar processors (e.g., Intel, ARM): dynamic instruction re-scheduling
- INSTRUCTION SCHEDULING: dynamic resolution of conflicts that takes into account information only known during execution; typical algorithms: scoreboard (centralized) and Tomasulo (distributed)
- Uses out-of-order instruction execution and register renaming to resolve the dependencies and extract ILP
- Leads to more complex control mechanisms and requires additional hardware resources

Exploring Instruction Level Parallelism (ILP) Static vs Dynamic Approaches 21

Very Long Instruction Word (VLIW) processors (e.g., Itanium): static instruction re-scheduling
- INSTRUCTION SCHEDULING: example case where parallelism cannot be extracted at compile time

Superscalar processors (e.g., Intel, ARM): dynamic instruction re-scheduling
- INSTRUCTION SCHEDULING: example case where parallelism is only known during execution:

S.D  100(R1),R2
L.D  R3,20(R4)    ; is 100+R1 = 20+R4? Only the runtime register values can tell
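The aliasing question above is exactly what dynamic hardware can answer and a static compiler cannot; a minimal sketch of the runtime disambiguation check (the function name and register values are illustrative):

```python
def may_reorder(store_base, store_offset, load_base, load_offset):
    """A load may be hoisted above an earlier store only if their
    effective addresses are known to differ (memory disambiguation)."""
    return store_base + store_offset != load_base + load_offset

# S.D 100(R1) vs L.D 20(R4): only the runtime values of R1 and R4 decide.
print(may_reorder(1000, 100, 2000, 20))  # True  -> independent, reordering is safe
print(may_reorder(1000, 100, 1080, 20))  # False -> same address, must not reorder
```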

Exploring Instruction Level Parallelism (ILP) VLIW processors 22

The compiler formats the instructions into packets:
- Each packet consists of a set of independent instructions
- If there are dependencies, they must be explicitly marked
- The need for hardware conflict detection is substantially reduced

In each clock cycle:
- The Instruction Fetch (IF) stage fetches a whole packet from memory
- The Instruction Decode (ID) stage decodes the complete packet and issues it to execution

23 Very Long Instruction Word (VLIW) Processors MIPS extension to VLIW

The MIPS-VLIW extension includes 5 different execution units, which translates into packets of at most 5 instructions. Unrolling the loop 10 times:

#  | Memory 1        | Memory 2        | FP 1             | FP 2              | Integer
1  | L.D F0,0(R0)    | L.D F2,-8(R0)   |                  |                   |
2  | L.D F3,-16(R0)  | L.D F4,-24(R0)  |                  |                   |
3  | L.D F5,-32(R0)  | L.D F6,-40(R0)  | ADD.D F0,F0,F1   | ADD.D F2,F2,F1    |
4  | L.D F7,-48(R0)  | L.D F8,-56(R0)  | ADD.D F3,F3,F1   | ADD.D F4,F4,F1    |
5  | L.D F9,-64(R0)  | L.D F10,-72(R0) | ADD.D F5,F5,F1   | ADD.D F6,F6,F1    |
6  |                 |                 | ADD.D F7,F7,F1   | ADD.D F8,F8,F1    |
7  | S.D 0(R0),F0    | S.D -8(R0),F2   | ADD.D F9,F9,F1   | ADD.D F10,F10,F1  |
8  | S.D -16(R0),F3  | S.D -24(R0),F4  |                  |                   |
9  | S.D -32(R0),F5  | S.D -40(R0),F6  |                  |                   | DSUBI R0,R0,#80
10 | S.D 32(R0),F7   | S.D 24(R0),F8   |                  |                   | BNED R0,R1,Cont
11 | S.D 16(R0),F9   | S.D 8(R0),F10   |                  |                   |

Takes 11 cycles to execute 10 iterations of the loop — an average of 1.1 cycles per iteration.
Speedup vs the original loop = 9 / 1.1 = 8.18
Speedup vs the re-ordered loop = 3.5 / 1.1 = 3.18

24 Very Long Instruction Word (VLIW) Processors MIPS extension to VLIW

Speedup vs the original loop = 9 / 1.1 = 8.18
Speedup vs the re-ordered loop = 3.5 / 1.1 = 3.18

Which of the two speedups should be used to compare the architectures? What is the maximum achievable speedup?

Very Long Instruction Word (VLIW) Processors Drawbacks 25

- Requires a large number of floating-point registers to expose the parallelism
- To fully expose the parallelism and maximise the use of the available functional units, loop unrolling must be applied aggressively
- Even then, functional unit utilization is low: e.g., the previous case achieves an average functional unit utilization of 58%
- The actual speedup is lower than the ideal one: in the previous case the ideal speedup is 5x, whereas the real value (assuming the critical path remains the same) is 3.18x
- VLIW processors require a large bandwidth to the register file: the previous case requires reading from 4 FP and 4 integer registers and writing to 4 FP and 3 integer registers
- Code incompatibility is a major drawback
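The 58% utilization figure quoted above can be reproduced from the 11-cycle schedule with a back-of-the-envelope check:

```python
slots = 5 * 11                    # 5 functional-unit issue slots over 11 cycles
instructions = 10 + 10 + 10 + 2   # 10 L.D + 10 ADD.D + 10 S.D + DSUBI + branch
print(round(instructions / slots, 2))  # 0.58 average functional unit utilization
print(round(3.5 / 1.1, 2))             # 3.18 real speedup vs the re-ordered loop
```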

Moving beyond VLIW processors Explicitly Parallel Instruction Computing (EPIC) 26

VLIW ISAs are not backward compatible between implementations:
- i.e., different packet widths and/or a different set of available functional units
- The variability of memory access times (due to CPU caches and RAM) makes the compiler's job harder

Explicitly Parallel Instruction Computing (EPIC):
- Instead of grouping instructions in fixed packets, groups them in bundles and uses a stop bit to indicate dependencies between groups
- New load instructions to decrease memory access time variability, e.g., software prefetching and speculative loading
- New branch instructions that combine multiple branch conditions in a single bundle
- Predicated execution modes that allow an instruction to be executed only if a condition holds
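The stop-bit idea can be shown with a simplified model (real IA-64 encodes stops at fixed positions in 3-instruction bundle templates; the flat stream below is our simplification):

```python
def issue_groups(instructions):
    """Split a stream of (instruction, stop_bit) pairs into groups that
    may issue in parallel: a set stop bit ends the current group."""
    groups, current = [], []
    for instr, stop in instructions:
        current.append(instr)
        if stop:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

# Two independent loads may issue together; the ADD.D that consumes them
# must start a new group, as must the store that consumes the ADD.D.
stream = [("L.D F0", 0), ("L.D F2", 1),
          ("ADD.D F4", 1),
          ("S.D F4", 0), ("BNE", 1)]
print(issue_groups(stream))
# [['L.D F0', 'L.D F2'], ['ADD.D F4'], ['S.D F4', 'BNE']]
```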

Static scheduling architectures Intel IA-64 Architecture 27

EPIC (Explicitly Parallel Instruction Computing):
- The compiler identifies the parallelism and schedules instruction execution, indicating which instructions can be performed in parallel
- Includes hardware support for instruction scheduling

Instructions are organized in:
- Groups of instructions that can be executed in parallel
- Coded in 128-bit bundles of 3 instructions

Registers:
- 128 x 64-bit general-purpose registers with 1 poison bit per register (32 GPR + 96 stacked)
- 128 x 82-bit FP registers (using the IEEE 80-bit format)
- 64 x 1-bit predicate registers
- 8 x 64-bit registers for indirect branches

Static scheduling architectures Intel IA-64 Architecture 28

Execution unit | Instruction type     | Example
Unit-I         | Integer ALU (A)      | Addition, subtraction, ...
               | Non-integer ALU (I)  | Bit test, move, ...
Unit-M         | Integer ALU (A)      | Addition, subtraction, ...
               | Memory access (M)    | Integer/FP load/store
Unit-F         | FP (F)               | Floating-point operations
Unit-B         | Branch (B)           | Jumps/calls
               | Extended (L+X)       | Extended immediates

24 possible bundle patterns (» indicates a stop, i.e., the end of a parallel section):
Pattern 0:  M I I
Pattern 1:  M I I»
Pattern 2:  M I» I
...
Pattern 10: M» M I
...
Pattern 29: M F B»

Each instruction is coded in 41 bits. The 5 most significant bits of the bundle encode the bundle pattern; the 6 least significant bits of each instruction specify the predication.

Static scheduling architectures Intel IA-64 Architecture 29

                                    | Itanium  | Itanium2
Pipeline stages                     | 10       | 8
Issued instructions per clock cycle | 6        | 6
Functional units:
- Integer (Type I)                  | 2        | 2
- Load/Store (Type M)               | 2        | 4
- Floating point (Type F)           | 3        | 3
- Branch (Type B)                   | 3        | 3
Latencies:
- Floating point                    | 4        | 4
- Branch misprediction              | up to 9  | up to 6

Architectures comparison Intel IA-64 (Itanium2) vs IA32e (Pentium 4) 30

[Figure: SPEC CPU INT2006 execution times (100-1500 s) for a Pentium 4 @ 3.8 GHz vs an Itanium2 @ 1.6 GHz across the benchmarks 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar and 483.xalancbmk]

Architectures comparison Intel IA-64 vs IA32e 31

[Figure: SPEC CINT2000 and CFP2000 scores (0-3000) for Itanium @ 0.8 GHz (ICC 5, 2001), Itanium 2 @ 1 GHz, 3M L3 (ICC 7, 2002), Itanium 2 @ 1.6 GHz, 6M L3 (ICC 8), Pentium 4 @ 1.3 GHz (ICC 5, 2002), Pentium 4 @ 3.07 GHz (ICC 7) and Core 2 E6300 @ 2.66 GHz (ICC 9, 2007)]

32 Next lesson

Compiler techniques to extract parallelism:
- Local techniques
- Global techniques

Very Long Instruction Word (VLIW) Processors Extracting ILP and scheduling 33

Local techniques:
- Parallelism is simpler to exploit when loop unrolling produces a long sequence of independent instructions
- Software pipelining is an effective technique for extracting parallelism and scheduling instructions through symbolic loop unrolling

Global techniques:
- Identifying and exploiting parallelism requires moving instructions across control dependencies; the algorithms involved are complex and achieve sub-optimal solutions
- Techniques that use hardware support to extract parallelism can also be used

Parallelizing loops 34

To extract parallelism from a loop, each iteration must be independent of the previous ones.

Parallelizable loop:
for (i = 0; i < N; i++)
    A[i] = A[i] + K;
Iteration i is independent of every other iteration j, so the iterations can be performed in any order.

Non-parallelizable loop:
for (i = 1; i < N; i++)
    A[i] = A[i-1] + K;
Iteration i depends on iteration i-1, so iteration i-1 must be performed before iteration i.
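The contrast can be made concrete with a small sketch (the array contents and the value of K are arbitrary):

```python
K, N = 3, 8

def independent(A):
    # Parallelizable: iteration i reads and writes only A[i]
    for i in range(len(A)):
        A[i] = A[i] + K
    return A

def dependent(A):
    # Not parallelizable: iteration i reads what iteration i-1 just wrote
    for i in range(1, len(A)):
        A[i] = A[i - 1] + K
    return A

# For the independent loop, any iteration order gives the same result:
forward = independent(list(range(N)))
backward = list(range(N))
for i in reversed(range(N)):
    backward[i] = backward[i] + K
print(forward == backward)      # True
print(dependent([0, 0, 0, 0]))  # [0, 3, 6, 9]: each value feeds the next iteration
```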