The IA-64 Architecture. Salient Points


The IA-64 Architecture
University of Maryland at College Park, Department of Electrical Engineering

Outline
  Architecture overview
  Background
  Architecture specifics

Salient Points
  128 registers (1 KB cache)
    Eliminates renaming
    Reduces cache accesses
  Speculative loads
    Moves loads past control barriers
  Predicated execution
    Eliminates 40% of conditional branches
  First time we've seen all these in one place

Goals of Architecture
  Overcome performance limiters:
    Branches
    Memory latency
    Sequential program model
  Long architecture lifetime:
    Large register file
    Fully interlocked architecture
    No fixed issue width
  Retain backward compatibility with x86

Background
  In the beginning... CISC
  RISC -> larger register files, simpler design (better IC technologies)
  VLIW -> possible to do lots in parallel
    Influenced superscalar design, dynamic scheduling
  Superscalar design: dependency checking, instruction dispatch, register renaming, out-of-order execution
    A lot of hardware required

Performance Limiters: Branches
  Branches: 20-30% of execution time on today's processors
  Mispredicts:
    4-issue, 8-stage pipe, branch resolved in stage 7: 24 issue slots squashed per mispredict
    95% prediction accuracy, branches = 1/6 of instructions: one mispredict per 120 instructions
    => 24 stalls per 120 instructions
  Also: branches break the program into basic blocks
    Severely limits opportunities for parallelism

Performance Limiters: Memory Latency
  Processor speed/performance: 60% per year
  Memory technology: 5% per year
  Not just memory => cache technology:
    Today's caches are often pipelined, with 2+ cycle latency (assumes AGI)
    load, use => load, STALL, use
  Compounded by issue width: STALL *= issue-width
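The arithmetic on the two slides above can be checked with a short sketch. The function names are made up for illustration, and the formula for squashed slots (issue width times the number of stages ahead of the resolving stage) is an assumption chosen because it reproduces the slide's figure of 24; the slide does not state how that number was derived.

```python
from fractions import Fraction


def squashed_slots(issue_width: int, resolve_stage: int) -> int:
    """Issue slots lost per mispredict.

    Assumption: every stage ahead of the resolving stage holds a full
    group of wrong-path instructions when the branch resolves.
    """
    return issue_width * (resolve_stage - 1)


def mispredict_overhead(issue_width, resolve_stage, accuracy, branch_fraction):
    """Squashed issue slots per instruction executed."""
    mispredicts_per_instr = branch_fraction * (1 - accuracy)
    return squashed_slots(issue_width, resolve_stage) * mispredicts_per_instr


def load_use_slots(stall_cycles: int, issue_width: int) -> int:
    """Issue slots lost to a load-use stall: STALL *= issue-width."""
    return stall_cycles * issue_width


# Slide parameters: 4-issue, branch resolved in stage 7,
# 95% prediction accuracy, one branch per six instructions.
slots = squashed_slots(4, 7)
overhead = mispredict_overhead(4, 7, Fraction(95, 100), Fraction(1, 6))
print(slots)     # 24
print(overhead)  # 1/5, i.e. 24 squashed slots per 120 instructions
```

Exact fractions are used so that the 24/120 ratio comes out without floating-point rounding.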

Performance Limiters: Sequential Program Model
  Today:
    Prog. lang. -> [compiler] -> parallel representation -> [compiler + assembler] -> serial representation (sequential program model) -> [hardware] -> parallel execution
  Sequential model undoes much of the compiler's work
    We spend HUGE amounts of die area to recover it
  Allows artificial sequentiality to creep in
  Goal:
    Prog. lang. -> [compiler] -> parallel representation -> [hardware] -> parallel execution

Obvious Answer: VLIW (Very Long Instruction Word)
  Where did VLIW fail that Intel/HP can win?
    Code expansion: A B NOP NOP
    Binary compatibility: A B C D vs: A B
    Cannot exploit full parallelism:
      Control instructions serialize execution
      Cannot move loads above branches

Intel's Solution: EPIC (Explicitly Parallel Instruction Computing)
  (Diagram labels: EPIC, IA-64, Merced)
  Predicated execution: eliminates if-then-else
  Speculative loads: allow crossing control
  Large register file: enables prefetches, reduces cache misses
  Variable instruction width: never need to insert NOP instructions
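The code-expansion complaint ("A B NOP NOP") can be made concrete with a toy packer. This is only a sketch: the operation names, the group list and the 4-wide machine are hypothetical, and real VLIW encodings differ in detail.

```python
def pack_fixed_width(groups, width):
    """Pack groups of independent operations into fixed-width VLIW words.

    Each group starts a new word; unused slots are padded with NOPs,
    which is where the code expansion comes from.
    """
    words = []
    for group in groups:
        for start in range(0, len(group), width):
            chunk = group[start:start + width]
            words.append(chunk + ["NOP"] * (width - len(chunk)))
    return words


def nop_fraction(words):
    """Fraction of all encoded slots that hold a NOP."""
    total = sum(len(word) for word in words)
    nops = sum(op == "NOP" for word in words for op in word)
    return nops / total


# Hypothetical program: three groups of mutually independent operations.
groups = [["A", "B"], ["C"], ["D", "E", "F", "G"]]
words = pack_fixed_width(groups, width=4)
for word in words:
    print(" ".join(word))
print(f"{nop_fraction(words):.0%} of slots are NOPs")
```

The same packed words also illustrate the binary-compatibility problem: a binary laid out for a 4-wide machine does not match a 2-wide one, because the word width is baked into the encoding.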

Variable Instruction: Bundle
  128 ... 0: Instr2 | Instr1 | Instr0 | Template
  Instructions:
    Opcode
    Predicate register (6)
    Source1 & Source2 (7 each)
    Dest (7)
  Template (8?):
    Instruction grouping (can do in 3 bits: I0, I1, I2 paired w/ prev)
    Prefetch hints?

Predicated Execution
  Jerry Huck on instruction-level parallelism: "Just because you allow it doesn't mean you're going to get any of it."
  Branches limit ILP:
    Sequential, no-predict: normal bank teller
    Sequential, predict: fill out slip in advance (predict whether deposit or withdrawal)
    Predicated execution: fill out both slips, throw away whichever is wrong
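The bundle layout can be sketched as bit packing. The slide marks its template width as a guess ("8?"); the sketch below instead assumes the published IA-64 encoding, in which a 128-bit bundle holds a 5-bit template in the low bits followed by three 41-bit instruction slots, and each slot carries its 6-bit qualifying predicate in its low bits. The function names are illustrative only.

```python
TEMPLATE_BITS = 5   # assumption: published IA-64 width, not the slide's "8?"
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("slot does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    ]
    return template, slots


def qualifying_predicate(slot):
    """Low 6 bits of a slot: which of the 64 predicate registers guards it."""
    return slot & 0x3F


# Arbitrary example values, not real instruction encodings.
bundle = pack_bundle(0x10, [0x123, 0x456, SLOT_MASK])
print(bundle.bit_length() <= 128)
print(unpack_bundle(bundle))
```

The field widths are consistent with the register counts on the slides: 6 bits select one of 64 predicate registers, and 7 bits select one of 128 general registers.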

Predication Example
  IF THEN ELSE:
      i1
      i2
      BEQ a b label1
      i3
      i4
      JUMP label2
    label1:
      i5
      i6
    label2:
      i7
      i8
  Becomes:
      i1
      i2
      p1, p2 <- (a != b), (a == b)
      (p1) i3
      (p1) i4
      (p2) i5
      (p2) i6
      i7
      i8
  64 1-bit predicate registers

Speculative Loads
  Before:
      i1
      i2
      BEQ a b label1
      load
      use
    label1:
      i3
      i4
  After:
      load.s
      i1
      i2
      BEQ a b label1
      chk.s
      use
    label1:
      i3
      i4
  Why the explicit check? A good reason was not explained.
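Both transformations above can be imitated in a few lines. This is a behavioral sketch, not a model of the hardware: instructions are just strings, memory is a dictionary, and all names are hypothetical. The guards are chosen so that the predicated version matches the branch version (i3 and i4 run on the fall-through path, i5 and i6 when a == b). The speculative-load half assumes the behavior of the shipped architecture, where a load.s that would fault records a deferred exception on its target instead of trapping and chk.s transfers to recovery code; here the recovery simply repeats the load non-speculatively.

```python
class DeferredFault:
    """Stand-in for the deferred-exception token left by a faulting load.s."""

    def __init__(self, address):
        self.address = address


def run_predicated(a, b):
    """Return the instructions that take effect in the if-converted code."""
    trace = ["i1", "i2"]
    p_taken = a == b          # guards the code that was at label1
    p_fall = not p_taken      # guards the fall-through code
    guarded = [(p_fall, "i3"), (p_fall, "i4"), (p_taken, "i5"), (p_taken, "i6")]
    for guard, op in guarded:
        if guard:             # a false predicate turns the op into a no-op
            trace.append(op)
    return trace + ["i7", "i8"]


def load_s(memory, address):
    """Speculative load: never raises, returns a token for a bad address."""
    if address not in memory:
        return DeferredFault(address)
    return memory[address]


def chk_s(value, memory, address):
    """Check a speculative result; on a deferred fault, redo the load."""
    if isinstance(value, DeferredFault):
        return memory[address]   # recovery: may now raise KeyError for real
    return value


def run_speculative(memory, address, a, b):
    value = load_s(memory, address)   # hoisted above the branch
    if a == b:
        return None                   # branch taken: value unused, no fault
    return chk_s(value, memory, address)


print(run_predicated(1, 1))
print(run_predicated(1, 2))
print(run_speculative({0x10: 42}, 0x10, 1, 2))
print(run_speculative({}, 0x10, 1, 1))
```

Under these assumptions the check has a clear job: a bad address only matters if the original program would have executed the load, so the fault is held back until the path that actually uses the value is reached.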

Long Architecture Life
  Large register file
    Like the jump from 8 -> 32 from the '70s to the '80s
  Fully interlocked
    Not tied to a particular implementation
  VLIW w/ variable instruction width
    Not tied to a particular implementation
    Works well for shovels of different widths