What is the Itanium Architecture? Thomas Siebold, Technology Consultant, Alpha Systems Division. Agenda: Terminology; What is the Itanium Architecture?
Transcription
1 What is the Itanium Architecture?
Thomas Siebold, Technology Consultant, Alpha Systems Division
Agenda:
- Terminology
- What is the Itanium Architecture?
2 Terminology: Processor Architectures and Implementations
Alpha Architecture implementations:
- EV4, EV5, EV6, EV68, EV7
IA64 Architecture / Intel Itanium Architecture implementations (Itanium Processor Family):
- Merced: Itanium processor
- McKinley: Itanium 2 processor
- Madison: future Itanium processor
3 Itanium Processor Family Roadmap
Intel has enhanced the Itanium Processor Family roadmap:
- To deliver the most competitive product offerings for enterprise customers
- To pull in dual-core technology as early as possible and deliver a significant performance boost
- To maintain a consistent introduction rate for new Itanium Processor Family product offerings
Roadmap:
- 2002: Itanium 2 Processor (1 GHz, 3 MB L3)
- 2003: Itanium 2 Processor (Madison & Deerfield) (1.5 GHz, 6 MB L3)
- 2004: Itanium 2 Processor (Madison 9M) (>1.5 GHz, 9 MB L3)
- 2005: Montecito (dual core)
Silicon process: 0.18 µm, 0.13 µm, 90 nm
Montecito processor:
- Will enable dual-core technology
- Continues PAC611 and maintains the same bus protocol
- Extends Itanium 2 microarchitecture to 90 nm process technology
- Platform release target of 2005
Roadmap maintains world-class performance.
Next-generation processor technologies: New features!
[Figure: processor innovation chart running from CISC and RISC through superscalar designs to Explicitly Parallel Instruction Computing and multiple cores with integrated interconnects. Processors shown: IA-32, SPARC-III, MIPS 14K, POWER4, PA-8700, PA-8800, Alpha EV68, Alpha EV7, Alpha EV79, Itanium Processor Family, Itanium 2.]
4 Itanium 2 Processor
- 221M FETs
- 421 mm² die (19.5 mm x 21.6 mm)
- 90+% of the transistors and 50+% of the die area are devoted to cache and cache support logic!
What is the Itanium Architecture?
5 Traditional CPU Architectures
Performance barriers:
- Memory latency
- Branches
- Loop pipelining
- Procedure call / return overhead
Headroom constraints:
- Hardware-based instruction scheduling
- Unable to efficiently schedule parallel execution
Resource constraints:
- Too few registers
- Unable to fully utilize multiple execution units
EPIC: Explicitly Parallel Instruction Computing
Basic ideas:
- Static hardware design
  - Compiler creates record of execution: instructions in bundles
  - Machine plays record: distributes instructions among execution units
  - No runtime changes like out-of-order execution
  - High scalability of execution units
- Very Long Instruction Word (VLIW) concept
  - Focus is parallelism
  - 6 instructions in parallel (2 bundles per cycle)
  - High number of execution units
- Enhancement of VLIW concepts with:
  - Predication
  - Indication of parallelism in machine code
  - Speculative data loading
6 Improving Performance
The Itanium architecture boosts performance by allowing the compiler to provide information to the chip using available compile-time information. This moves the performance burden from the microarchitecture (chip) to the compiler.
Itanium architecture code accomplishes the following:
- Increases instruction level parallelism (ILP)
- Improves branch handling
- Reduces memory access cost
- Supports modular code (note)
7 Increasing Instruction Level Parallelism
Improving instruction level parallelism (ILP) by:
- Compiler/assembly writer is able to explicitly indicate parallelism
  - Instruction groups
- Three-instruction-wide word
  - Instruction bundle
  - Two executed per cycle
- Massive resources on chip
  - Large number of registers to avoid register contention
8 Instruction Format: Bundles & Templates
Bundle (128 bits):
- Set of three instructions (three 41-bit slots, 123 bits in total)
- Template (5 bits)
  - Identifies types of instructions in bundle: one of Integer, Memory, Branch, Floating, extended
  - Identifies independent operations ("stops"), e.g. MM_F
  - Defines execution units to be invoked when executing the bundle
- Compiler can schedule functional units to avoid contention
Explicitly Parallel Instruction Computing (EPIC)
- Bundle layout: S2 | S1 | S0 | T (three instruction slots and the template)
- 128-bit instruction bundles arrive from the I-cache
- The processor fetches one or more bundles for execution (implementation dependent; Itanium takes two)
- Functional units: MEM MEM INT INT FP FP B B B
- The processor tries to execute all instructions in parallel, depending on available units
- Instruction bundles are then retired
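The bundle layout above (a 5-bit template plus three 41-bit slots, 128 bits in total) can be illustrated with a small Python sketch. It assumes the template occupies the lowest five bits with slots 0 to 2 above it; the function names `pack_bundle` and `unpack_bundle` are hypothetical, and the template and slot values used with them are arbitrary numbers, not real instruction encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        # Slot i sits directly above the template and the earlier slots.
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & TEMPLATE_MASK
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    )
    return template, slots
```

Packing and then unpacking returns the original fields, and the largest possible bundle needs exactly 5 + 3 x 41 = 128 bits.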
9 Instruction Groups
Instruction groups:
- Set of instructions
- No dependencies (read-after-write) within group
- May execute in parallel
- The processor executes as many instructions per instruction group as possible, based on its resources
- Must contain at least one instruction (no upper limit)
- Instruction groups are indicated by cycle breaks (;;)
Instruction groups and bundles
    ld8 r5 = [r7]
    sub r1 = r2, r3
    add r10 = r20, r21 ;;
    add r1 = r1, r5 ;;
    st8 [r7] = r1
Instructions within a group may not have any register dependencies on one another. ;; indicates the end of a group.
Instruction bundles
    { .mii                  // template
      ld8 r10 = [r5]        // slot 0, Memory
      add r1 = r2, r3       // slot 1, Integer
      add r4 = r5, r6       // slot 2, Integer
    }
Instructions are fetched and executed in bundles.
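As a rough illustration of where the ;; stops in the example come from, the Python sketch below splits a straight-line instruction list into groups whenever an instruction reads or rewrites a register already written in the current group. This is a simplified model that looks at register names only; memory dependencies and real scheduling heuristics are ignored, and the function name `split_into_groups` and the tuple layout are assumptions made for this sketch.

```python
def split_into_groups(instructions):
    """Split instructions into groups with no register dependency inside a group.

    Each instruction is a tuple (text, dest, sources), where dest is a
    register name or None and sources is a list of register names.
    """
    groups = []
    current = []
    written = set()  # registers written so far in the current group
    for text, dest, sources in instructions:
        reads_written = any(src in written for src in sources)
        rewrites = dest in written
        if reads_written or rewrites:
            # A dependency on the current group forces a stop (;;).
            groups.append(current)
            current, written = [], set()
        current.append(text)
        if dest is not None:
            written.add(dest)
    if current:
        groups.append(current)
    return groups


program = [
    ("ld8 r5 = [r7]", "r5", ["r7"]),
    ("sub r1 = r2, r3", "r1", ["r2", "r3"]),
    ("add r10 = r20, r21", "r10", ["r20", "r21"]),
    ("add r1 = r1, r5", "r1", ["r1", "r5"]),
    ("st8 [r7] = r1", None, ["r7", "r1"]),
]
groups = split_into_groups(program)
```

For the five instructions from the slide this yields three groups, matching the two ;; stops in the handwritten code.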
10 Instruction groups and bundles
Itanium and Itanium 2 fetch 2 bundles at a time for execution. They may or may not execute in parallel.
Handwritten code (instruction groups):
    instr instr instr ;;
    instr instr ;;
    instr instr instr instr instr ;;
    instr instr ;;
    instr
Code generator output (instruction bundles):
    instr instr instr tmpl
    instr instr instr tmpl
    instr instr nop   tmpl
    instr nop   nop   tmpl
    instr instr nop   tmpl
    instr instr nop   tmpl
    instr instr instr tmpl
Forgetting end-of-group may be fatal:
    add r1 = r1, r5 ;;
    st8 [r7] = r1
Fetch and execution:
    instr instr instr tmpl
    instr instr instr tmpl
Can the bundle pair execute in parallel?
The code generator creates bundles, possibly including nops. There are two difficulties:
1) Finding instruction triplets matching the defined templates.
2) Matching pairs of bundles that can execute in parallel.
Massive On-Chip Resources
Several register files visible to the programmer:
- 128 General registers
- 128 Floating-point registers
- 64 Predicate registers
- 8 Branch registers
- 128 Application registers
- Instruction Pointer (IP) register
- Control registers
- Processor Status Register (includes slot index within current bundle)
11 Improving Branch Handling
What is the problem? Traditional CPUs:
- Branch prediction is used to predict the most likely set of instructions
- Correct branch prediction keeps the execution pipelines full
- A mispredicted branch flushes the pipeline with a large penalty
Itanium architecture improves branch handling:
- Provides a way to minimize branches using predicates
- Provides support for special branch instructions (counted loop)
12 Branch Handling
Predication:
- Conditional execution of instructions
- When the predicate is true, the instruction is executed
- When it is false, the instruction is treated as a NOP
- Predication converts a control dependency into a data dependency
- Predication eliminates branches in the code
Predication
Traditional code:
    if (a>b) c = c + 1
    else d = d * e + f
Avoid the branch by using predicated code:
    p1, p2 = compare(a>b)
    if (p1) c = c + 1
    if (p2) d = d * e + f
Predicate p1 is set to 1 if the compare is true, and to 0 if it evaluates to false; p2 is the complement of p1.
13 Predication
Before: instructions c = c + 1 and d = d * e + f are control dependent on a>b.
After: the instructions are data dependent:
- The values of p1 and p2 determine execution
- The branch is eliminated
Predication
Traditional Architecture:
    Cmp a,b
    Jump (br NEQ)
    then: Y = 3
    Jump (br end)
    else: Y = 4
Itanium Architecture:
    Cmp a,b -> pt, pf
    (pt) Y = 3
    (pf) Y = 4
Code for both paths is loaded and routed to different execution pipelines. Only one path will have a valid predicate and be executed.
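The if-conversion described on the predication slides can be mimicked in Python. This is only a behavioural sketch: the hypothetical helper `commit` stands in for the hardware treating an instruction as a NOP when its predicate is false, and the function names `branchy` and `predicated` are assumptions made for the illustration.

```python
def branchy(a, b, c, d, e, f):
    """Traditional code: one of two paths is chosen by a branch."""
    if a > b:
        c = c + 1
    else:
        d = d * e + f
    return c, d


def commit(predicate, old_value, new_value):
    """Model a predicated instruction: keep the result only if the predicate is true."""
    return new_value if predicate else old_value


def predicated(a, b, c, d, e, f):
    """Predicated code: both instructions are issued, predicates decide which takes effect."""
    p1 = a > b      # p1 is set when the compare is true
    p2 = not p1     # p2 is the complement of p1
    c = commit(p1, c, c + 1)
    d = commit(p2, d, d * e + f)
    return c, d
```

Both versions return the same values for any inputs; the difference is that the predicated version has no control dependency on the comparison, only a data dependency on p1 and p2.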
14 Reducing Memory Access Cost
Itanium architecture eliminates many memory accesses through:
- large register files to manage work in progress
- better control of the memory hierarchy (cache hints)
Itanium architecture reduces the cost of the remaining memory accesses by:
- moving load instructions earlier in the code
  - Data speculation: the execution of a load before a preceding store
  - Control speculation: the execution of a load before its guarding branch
- hiding memory latency
- enabling the processor to bring in the data in time
- avoiding processor stalls
15 Data Speculation
Advanced loads:
- Load is performed before a store that logically precedes it and may potentially use the same address
- Also referred to as advanced load
- At compile time, memory addresses need to be disambiguated (relationship)
Traditional architecture sequence:
    store(st_addr,data)
    load(ld_addr,target)
    use(target)
Itanium architecture sequence:
    aload(ld_addr,target)
    /* other operations including uses of target */
    store(st_addr,data)
    acheck(target,recovery_addr)
    use(target)
Control Speculation
- Load is performed before the branch that guards it
- Need to check for exceptions
Traditional architecture sequence:
    if a>b
    then load(ld_addr1,target1)
    else load(ld_addr2,target2)
Itanium architecture sequence:
    sload(ld_addr1,target1)
    sload(ld_addr2,target2)
    /* other operations including usage of target1/target2 */
    if a>b
    then scheck(target1,recovery_addr1)
    else scheck(target2,recovery_addr2)
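The aload/store/acheck sequence can be modelled with a small table that remembers which addresses were loaded early, in the spirit of the Advanced Load Address Table (ALAT). The Python class below is a simplified sketch: the class and method names are hypothetical, memory is a plain dictionary, and "recovery" is reduced to reloading the value instead of branching to recovery code.

```python
class AdvancedLoadTable:
    """Toy model of advanced loads: track early loads, invalidate on conflicting stores."""

    def __init__(self, memory):
        self.memory = memory   # address -> value
        self.entries = {}      # target register name -> address loaded early

    def aload(self, registers, target, address):
        """Advanced load: load now and remember the address for later checking."""
        registers[target] = self.memory[address]
        self.entries[target] = address

    def store(self, address, value):
        """Store: write memory and drop every tracked load from the same address."""
        self.memory[address] = value
        self.entries = {
            reg: addr for reg, addr in self.entries.items() if addr != address
        }

    def acheck(self, registers, target, address):
        """Check: return True if the early load is still valid, else reload it."""
        if target in self.entries:
            return True
        registers[target] = self.memory[address]  # recovery: redo the load
        return False
```

When the intervening store hits a different address, the check succeeds and the early value is used as is. When the store hits the same address, the check fails and the reload picks up the stored value.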
16 Massive Memory Resources
Physical memory:
- Full implementation will address 16 EB of physical memory (2^64 bytes): 16,000,000,000 GB
- Itanium architecture microprocessor has 44-bit address bus: 16 TB (16,000 GB) physical memory addressable
- Itanium 2 architecture microprocessor has 50-bit address bus
Virtual memory:
- Itanium architecture microprocessor uses 50 bits
- Itanium 2 architecture microprocessor uses 64 bits
Supporting Modular Code
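The sizes quoted for each address width follow directly from powers of two. The short Python sketch below converts an address width in bits into an addressable size using binary units; the helper names `addressable_bytes` and `human` are made up for this illustration.

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def addressable_bytes(address_bits):
    """Number of bytes reachable with the given number of address bits."""
    return 1 << address_bits


def human(size):
    """Express a byte count in the largest binary unit that divides it down."""
    unit = 0
    while size >= 1024 and unit < len(UNITS) - 1:
        size //= 1024
        unit += 1
    return f"{size} {UNITS[unit]}"


sizes = {bits: human(addressable_bytes(bits)) for bits in (44, 50, 64)}
```

With these definitions, 44 bits gives 16 TiB, 50 bits gives 1 PiB, and 64 bits gives 16 EiB, in line with the 16 TB and 16 EB figures on the slide.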
17 Procedure Call Overhead
Modular programs create more overhead:
- Programs tend to be call intensive
- Register space shared by caller and callee
- Call/returns require register save/restores
- Frequent memory access
- Limitations due to resource shortage
Itanium solution:
- Massive register resources
- Renaming, rotating
- Integer registers stackable
- Register Stack Engine (RSE)
  - Eliminates memory accesses
  - Allows local registers to be allocated dynamically
Register Stack
The general register file is divided into two subsets:
- Static: 32 permanent registers (r0-r31) visible to all procedures; used for global variables
- Stacked: the other 96 registers behave like a stack; procedure code allocates up to 96 registers for a frame
Frame allocation:
- previous frame is hidden
- first register is renamed to logical register r32
- small frames eliminate/reduce saving/restoring registers to/from memory
18 Procedure Call Overhead
IA-32:
    Procedure A: call B
    Procedure B: save current register state
                 ...
                 restore previous register state
                 return
Itanium Architecture:
    Procedure A: call B
    Procedure B: alloc, no save!
                 ...
                 no restore! (remap)
                 return
Register Stack Engine (RSE)
When a procedure is called:
- A new frame of registers is made available
- The caller's register contents remain in registers, invisible and inaccessible to the called procedure
- If deep nesting exhausts the physical registers, the RSE will save the contents of hidden registers to memory to free up resources
- On return to the caller, the caller's register contents are automatically restored
- The RSE works in the background, utilizing unused memory bandwidth
- Its activity is not visible to application programs
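The renaming behaviour of the register stack can be sketched in Python. This toy model is an assumption-laden simplification: each frame starts directly above the caller's frame (the overlap of caller output registers with callee inputs is omitted), spilling is only counted rather than performed, and the class `RegisterStack` with its method names is hypothetical.

```python
class RegisterStack:
    """Toy model of stacked general registers: every frame sees its registers from r32."""

    FIRST_STACKED = 32  # r0-r31 are static and not modelled here

    def __init__(self, physical=96):
        self.physical = physical  # number of physical stacked registers
        self.values = {}          # absolute stack position -> value
        self.base = 0             # absolute position of the current frame's r32
        self.size = 0             # size of the current frame
        self.saved = []           # (base, size) of hidden caller frames

    def alloc(self, size):
        """Allocate the current frame (like alloc): 'size' registers starting at r32."""
        if not 0 <= size <= self.physical:
            raise ValueError("frame larger than the stacked register file")
        self.size = size

    def call(self):
        """Call: hide the caller's frame; the callee starts with an empty frame."""
        self.saved.append((self.base, self.size))
        self.base += self.size
        self.size = 0

    def ret(self):
        """Return: the caller's frame becomes visible again, contents intact."""
        self.base, self.size = self.saved.pop()

    def _position(self, logical):
        offset = logical - self.FIRST_STACKED
        if not 0 <= offset < self.size:
            raise IndexError(f"r{logical} is outside the current frame")
        return self.base + offset

    def write(self, logical, value):
        self.values[self._position(logical)] = value

    def read(self, logical):
        return self.values[self._position(logical)]

    def registers_in_memory(self):
        """How many of the oldest registers no longer fit in the physical file."""
        return max(0, self.base + self.size - self.physical)
```

A caller that writes r32, calls a procedure that allocates its own frame and writes its own r32, and then returns, finds its original r32 value unchanged without any explicit save or restore. The `registers_in_memory` count grows only when nesting exceeds the physical register file, which is the case the RSE handles in the background.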
19 Loop Optimization Overhead
Enhance loop performance:
- Done by unrolling loops
- Causes code expansion
- Prologue/epilogue add to code size
Itanium solution:
- Software pipelining
- Architecture support
  - Minimal prologue/epilogue code
  - Predication
  - Loop control registers (LC, EC)
  - Loop branches (br.ctop, br.wtop)
IA64 Instruction Peculiarities
- There is a floating-point multiply and add instruction, fma (f = a*b+c).
- A simple floating-point multiply is an fma with c=0.
- A simple floating-point add is an fma with b=1.
- There is an integer multiply and add instruction, which executes in fp registers!
- There is a memory fence instruction: mf (Alpha: MB).
- There are three atomic semaphore instructions: xchg, cmpxchg and fetchadd.
- There are no load/store instructions with immediate offsets a la LDQ R1, 32(R5) on Alpha.
- There are speculative and advanced loads that do not exist on Alpha.
- The Register Stack Engine (RSE) is a powerful tool in procedure nestings.
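The relationship between fma and the plain multiply and add can be written out in a few lines of Python. This is an arithmetic illustration only: the functions `fma`, `fmul` and `fadd` are defined here for the sketch, and unlike a hardware fused multiply-add, this `fma` rounds after the multiply and again after the add.

```python
def fma(a, b, c):
    """Multiply-add as on the slide: f = a*b + c (not fused in this sketch)."""
    return a * b + c


def fmul(a, b):
    """A simple multiply is an fma with c = 0."""
    return fma(a, b, 0.0)


def fadd(a, c):
    """A simple add is an fma with b = 1."""
    return fma(a, 1.0, c)
```

For example, `fmul(3.0, 4.0)` evaluates 3.0 * 4.0 + 0.0 and `fadd(2.5, 0.25)` evaluates 2.5 * 1.0 + 0.25, so a single multiply-add unit covers all three operations.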
20 Itanium Architecture Training: Q & A
More informationCS 152 Computer Architecture and Engineering. Lecture 16: Vector Computers
CS 152 Computer Architecture and Engineering Lecture 16: Vector Computers Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste
More informationThe CPU IA-64: Key Features to Improve Performance
The CPU IA-64: Key Features to Improve Performance Ricardo Freitas Departamento de Informática, Universidade do Minho 4710-057 Braga, Portugal freitas@fam.ulusiada.pt Abstract. The IA-64 architecture is
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationThese slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information.
11-1 This Set 11-1 These slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information. Text covers multiple-issue machines in Chapter 4, but
More informationIA-64, P4 HT and Crusoe Architectures Ch 15
IA-64, P4 HT and Crusoe Architectures Ch 15 IA-64 General Organization Predication, Speculation Software Pipelining Example: Itanium Pentium 4 HT Crusoe General Architecture Emulated Precise Exceptions
More informationComputer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)
18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationChapter 13 Reduced Instruction Set Computers
Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining
More informationCS 152, Spring 2011 Section 8
CS 152, Spring 2011 Section 8 Christopher Celio University of California, Berkeley Agenda Grades Upcoming Quiz 3 What it covers OOO processors VLIW Branch Prediction Intel Core 2 Duo (Penryn) Vs. NVidia
More informationDEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.
More informationSimultaneous Multithreading (SMT)
Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue
More informationCMSC411 Fall 2013 Midterm 2 Solutions
CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationHY425 Lecture 09: Software to exploit ILP
HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 ILP techniques Hardware Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit
More informationHY425 Lecture 09: Software to exploit ILP
HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit ILP 1 / 44 ILP techniques
More informationUG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects
Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer
More informationMultiple Instruction Issue. Superscalars
Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths
More informationThese slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information.
11-1 This Set 11-1 These slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information. Text covers multiple-issue machines in Chapter 4, but
More informationEECC551 Exam Review 4 questions out of 6 questions
EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving
More informationChapter 4 The Processor (Part 4)
Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline
More informationAdvanced Instruction-Level Parallelism
Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu
More informationRISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.
COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped
More informationSuperscalar Processors
Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance
More informationAs the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.
Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction
More information4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16
4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt
More information