What is the Itanium Architecture? Thomas Siebold, Technology Consultant, Alpha Systems Division


Roger Jenkins
5 years ago

1 What is the Itanium Architecture?
Thomas Siebold, Technology Consultant, Alpha Systems Division

Agenda
- Terminology
- What is the Itanium Architecture?

2 Terminology
Processor Architectures and Implementations
- Architectures: Alpha Architecture; Intel Itanium Architecture (IA64 Architecture)
- Alpha implementations: EV4, EV5, EV6, EV68, EV7
- Itanium implementations (Itanium Processor Family): Merced (Itanium), McKinley (Itanium 2 processor), Madison, future Itanium processor

3 Itanium Processor Family Roadmap
Intel has enhanced the Itanium Processor Family roadmap:
- To deliver the most competitive product offerings for enterprise customers
- To pull in dual-core technology as early as possible and deliver a significant performance boost
- To maintain a consistent introduction rate for new Itanium Processor Family product offerings

Roadmap:
- 2002: Itanium 2 Processor (1 GHz, 3 MB L3)
- 2003: Itanium 2 Processor (Madison & Deerfield) (1.5 GHz, 6 MB L3)
- 2004: Itanium 2 Processor (Madison 9M) (>1.5 GHz, 9 MB L3)
- 2005: Montecito (dual core)
- Silicon process: 0.18 µm, 0.13 µm, 90 nm

Montecito processor will enable dual-core technology:
- Continues PAC611 and maintains the same bus protocol
- Extends Itanium 2 microarchitecture to 90 nm process technology
- Platform release target of 2005

Roadmap maintains world-class performance.

[Figure: next-generation processor technologies. Labels: CISC, RISC, SuperScalar, Explicitly Parallel Instruction Computing, Multiple Cores & Integrated Interconnects, Innovation, New features. Processors shown: IA-32 Processor Family, SPARC-III, MIPS 14K, PA-8700, Alpha EV68, POWER4, PA-8800, Alpha EV7, Alpha EV79, Itanium 2.]

4 Itanium2 Processor
- 221M FETs
- 421 mm² die (19.5 mm x 21.6 mm)
- 90+% of the transistors and 50+% of the die area are devoted to cache and cache support logic!

5 Traditional CPU Architectures
Performance barriers:
- Memory latency
- Branches
- Loop pipelining
- Procedure call / return overhead
Headroom constraints:
- Hardware-based instruction scheduling
- Unable to efficiently schedule parallel execution
Resource constraints:
- Too few registers
- Unable to fully utilize multiple execution units

EPIC: Explicitly Parallel Instruction Computing
Basic ideas:
- Static hardware design
- Compiler creates a record of execution: instructions in bundles
- Machine plays the record: distributes instructions among execution units
- No runtime changes like out-of-order execution
- High scalability of execution units
Very Long Instruction Word (VLIW) concept:
- Focus is parallelism
- 6 instructions in parallel (2 bundles per cycle)
- High number of execution units
Enhancement of VLIW concepts with:
- Predication
- Indication of parallelism in machine code
- Speculative data loading


6 Improving Performance
Itanium architecture boosts performance by allowing the compiler to provide information to the chip, using information available at compile time.
- Moves the performance burden from the microarchitecture (chip) to the compiler
Itanium architecture code accomplishes the following:
- Increases instruction level parallelism (ILP)
- Improves branch handling
- Reduces memory access cost
- Supports modular code (note)

7 Increasing Instruction Level Parallelism
Improving instruction level parallelism (ILP) by:
- Compiler/assembly writer is able to explicitly indicate parallelism (instruction groups)
- Three-instruction-wide word (instruction bundle), two executed per cycle
- Massive resources on chip: large number of registers to avoid register contention

8 Instruction Format: Bundles & Templates
Bundle (128 bits)
- Set of three instructions (41 bits each)
Template (5 bits)
- Identifies the types of instructions in the bundle: one of Integer, Memory, Branch, Floating, extended
- Identifies independent operations ("stops"), e.g. MM_F
- Defines the execution units to be invoked when executing the bundle
- Compiler can schedule functional units to avoid contention

Explicitly Parallel Instruction Computing (EPIC)
- Bundle layout: S2 | S1 | S0 | T (three instruction slots and the template)
- 128-bit instruction bundles are fetched from the I-cache
- The processor fetches one or more bundles for execution (implementation dependent; Itanium takes two)
- Functional units: MEM MEM INT INT FP FP B B B
- Try to execute all instructions in parallel, depending on available units
- Instruction bundles are then retired
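The bundle layout above (three instruction slots plus a template, S2 S1 S0 T) can be sketched in Python as plain bit packing. This is a minimal sketch, assuming the template sits in the low 5 bits and each slot is 41 bits wide; the function names and the sample field values are hypothetical and are not real instruction or template encodings.

```python
TEMPLATE_BITS = 5   # width of the template field
SLOT_BITS = 41      # width of each instruction slot
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("slot does not fit in 41 bits")
        # Slot 0 starts right above the template, slot 1 above slot 0, ...
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    )
    return template, slots
```

The field widths add up as 5 + 3 x 41 = 128, which is why a bundle with every field set to its maximum fills exactly 128 bits.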

9 Instruction Groups
Instruction groups:
- Set of instructions
- No dependencies (read-after-write) within group
- May execute in parallel
- The processor executes as many instructions per instruction group as possible, based on its resources
- Must contain at least one instruction (no upper limit)
- Instruction groups are delimited by stops (;;), also called cycle breaks

Instruction groups and bundles
Instruction groups:

    ld8 r5 = [r7]
    sub r1 = r2, r3
    add r10 = r20, r21 ;;
    add r1 = r1, r5 ;;
    st8 [r7] = r1

Instructions within a group may not have any register dependencies within the group. ;; indicates the end of a group.

Instruction bundles:

    { .mii                   // template
        ld8 r10 = [r5]       // slot 0, Memory
        add r1 = r2, r3      // slot 1, Integer
        add r4 = r5, r6      // slot 2, Integer
    }

Instructions are fetched and executed in bundles.
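The rule that a group must not contain a read-after-write register dependency can be checked mechanically. The following Python sketch splits a listing at each ;; stop and reports any register that is read after being written inside the same group. It is an illustration only: the helper names are hypothetical, the parser understands just the simple `op dest = src, src` and `st8 [addr] = src` forms used on the slide, and it ignores predicates, memory dependencies and the architecture's special cases.

```python
import re

REG = re.compile(r"\br\d+\b")  # general registers such as r1, r20


def split_groups(lines):
    """Split a listing into instruction groups at each ';;' stop."""
    groups, current = [], []
    for line in lines:
        text = line.strip()
        ends_group = text.endswith(";;")
        text = text.removesuffix(";;").strip()
        if text:
            current.append(text)
        if ends_group and current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def reads_writes(instr):
    """Return (registers read, registers written) for one instruction."""
    left, _, right = instr.partition("=")
    if "[" in left:  # store: address and data registers are both read
        return set(REG.findall(instr)), set()
    return set(REG.findall(right)), set(REG.findall(left))


def find_raw_hazards(group):
    """List (writer, reader, register) triples that violate the rule."""
    hazards, last_writer = [], {}
    for instr in group:
        reads, writes = reads_writes(instr)
        for reg in sorted(reads):
            if reg in last_writer:
                hazards.append((last_writer[reg], instr, reg))
        for reg in writes:
            last_writer[reg] = instr
    return hazards
```

Applied to the slide's example, each of the three groups comes back clean; dropping the stop between `add r1 = r1, r5` and `st8 [r7] = r1` makes the check report `r1`.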

10 Instruction groups and bundles
Itanium and Itanium2 fetch 2 bundles at a time for execution. They may or may not execute in parallel.

Handwritten code:

    instr instr instr ;;
    instr instr ;;
    instr instr instr instr instr ;;
    instr instr ;;
    instr

Instruction bundles produced by the code generator:

    instr instr instr tmpl
    instr instr instr tmpl
    instr instr nop   tmpl
    instr nop   nop   tmpl
    instr instr nop   tmpl
    instr instr nop   tmpl
    instr instr instr tmpl

Forgetting end-of-group may be fatal:

    add r1 = r1, r5 ;;
    st8 [r7] = r1

Fetch and execution: two bundles (instr instr instr tmpl, instr instr instr tmpl) are fetched together. Can the bundle pair execute in parallel?

Code generator creates bundles, possibly including nops. There are two difficulties:
1) Finding instruction triplets matching the defined templates.
2) Matching pairs of bundles that can execute in parallel.

Massive On Chip Resources
Several register files visible to the programmer:
- 128 General registers
- 128 Floating-point registers
- 64 Predicate registers
- 8 Branch registers
- 128 Application registers
- Instruction Pointer (IP) register
- Control Registers
- Processor Status Register (includes slot index within current bundle)

11 Improving Branch Handling
What is the problem? Traditional CPUs:
- Branch prediction is used to predict the most likely set of instructions
- Correct branch prediction keeps the execution pipelines full
- A mispredicted branch flushes the pipeline, with a large penalty
Itanium architecture improves branch handling:
- Provides a way to minimize branches using predicates
- Provides support for special branch instructions, such as counted loops

12 Branch Handling: Predication
- Conditional execution of instructions
- When the predicate is true, the instruction is executed
- When it is false, the instruction is treated as a NOP
- Predication converts a control dependency into a data dependency
- Predication eliminates branches in the code

Predication
Traditional code:

    if (a>b) c = c + 1
    else d = d * e + f

Avoid the branch by using predicated code:

    p1, p2 = compare(a>b)
    if (p1) c = c + 1
    if (p2) d = d * e + f

Predicate p1 is set to 1 if the compare evaluates to true, and to 0 if it evaluates to false. p2 is the complement of p1.

13 Predication
- Before: instructions c = c + 1 and d = d * e + f are control dependent on a>b
- After: instructions are data dependent on the values of p1 and p2, which determine execution
- The branch is eliminated

Predication: Traditional Architecture vs. Itanium Architecture
Traditional architecture:

          Cmp a,b
          Jump br NEQ
    then: Y = 3
          Jump brend
    else: Y = 4

Itanium architecture:

          Cmp a,b -> pt, pf
    (pt)  Y = 3
    (pf)  Y = 4

Code for both paths is loaded and routed to different execution pipelines. Only one path will have a valid predicate and be executed.
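The if/else example used for predication (c = c + 1 versus d = d * e + f) can be modelled in a few lines of Python. This is only a sketch of the idea, not of the hardware: both updates are always issued, and a hypothetical helper `commit` keeps the new value only when the guarding predicate is true. The function names are made up for the illustration.

```python
def branching(a, b, c, d, e, f):
    """Traditional form: control flow selects which statement runs."""
    if a > b:
        c = c + 1
    else:
        d = d * e + f
    return c, d


def commit(predicate, new_value, old_value):
    """A predicated instruction: acts as a NOP when the predicate is false."""
    return new_value if predicate else old_value


def predicated(a, b, c, d, e, f):
    """Predicated form: one compare sets p1 and its complement p2."""
    p1 = a > b
    p2 = not p1
    c = commit(p1, c + 1, c)        # (p1) c = c + 1
    d = commit(p2, d * e + f, d)    # (p2) d = d * e + f
    return c, d
```

Both functions return the same pair for any inputs; the difference is that the second depends on the data values p1 and p2 rather than on a jump.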

14 Reducing Memory Access Cost
Itanium architecture eliminates many memory accesses through:
- Large register files to manage work in progress
- Better control of the memory hierarchy (cache hints)
Itanium architecture reduces the cost of the remaining memory accesses by moving load instructions earlier in the code:
- Data speculation: the execution of a load before a preceding store
- Control speculation: the execution of a load before its guarding branch
This:
- Hides memory latency
- Enables the processor to bring in the data in time
- Avoids stalling the processor

15 Data Speculation: Advanced Loads
- Load is performed before a store that logically precedes it and may potentially use the same address
- Also referred to as an advanced load
- At compile time, memory addresses need to be disambiguated (their relationship determined)

Traditional sequence:

    store(st_addr,data)
    load(ld_addr,target)
    use(target)

Itanium architecture sequence:

    aload(ld_addr,target)
    /* other operations including uses of target */
    store(st_addr,data)
    acheck(target,recovery_addr)
    use(target)

Control Speculation
- Load is performed before the branch that guards it
- Need to check for exceptions

Traditional sequence:

    if a>b
    then load(ld_addr1,target1)
    else load(ld_addr2,target2)

Itanium architecture sequence:

    sload(ld_addr1,target1)
    sload(ld_addr2,target2)
    /* other operations including usage of target1/target2 */
    if a>b
    then scheck(target1,recovery_addr1)
    else scheck(target2,recovery_addr2)
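The aload / store / acheck sequence for data speculation can be illustrated with a small Python model. It is a sketch under stated assumptions: a table remembers which address each advanced load read, a later store to the same address removes that entry, and the check runs recovery code (here simply a reload) when the entry is gone. The class and function names are hypothetical, memory is a plain dictionary, and real hardware tracks addresses and access sizes in more detail.

```python
class AdvancedLoadTable:
    """Toy tracker for advanced loads, keyed by target register name."""

    def __init__(self, memory, regs):
        self.memory = memory    # address -> value
        self.regs = regs        # register name -> value
        self.entries = {}       # target register -> address it was loaded from

    def aload(self, addr, target):
        """Advanced load: load early and remember the address."""
        self.regs[target] = self.memory[addr]
        self.entries[target] = addr

    def store(self, addr, value):
        """Store: invalidate every advanced load from the same address."""
        self.memory[addr] = value
        self.entries = {t: a for t, a in self.entries.items() if a != addr}

    def acheck(self, target, recovery):
        """Return True if the early load is still valid, else run recovery."""
        if self.entries.pop(target, None) is not None:
            return True
        recovery()
        return False


def run(ld_addr, st_addr):
    """aload, then a store that may alias, then acheck and use."""
    memory = {0x100: 1, 0x200: 2}
    regs = {}
    table = AdvancedLoadTable(memory, regs)
    table.aload(ld_addr, "target")
    table.store(st_addr, 42)

    def recovery():
        regs["target"] = memory[ld_addr]  # redo the load after the store

    valid = table.acheck("target", recovery)
    return valid, regs["target"]
```

When the store goes to a different address the early value is kept; when it hits the same address the check fails and the reload picks up the stored value, so `use(target)` sees correct data either way.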

16 Massive Memory Resources
Physical memory
- Full implementation will address 16 EB of physical memory (2^64 bytes), about 16,000,000,000 GB
- Itanium architecture microprocessor has a 44-bit address bus: 16 TB (16,000 GB) of physical memory addressable
- Itanium2 architecture microprocessor has a 50-bit address bus
Virtual memory
- Itanium architecture microprocessor uses 50 bits
- Itanium2 architecture microprocessor uses 64 bits

Supporting Modular Code

17 Procedure Call Overhead
Modular programs create more overhead:
- Programs tend to be call intensive
- Register space is shared by caller and callee
- Calls/returns require register saves/restores
- Frequent memory access
- Limitations due to resource shortage
Itanium solution:
- Massive register resources
- Renaming, rotating
- Integer registers are stackable
- Register Stack Engine (RSE)
- Eliminates memory accesses
- Allows local registers to be allocated dynamically

Register Stack
The general register stack is divided into two subsets:
- Static: 32 permanent registers (r0-r31), visible to all procedures; used for global variables
- Stacked: the 96 other registers (r32-r127) behave like a stack; procedure code allocates up to 96 registers for a frame
Frame allocation:
- Previous frame is hidden
- First register is renamed to logical register r32
- Small frames eliminate or reduce saving/restoring registers to/from memory

18 Procedure Call Overhead
IA-32:

    Procedure A
        call B
    Procedure B
        save current register state
        ...
        restore previous register state
        return

Itanium Architecture:

    Procedure A
        call B
    Procedure B
        alloc, no save!
        ...
        no restore! (remap)
        return

Register Stack Engine (RSE)
When a procedure is called:
- A new frame of registers is made available
- The caller's register contents remain in registers, invisible and inaccessible to the called procedure
- If deep nesting exhausts the physical registers, the RSE will save the contents of hidden registers to memory to free up resources
- On return to the caller, the caller's register contents are automatically restored
- The RSE works in the background, utilizing unused memory bandwidth
- Its activity is not visible to application programs
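The behaviour described for the register stack and the RSE (a fresh frame per call, logical r32 renamed onto the first register of the frame, hidden registers saved to memory only when the physical registers run out) can be sketched in Python. This is a deliberately simplified model and not the real mechanism: frames here never overlap (real callers pass arguments through shared output registers), spilling happens eagerly at call time rather than in the background, and the class and method names are hypothetical.

```python
class RegisterStack:
    """Toy model of the stacked registers r32-r127 and the RSE."""

    PHYSICAL = 96  # number of stacked physical registers

    def __init__(self):
        self.phys = [0] * self.PHYSICAL
        self.frames = []         # (base, size); base is an unwrapped index
        self.backing_store = {}  # unwrapped index -> value saved to memory
        self.spilled_below = 0   # unwrapped indexes below this are in memory
        self.spills = 0          # how many registers had to be saved

    def call(self, size):
        """Allocate a new frame of `size` registers above the caller's."""
        if not 0 <= size <= self.PHYSICAL:
            raise ValueError("frame too large")
        base = sum(self.frames[-1]) if self.frames else 0
        # Save the oldest hidden registers until the new frame fits.
        while base + size - self.spilled_below > self.PHYSICAL:
            index = self.spilled_below
            self.backing_store[index] = self.phys[index % self.PHYSICAL]
            self.spilled_below += 1
            self.spills += 1
        self.frames.append((base, size))

    def ret(self):
        """Drop the current frame and refill the caller's registers."""
        self.frames.pop()
        if self.frames:
            base, _ = self.frames[-1]
            while self.spilled_below > base:
                self.spilled_below -= 1
                index = self.spilled_below
                self.phys[index % self.PHYSICAL] = self.backing_store.pop(index)

    def _slot(self, logical):
        base, size = self.frames[-1]
        offset = logical - 32  # r32 is always the first register of a frame
        if not 0 <= offset < size:
            raise IndexError("register outside the current frame")
        return (base + offset) % self.PHYSICAL

    def write(self, logical, value):
        self.phys[self._slot(logical)] = value

    def read(self, logical):
        return self.phys[self._slot(logical)]


def demo():
    stack = RegisterStack()
    stack.call(40)          # procedure A
    stack.write(32, 111)
    stack.call(40)          # procedure B
    stack.call(40)          # procedure C: 120 registers needed, 24 are spilled
    stack.write(48, 999)    # reuses the physical register behind A's r32
    stack.ret()
    stack.ret()             # back in A: its registers are refilled
    return stack.read(32), stack.spills
```

In `demo`, procedure A still reads its own value from r32 after two nested calls, even though procedure C reused the same physical register in between; no procedure contains explicit save or restore code.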

19 Loop Optimization Overhead
Enhancing loop performance:
- Done by unrolling loops
- Causes code expansion
- Prologue/epilogue add to code size
Itanium solution:
- Software pipelining
- Architecture support
- Minimal prologue/epilogue code
- Predication
- Loop control registers (LC, EC)
- Loop branches (br.ctop, br.wtop)

IA64 Instruction Peculiarities
- There is a floating-point multiply and add instruction, fma (f = a*b+c).
- A simple floating-point multiply is an fma with c=0.
- A simple floating-point add is an fma with b=1.
- There is an integer multiply and add instruction, which executes in fp registers!
- There is a memory fence instruction: mf (Alpha: MB).
- There are three atomic semaphore instructions: xchg, cmpxchg and fetchadd.
- There are no load/store instructions with immediate offsets a la LDQ R1, 32(R5) on Alpha.
- There are speculative and advanced loads that do not exist on Alpha.
- The Register Stack Engine (RSE) is a powerful tool in procedure nestings.
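The statement that multiply and add are both special cases of fma (f = a*b+c) is easy to check numerically. The Python sketch below is only a model of the arithmetic identity: the function names are hypothetical, and the expression `a * b + c` rounds after the multiply, whereas a fused multiply-add rounds once at the end.

```python
def fma(a, b, c):
    """Model of f = a*b + c (not actually fused in this sketch)."""
    return a * b + c


def fmul(a, b):
    """Multiply expressed as an fma with c = 0."""
    return fma(a, b, 0.0)


def fadd(a, c):
    """Add expressed as an fma with b = 1."""
    return fma(a, 1.0, c)
```

For ordinary values, `fmul(3.0, 4.0)` gives the same result as `3.0 * 4.0` and `fadd(2.5, 0.5)` the same as `2.5 + 0.5`.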

20 Itanium Architecture Training: Q & A
