ECE 172 Digital Systems. Chapter 14 Itanium EPIC Processor Architecture. Herbert G. Mayer, PSU Status 5/10/2018


Syllabus
Introduction
Intel Itanium Architecture
Data and Memory
Itanium Registers
Instruction Set Architecture ISA
Assembler Source Program
Bibliography

[Figure: Photo of Itanium 2 Processor]

[Figure: Itanium Processor Block Diagram]

Introduction
Itanium is Intel's first publicly announced, commercial 64-bit computer product, launched 2001, co-developed with HP Corp.
IPF stands for Itanium Processor Family
Publicly announced: smart Intel was diligently and secretly developing a contemporaneous, competing 64-bit processor, an extended version of its ancient x86 architecture, just in case, as a backup risk hedge. Lucky they did!
64-bit means that the logical address range spans 2^64 different memory bytes; also, natural integer objects are 64 bits wide
The exact format of data objects is described in section Data and Memory
During its development at Intel, the first generation of Itanium processors was internally code-named Merced
The family is now officially called IPF, for Itanium Processor Family, while early on it was referred to as IA-64, for Intel 64-bit architecture; this later conflicted with the 64-bit version of the x86 family!

Intel's Itanium architecture is radically, completely different from the widely used 32-bit IA-32 architecture
IA-32 should be referred to as x86 architecture, lest one incorrectly infer today that it is restricted to 32-bit addresses and integer types of 32-bit length
That limitation no longer exists since the introduction of 64-bit versions about ½ year after AMD's extension of IA-32 to 64 bits; see also EM64T
Imagine how Intel felt when AMD, the company that had produced CPUs compatible with Intel's chips, suddenly had a more advanced, attractive x86 CPU!

[Figure: Pat Gelsinger, former Intel VP, with Itanium chips]

Intel Itanium Architecture
Interestingly, IA-32 object code is executable on Itanium processors; caveat!
More interesting yet, even Hewlett-Packard PA-RISC code is executable on Intel's and HP's novel 64-bit IPF processor
HP and Intel were strategic partners in definition, development, and cost sharing of the IPF, with HP having initiated the development
Be cautious about performance inferences! Just because IA-32 object code is executable on IPF, one should not deduce that such code executes on IPF as fast as on an x86 processor!

IPF is Intel's and HP's first instance of the novel EPIC architecture; different from PA-RISC, different from x86!
EPIC stands for Explicitly Parallel Instruction Computing
It is Intel's first launched 64-bit architecture; the second was launched later (1Q2004) with EM64T, the first 64-bit version of the ancient x86 architecture
HP already had a 64-bit version of its Precision Architecture (PA) RISC processor at the time of the Itanium launch
Explicit means that the assembly language programmer (or the smart compiler) bears the intellectual burden of taking advantage of the parallelism in the architecture; see ref [8]
It is not the processor that automatically exploits the numerous parallel computing modules; the microprocessor needs to be told!

As a consequence, compilers for IPF are highly complex; see Donald Knuth's comment, ref [7]
Compiler complexity is not desirable, as it means more errors and decreased object code quality, something a new architecture should avoid
On the other hand, the IPF provides explicit architectural features that enable implementing highly optimizing compilers
A case in point is architectural support for software pipelined loops (SW PL)
Certain source constructs let the compiler emit SW PL loops that need no prologue and epilogue
Absence of prologue and epilogue renders the object code not only more compact, but also faster

Parallel means an Itanium processor gains speed not solely via high clock rates, but via simultaneous execution of multiple operations in one clock cycle
Key concepts refined, or newly introduced, in IPF include: predication, branch prediction, branch elimination, conditional move, speculation, parallel comparisons, and a large register file
The first implementation of the new 64-bit Intel + HP Itanium architecture implemented only 44 physical of the 64 logical address bits

With just 44 address bits, the total initial address range of the first Itanium HW was about a millionth of the logical address range, yet ~4000 times larger than on an old-fashioned 32-bit architecture
In its second generation, 56 physical bits of the 64-bit logical address space were implemented in HW
Product name of second generation: Itanium 2
Short-term, no severe limitations were expected with restricted 56-bit addresses
Still about 16 million times larger than a 32-bit address space
Integer type operands are of course a full 64 bits wide
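The ratios quoted above follow directly from powers of two. The sketch below recomputes them; the function name `address_space_bytes` is only an illustrative helper, not anything from the slides.

```python
# Recompute the address-range ratios quoted on the slide.
# address_space_bytes is a hypothetical helper name chosen for this sketch.

def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given bit count."""
    return 1 << address_bits

logical = address_space_bytes(64)      # full logical address range
first_gen = address_space_bytes(44)    # first Itanium hardware
second_gen = address_space_bytes(56)   # second generation, per the slide
classic_32 = address_space_bytes(32)   # old-fashioned 32-bit machine

print(logical // first_gen)     # "about a millionth": factor 2**20
print(first_gen // classic_32)  # "~4000 times larger": factor 2**12
print(second_gen // classic_32) # "about 16 million times": factor 2**24
```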

Unlike earlier parallel VLIW architectures, EPIC has no fixed-width instruction encoding
Instead, operations can be combined to function in parallel; anything from a single instruction to many instructions can be combined
Critical in EPIC is that all code be written assuming parallel semantics within a group (to be explained later), and sequential semantics across groups
To be able to run in parallel, the machine is built with multiple execution modules that can all work at the same time
This allows natural architecture migration from, say, 6 HW modules executing on today's Itanium, to as many as can be crammed into a future silicon microprocessor, years from now

To illustrate with a sample taken from ref [1], consider 2 memory operands a and b to be swapped:
    temp := a;    // a, b, temp are memory locs
    a := b;
    b := temp;
The semicolon operator ; implies sequential semantics
On a machine with parallel semantics, it would be sufficient to write:
    a := b,       // operand latching needed
    b := a;       // operand latching needed
with the comma operator implying parallel semantics, similar to syntactic conventions in the programming language Algol-68
This source snippet is just a generic example; NOT a sample of the Itanium assembly language
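The difference between sequential and parallel semantics can be tried out in Python, whose tuple assignment evaluates all right-hand sides before storing any result, much like the operand latching mentioned above. This is only an analogy in a high-level language; the function names are made up for the sketch.

```python
# Sequential versus parallel assignment semantics, as a Python analogy.
# All function names are hypothetical; none of this is Itanium code.

def swap_sequential(a, b):
    """Classic three-step swap: each statement sees the previous one's effect."""
    temp = a
    a = b
    b = temp
    return a, b

def swap_parallel(a, b):
    """Both right-hand sides are read (latched) before either target is written."""
    a, b = b, a
    return a, b

def broken_sequential(a, b):
    """The two parallel assignments executed one after the other lose a value."""
    a = b
    b = a
    return a, b

print(swap_sequential(1, 2))
print(swap_parallel(1, 2))
print(broken_sequential(1, 2))
```

Running the two assignments without latching, as in `broken_sequential`, leaves both variables holding the old value of b, which is why the sequential version needs the temporary.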


Data and Memory
Native data types of IPF resemble those of conventional 32-bit architectures, except for the longer 64-bit integer and unsigned formats
An extension over IA-32 object code is the IPF bundle
Data types include integer, unsigned, floating-point, and pointer
Integers are of different widths: byte, word, double-word, or quad-word precision
Length in bits as well as min and max values are listed below:

Data and Memory, Min Max

Type              Byte   Word   Double-word   Quad-word
Integer [bits]    8      16     32            64
Unsigned [bits]   8      16     32            64
Pointer [bits]    NA     NA     Comp          64
Float [bits]      NA     NA     32, 64        64, 80

Type          Byte   Word      Double-word      Quad-word
Minint        -128   -32,768   -2,147,483,648   -9,223,372,036,854,775,808
Maxint        127    32,767    2,147,483,647    9,223,372,036,854,775,807
Minunsigned   0      0         0                0
Maxunsigned   255    65,535    4,294,967,295    18,446,744,073,709,551,615
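The minimum and maximum values in the table follow from the bit widths alone. A minimal sketch, assuming two's complement for the signed types; the names `WIDTHS`, `signed_range`, and `unsigned_range` are illustrative only.

```python
# Derive the min/max table from the bit widths (two's complement assumed).
# WIDTHS, signed_range and unsigned_range are hypothetical names for this sketch.

WIDTHS = {"byte": 8, "word": 16, "double-word": 32, "quad-word": 64}

def signed_range(bits: int) -> tuple:
    """Smallest and largest two's complement integer of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def unsigned_range(bits: int) -> tuple:
    """Smallest and largest unsigned integer of the given width."""
    return 0, (1 << bits) - 1

for name, bits in WIDTHS.items():
    lo, hi = signed_range(bits)
    _, umax = unsigned_range(bits)
    print(f"{name:12} minint={lo:,} maxint={hi:,} maxunsigned={umax:,}")
```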

Negative numbers are represented in two's complement format, with the sign bit in the most significant position
Floating-point data use the IEEE 754 standard
Bits representing integer values are numbered from 0 in the least significant (rightmost) position to higher values
For example, the most significant bit in a double word is in position indexed 31 (note the unusual word definition on Intel architectures: 2 bytes)
The address range on first-generation Itanium spanned only 17,592,186,044,416 bytes, or 2^44; it grew in its 2nd generation to 56 bits, and is now a full 64 bits long

Bytes are stored in little-endian order by default
It is possible to programmatically select little- or big-endian order by setting the be bit in the user mask, a special status register
That be bit (for big-endian) does not affect how instructions are stored or fetched from memory
Object code is always represented in little-endian order; programmer-selected endianness only impacts data
In little-endian order, the least significant data byte is stored at the lowest address; conversely for big-endian order

Data quad-word 0x1102030455060708 is stored in 8 adjacent bytes in memory in little-endian order:
addr:   0     1     2     3     4     5     6     7
byte:   08x   07x   06x   55x   04x   03x   02x   11x
Same integer value 0x1102030455060708 stored in big-endian order, where the most significant byte (byte7) comes first:
        byte7 byte6 byte5 byte4 byte3 byte2 byte1 byte0
byte:   11x   02x   03x   04x   55x   06x   07x   08x
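The two byte layouts can be reproduced with Python's built-in `int.to_bytes`. The quad-word value used here is the one implied by the byte table above; the helper name `byte_layout` is an assumption of this sketch.

```python
# Show how one 64-bit value is laid out in memory under both byte orders.
# byte_layout is a hypothetical helper; int.to_bytes is standard Python.

VALUE = 0x1102030455060708

def byte_layout(value: int, order: str) -> str:
    """Bytes of an 8-byte value in increasing address order, as hex pairs."""
    return " ".join(f"{b:02x}" for b in value.to_bytes(8, order))

print("little-endian:", byte_layout(VALUE, "little"))
print("big-endian:   ", byte_layout(VALUE, "big"))
```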

Itanium Registers
The Itanium processor has 128 general registers (GR), 128 floating-point registers (FR), 64 single-bit predicate registers (PR), 8 branch registers (BR), and 128 application registers (AR)
In addition, there are Performance Monitor Data registers (PMD), processor identifiers (CPUID), a Current Frame Marker register (CFM), a user mask (UM), and an instruction pointer register (IP)
GRs, BRs, ARs, CPUIDs, IP, and PMDs are 64 bits wide; FRs are 82 bits wide
PRs are 1 bit wide, while the UM holds 6 and the CFM 38 bits; depicted below:

[Figure: Itanium Register File. Panels: GR, FR, PR, BR, AR, IP, CFM, User Mask, CPUID, PMD. Application registers labeled in the figure include KR0 to KR7 (ar0 to ar7), RSC (ar16), BSP (ar17), BSPSTORE (ar18), RNAT (ar19), FCR (ar21), FDR (ar30), CCV (ar32), UNAT (ar36), FPSR (ar40), ITC (ar44), LC, and EC (ar66)]

Itanium Registers GR
The 128 GR registers are the common workhorses during computation
They contain integer values being computed
These integer values can be used as machine addresses; thus GRs can serve as pointers in load and store operations
All machine instructions can refer to these registers, for reading and writing values
In addition to the 64 data bits, each GR has an associated NAT bit, which stands for Not A Thing
NAT is 1 if the associated register has not been initialized with valid data

NATs support speculation
For example, if a speculative load is issued but aborted before the value arrives in its destined GR, the NAT state records that fact
This enables integrity of the machine's exception process
There are 2 groups of GR registers:
The first 32, GR0 through GR31, are visible to all software, and are used to hold globally computed, intermediate values
However, GR0 is read-only, providing the constant 0, 64 bits long

The next 96, GR32 to GR127, are used to implement a small but frequently used portion of the top of the run-time stack; i.e. they work like a special-purpose top-of-stack cache
These stack registers are made available to SW by allocation of a register stack frame, which includes from 0 to 96 registers
Registers not used from this subset are inaccessible to general SW
The stack frame portion implemented via GRs is further partitioned into subsections, one meant to hold local registers, the other output registers, i.e. results of the current function call

[Figure: Sample Stack Frame, Generic. Labels: sp, Locals + Temps, Stack Marker, Stack Frame, bp, Actual Parameters]

Itanium Predicate Registers PR
Execution of most IPF instructions can be predicated by one of the PRs
Value 1 in the PR means: the operation can be completed normally
PR value 0 means the result will not be posted (AKA not committed), even if it has already been computed; i.e. there will be no stores and no impact on any AR of the machine
An exception, i.e. an instruction that cannot be predicated, is the loop operation

The PRs are also partitioned into 2 sections:
PR0 through PR15 are static PRs
The other 48 are so-called rotating PRs
PR0 is an exceptional register: it can only be read, and its value is always 1, meaning the predicate is true; thus PR0 denotes unconditional execution
The remaining 48 PRs are used to hold stage predicates, used during software pipelining

Branch Registers BR
IPF instructions are grouped in bundles, which are 16-byte aligned byte sequences holding executable code
Hence their rightmost 4 address bits will always be 0 due to alignment; these 4 address bits don't need to be stored explicitly
Execution of an indirect branch requires an explicit operand
On the Itanium architecture this operand is a branch register; a branch register BR holds the branch destination
The machine then loads the value of the referenced BR into the IP register and execution continues from there; IP stands for Instruction Pointer
Executing branch-related instructions is the way to directly affect the value in the instruction pointer, the register that holds the address of the next bundle to be executed

Current Frame Marker Register CFM
Note: Frame Marker is often referred to in the literature as Stack Frame; its fixed portion as Stack Marker
Each function has a specific stack frame associated with it, which is created at function invocation; it is cleared at function return
If all relevant data of a function's stack frame do fit, they are placed in the stack of general registers; else the overflowing data must reside in memory
Either way, the current frame marker (CFM) holds the frame marker for the function that is currently active
Generally, most functions have small stack frames

Layout of the CFM register, bit 37 on the left, bit 0 on the right:
rrb.pr | rrb.fr | rrb.gr | sor | sol | sof

Meaning of bits in CFM:
Name     Bit field   Meaning
sof      0..6        Total size of stack frame
sol      7..13       Size of local part of stack frame, in words
sor      14..17      Size of rotating portion of stack frame; the number of rotating registers is 8 times the sor value
rrb.gr   18..24      Register rename base for GRs
rrb.fr   25..31      Register rename base for FRs
rrb.pr   32..37      Register rename base for PRs
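To make the field layout concrete, here is a small sketch that packs and unpacks a 38-bit CFM value. The bit positions used (sof 0..6, sol 7..13, sor 14..17, rrb.gr 18..24, rrb.fr 25..31, rrb.pr 32..37) are taken from the Itanium architecture definition rather than from the slide text, and the names `CFM_FIELDS`, `encode_cfm`, and `decode_cfm` are hypothetical.

```python
# Pack and unpack the fields of a Current Frame Marker value.
# Field positions are an assumption based on the architecture definition;
# CFM_FIELDS, encode_cfm and decode_cfm are names invented for this sketch.

CFM_FIELDS = {            # name: (lowest bit, width in bits)
    "sof": (0, 7),
    "sol": (7, 7),
    "sor": (14, 4),
    "rrb_gr": (18, 7),
    "rrb_fr": (25, 7),
    "rrb_pr": (32, 6),
}

def encode_cfm(**fields: int) -> int:
    """Build a CFM value from named fields; unnamed fields default to 0."""
    cfm = 0
    for name, value in fields.items():
        low, width = CFM_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        cfm |= value << low
    return cfm

def decode_cfm(cfm: int) -> dict:
    """Split a CFM value into its named fields."""
    return {name: (cfm >> low) & ((1 << width) - 1)
            for name, (low, width) in CFM_FIELDS.items()}

# A frame with 4 local and 1 output register: sof = 5, sol = 4, no rotation.
frame = encode_cfm(sof=5, sol=4)
print(decode_cfm(frame))
print("rotating registers:", 8 * decode_cfm(frame)["sor"])
```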

Application Registers AR
Application registers, t.b.d.:
Register      Mnemonic     Description of register
ar0 to ar7    KR0 to KR7   Kernel registers
ar8 to ar15                Reserved
ar16          t.b.d.

Instruction Pointer IP
IPF instructions are fetched in units of bundles, which are chunks of 16 bytes, or 128 bits
Bundles are stored bundle-aligned
The ip can address 18,446,744,073,709,551,616 different bytes (but only at bundle addresses)
The rightmost 4 bits of the ip thus will always be zero, due to bundle alignment
Hence these 4 bits don't need to be stored on the microprocessor

Performance Monitor Data Register
These are architecture-provided resources that record the use of hardware modules
Contents are read-only for SW
Contrary to performance monitor registers on Intel Pentium architectures, they are user-visible on Itanium!

Itanium ISA Instruction Set Architecture

Instruction Set Architecture ISA
Parallelism, Dependences, and Groups
Itanium instructions packaged in groups can execute in parallel; this allows fast execution, if HW is available!
The assembly programmer or compiler may craft groups as large as desired; the performance consequence is: all operations embedded in a single group can be executed simultaneously, in parallel, saving time over the equivalent sequential execution
The physical silicon angle of this is: of all operations that could be executed in parallel, only those are actually performed in parallel for which HW resources exist
E.g. on an Itanium 2 implementation of IPF, there are 6 units available to operate in parallel

If fewer actions are enclosed in a group, some HW will idle
If more actions could be included in a group, then all HW elements are active, yet some degree of possible parallelism will be lost; future HW implementations may execute that same object code faster due to the higher degree of parallelism
Parallel execution is not feasible if dependencies exist between instructions
On Itanium these dependencies are not resolved by the machine
It is the human programmer or optimizer that explicitly tracks what can be done in parallel, and what must be done in sequence. The machine just runs it; goal: BE FAST!

If a result has to be computed first before it can be read somewhere else (memory or register), a true dependence exists, AKA data dependence; it is conventional to simply say dependence
On Itanium we call this a RAW (Read after Write) dependence
If a result has to be read first before it can be re-computed, a false dependence is created, AKA anti-dependence
On Itanium this is named WAR (Write after Read) dependence
If a result has to be computed first before it can be computed again, assuming that an intermediate reference is possible, an output dependence is created
Itanium calls this third dependence WAW (Write after Write) dependence

In all these cases, the prior operation has to complete before the dependent one can be started; e.g.:
    ld8 r14 = [r3]        -- load GR14 w. 8 bytes addr. by GR3
    add r15 = r14, r16    -- integer sum into GR15, RAW dep
This is an example of RAW dependence, AKA true dependence
The loading of an 8-byte value into (8-byte) register GR14 must complete first, before the addition of the 2 long integer values, held in GR14 and GR16, can be started
Note the assembler register names: r14, and not gr14
This is Intel and HP assembly language convention! Another assembler may use different conventions
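The three dependence kinds can be detected mechanically from the registers each instruction reads and writes. The following is a minimal sketch of such a check, the kind of bookkeeping a programmer or compiler has to do before placing two instructions into the same group; the `Instr` type and the `dependences` function are hypothetical names.

```python
# Classify RAW / WAR / WAW dependences between an earlier and a later instruction.
# Instr and dependences are names invented for this sketch.
from typing import NamedTuple

class Instr(NamedTuple):
    text: str
    writes: frozenset   # registers written by the instruction
    reads: frozenset    # registers read by the instruction

def dependences(first: Instr, second: Instr) -> set:
    """Return the dependence kinds that force `second` to wait for `first`."""
    kinds = set()
    if first.writes & second.reads:
        kinds.add("RAW")   # true dependence: second reads what first writes
    if first.reads & second.writes:
        kinds.add("WAR")   # anti-dependence: second overwrites what first reads
    if first.writes & second.writes:
        kinds.add("WAW")   # output dependence: both write the same register
    return kinds

load = Instr("ld8 r14 = [r3]", frozenset({"r14"}), frozenset({"r3"}))
add = Instr("add r15 = r14, r16", frozenset({"r15"}), frozenset({"r14", "r16"}))

print(dependences(load, add))   # the slide's example: a RAW dependence
```

An empty result means the pair could share a group as far as register dependences are concerned; memory dependences are not modeled here.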

Instruction Set Architecture ISA
Assembly Language Format
Format of an Itanium assembler instruction:
In meta-syntax, [ and ] brackets mean that the bracketed portion of the instruction is optional
In assembly syntax, square bracket pairs [] express indirection
Be careful not to get confused by the 2 different contexts!
    [(pr)] mnemonic[.comp] dest = src1 [, src2 [, src3 ] ]
Meaning of the various assembly language fields:

Syntax     Name                 Meaning
(pr)       Predicate register   Used to predicate execution; if the value is 0, the result is not committed; if true, the result is committed. pr0 is always 1, hence the associated instructions are executed unconditionally
mnemonic   Instruction          Name of the instruction, to tell the assembler which operation to perform
comp       Completer            Further qualifies or completes the instruction specification. There may be multiple completers per instruction; not all instructions have a completer
dest       Destination          Is the destination of the specified instruction. Choices are: register or memory
src1       Source one           Source operand. Not all instructions require a source. Some instructions allow multiple sources. Sources may be immediate operands or registers. Memory can be a source via indirection (through a register)
src2       Source two           Ditto
src3       Source three         Ditto

A sample assembly language instruction is shown next:
    (p0) add r5 = r4, r3, 1    // (p0) can be skipped
This is an integer add instruction that sums up the integer values in GR4 and GR3, and also adds integer literal 1
It assigns the sum to register GR5. Since the predicate register used is PR0, which is always true, the commit of the sum to register GR5 is unconditional, as if no predicate qualifier had been given
Predicate registers, when listed, are enclosed in ( ) parentheses
Not all instructions allow or need a completer. Typical completers are shown below
Some instructions allow multiple completers, notably the memory access instructions and branch instructions

Completer   Meaning
.a          For advanced load; check later if successful
.c          Check
.clr        If advanced load was not successful, clear the reg
.nc         No clear
.s          Speculative; e.g. for load; NOT allowed for store!
.many       t.b.d.
.few        t.b.d.
.excl       t.b.d.
Many more: .equ, .unc, etc.

Itanium Bundle Format
Executable code on Itanium comes in units of bundles. A bundle consists of 3 instructions, all grouped with an associated template
The template completes the instruction specification and, above all, defines group boundaries
A boundary is also known as a stop. A stop defines where one group ends and another group starts
If no stop is included in a template, the bundle will be part of a larger group, consisting of more instructions in the next bundle

Each instruction is 41 bits long; a template consumes 5 bits, one template per bundle
With 3 instructions per bundle, the overall bundle length is 3 * 41 + 5 = 128 bits, fitting into 16 bytes; all are bundle-aligned, easily accomplished due to the first bundle residing on a mod-16 memory boundary
From then on all will be aligned on 16-byte boundaries
With the memory bus being 128 bits wide (or wider on future IPF implementations) and bundles being bundle-aligned, fetching instruction memory is fast, requiring one single transfer on the bus

General layout of a bundle is shown next, with bits ordered from 0 through 127, increasing right to left:
bits 127..87     bits 86..46      bits 45..5       bits 4..0
instruction 2  | instruction 1  | instruction 0  | template
The template serves as a means for the compiler to communicate additional information about instructions 0, 1, and 2, without which they could be ambiguous
One such key piece of information is the placement of an instruction group stop, written ;; in assembler
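The bundle layout can be expressed as plain bit packing. The sketch below assembles a 128-bit bundle from a 5-bit template and three 41-bit slots and takes it apart again; the slot contents are arbitrary bit patterns, not real instruction encodings, and `pack_bundle` and `unpack_bundle` are hypothetical names.

```python
# Pack a 5-bit template and three 41-bit instruction slots into one 128-bit bundle.
# pack_bundle and unpack_bundle are names invented for this sketch; the slot
# values below are arbitrary patterns, not valid Itanium instructions.

TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1

def pack_bundle(template: int, slot0: int, slot1: int, slot2: int) -> int:
    """Template in bits 4..0, slot 0 in 45..5, slot 1 in 86..46, slot 2 in 127..87."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError(f"slot {index} must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle

def unpack_bundle(bundle: int) -> tuple:
    """Return (template, slot0, slot1, slot2) of a 128-bit bundle."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple((bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
                  for i in range(3))
    return (template, *slots)

bundle = pack_bundle(0x01, 0x123, 0x456, 0x789)
print(f"{bundle:032x}")          # 32 hex digits = 128 bits = 16 bytes
print(unpack_bundle(bundle))
```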

A group stop can occur after instruction 2, or 1, or 0, indicating an earlier group must complete execution before another starts
But the Itanium templates allow at most 2 stops in a bundle
If 3 stops are needed, a NOOP must be packed into one of the instructions, to effectively create 2 physical groups, with the third being the NOOP, whose execution order does not matter
Compiler-generated code performs this workaround automatically

The template specifies which types of instructions are assembled into slot 0, 1, and 2
IPF instructions are partitioned into the following 6 groups:
Type    Meaning
A       ALU: integer or memory unit
I       Non-ALU: integer unit
M       Memory unit
F       Floating-point unit
B       Branch unit
L + X   Extended unit, or Branch unit

Providing such information in the template speeds up instruction decoding, improving execution speed
A list with the Instruction Set Architecture (ISA) templates and embedded stops is shown next
Note at most 2 stops in any of the formats
On an architecture that aims to have large groups, it seems logical to have few stops (max 2) per bundle

Template #   Type     Slot 0           Slot 1                Slot 2
0 = 0x00     MII      Memory unit      Integer unit          Integer unit
1 = 0x01     MII_     Memory unit      Integer unit          Integer unit ;;
2 = 0x02     MI_I     Memory unit      Integer unit ;;       Integer unit
3 = 0x03     MI_I_    Memory unit      Integer unit ;;       Integer unit ;;
4 = 0x04     MLX      Memory unit      L unit?               Extended unit
5 = 0x05     MLX_     Memory unit      L unit?               Extended unit ;;
6 = 0x06     reserved
7 = 0x07     reserved
8 = 0x08     MMI      Memory unit      Memory unit           Integer unit
9 = 0x09     MMI_     Memory unit      Memory unit           Integer unit ;;
10 = 0x0a    M_MI     Memory unit ;;   Memory unit           Integer unit
11 = 0x0b    M_MI_    Memory unit ;;   Memory unit           Integer unit ;;
12 = 0x0c    MFI      Memory unit      Floating-point unit   Integer unit
13 = 0x0d    MFI_     Memory unit      Floating-point unit   Integer unit ;;
14 = 0x0e    MMF      Memory unit      Memory unit           Floating-point unit
15 = 0x0f    MMF_     Memory unit      Memory unit           Floating-point unit ;;
16 = 0x10    MIB      Memory unit      Integer unit          Branch unit
17 = 0x11    MIB_     Memory unit      Integer unit          Branch unit ;;
18 = 0x12    MBB      Memory unit      Branch unit           Branch unit
19 = 0x13    MBB_     Memory unit      Branch unit           Branch unit ;;
20 = 0x14    reserved
21 = 0x15    reserved
22 = 0x16    BBB      Branch unit      Branch unit           Branch unit
23 = 0x17    BBB_     Branch unit      Branch unit           Branch unit ;;
24 = 0x18    MMB      Memory unit      Memory unit           Branch unit
25 = 0x19    MMB_     Memory unit      Memory unit           Branch unit ;;
26 = 0x1a    reserved
27 = 0x1b    reserved
28 = 0x1c    MFB      Memory unit      Floating-point unit   Branch unit
29 = 0x1d    MFB_     Memory unit      Floating-point unit   Branch unit ;;
30 = 0x1e    reserved
31 = 0x1f    reserved

The difference between templates 0x00 and 0x01 above, both being of type MII, is: after instruction 2 in template 0x01 there is a stop, while in template 0x00 there is none
In other words, the next bundle after the one for template 0x00 will belong to the same group, and a higher degree of parallelism will be possible there
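The template table lends itself to a lookup structure. In the sketch below, template names follow the table's notation, with an underscore marking a stop after the preceding slot; `TEMPLATES` and `parse_template` are hypothetical names, and the decoder is an illustration, not how hardware decodes templates.

```python
# Look up a template number and report unit types and stop positions.
# TEMPLATES mirrors the table above; parse_template is an invented helper.

TEMPLATES = {
    0x00: "MII",  0x01: "MII_",  0x02: "MI_I", 0x03: "MI_I_",
    0x04: "MLX",  0x05: "MLX_",
    0x08: "MMI",  0x09: "MMI_",  0x0a: "M_MI", 0x0b: "M_MI_",
    0x0c: "MFI",  0x0d: "MFI_",  0x0e: "MMF",  0x0f: "MMF_",
    0x10: "MIB",  0x11: "MIB_",  0x12: "MBB",  0x13: "MBB_",
    0x16: "BBB",  0x17: "BBB_",  0x18: "MMB",  0x19: "MMB_",
    0x1c: "MFB",  0x1d: "MFB_",
}

def parse_template(name: str) -> list:
    """Return one (unit_type, stop_after) pair per slot."""
    slots = []
    for ch in name:
        if ch == "_":
            unit, _ = slots[-1]
            slots[-1] = (unit, True)    # stop follows the previous slot
        else:
            slots.append((ch, False))
    return slots

def describe(template: int) -> list:
    """Decode a template number; reserved encodings raise an error."""
    if template not in TEMPLATES:
        raise ValueError(f"template 0x{template:02x} is reserved")
    return parse_template(TEMPLATES[template])

print(describe(0x00))   # MII, no stop: the group continues into the next bundle
print(describe(0x03))   # MI_I_: stops after slot 1 and after slot 2
```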

Itanium Assembly Code
A group is a sequence of 1 or more instructions delimited by a stop. The first instruction in a whole program is thought to be preceded by a stop
Similarly, the last instruction of a complete program is thought to be followed by a stop
All instructions placed into a single group can be executed in parallel. Whether or not they will depends on the number of hardware resources available. In the initial Itanium implementation only 6 resources were available
In a later implementation, more HW resources may become available, thus potentially speeding up execution of the same old, unchanged Itanium code on a future generation
The ;; indicates to the assembler where one group ends and thus the next group starts
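How stops partition an instruction stream into groups can be sketched with a few lines of text processing. This is a simplified illustration that only looks for a trailing ;; and for // comments; the function name `split_groups` is made up, and a real assembler does far more.

```python
# Split assembler source lines into instruction groups at each ;; stop.
# split_groups is a hypothetical helper, not part of any real assembler.

def split_groups(lines: list) -> list:
    """Return a list of groups, each a list of instruction strings."""
    groups, current = [], []
    for line in lines:
        text = line.split("//")[0].strip()   # drop trailing comment
        stop = text.endswith(";;")
        if stop:
            text = text[:-2].strip()
        if text:
            current.append(text)
        if stop and current:
            groups.append(current)           # a stop closes the current group
            current = []
    if current:
        groups.append(current)               # implicit stop at end of program
    return groups

source = [
    "mov r33 = b0            // save branch register",
    "add r15 = r14, r16 ;;",
    "ld8 r36 = [r36]",
    "mov r32 = r1",
    "br.call.sptk.many b0 = printf# ;;",
]
for number, group in enumerate(split_groups(source)):
    print(number, group)
```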

Some assembly language instructions follow:
    cmp.eq p1, p2 = r33, r34
This checks general purpose registers 33 and 34 for equality; if equal, predicate register 1 is set to true, predicate register 2 to false. Otherwise p1 is set to false and p2 to true
A more complicated case is:
    (p3) cmp.eq.unc p1, p2 = r33, r34
This checks if predicate register 3 is true at the start. If so, and if registers GR33 and GR34 are equal, register p1 is set to true and p2 to false, else the reverse
Else, i.e. if p3 is false a priori, predicate registers 1 and 2 are both set to false
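The behavior of the two compare forms can be modeled with a small simulation. This is a sketch of the semantics as described above, with the predicate file held in a Python list; `cmp_eq` and its parameter names are hypothetical.

```python
# Model the effect of cmp.eq and cmp.eq.unc on the predicate registers.
# cmp_eq and its parameters are names invented for this sketch.

def cmp_eq(pr: list, p_true: int, p_false: int, a: int, b: int,
           qp: int = 0, unc: bool = False) -> None:
    """(qp) cmp.eq[.unc] p_true, p_false = a, b, updating pr in place."""
    if pr[qp]:
        equal = (a == b)
        pr[p_true] = int(equal)
        pr[p_false] = int(not equal)
    elif unc:
        # qualifying predicate false with .unc: both targets are cleared
        pr[p_true] = 0
        pr[p_false] = 0
    # qualifying predicate false without .unc: targets stay unchanged

pr = [0] * 64
pr[0] = 1                       # p0 always reads as 1

cmp_eq(pr, 1, 2, 7, 7)          # cmp.eq p1, p2 = r33, r34 with equal values
print(pr[1], pr[2])

pr[3] = 0
cmp_eq(pr, 1, 2, 7, 7, qp=3, unc=True)   # (p3) cmp.eq.unc with p3 false
print(pr[1], pr[2])
```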

Assembler Source With & Without Stack Unwind Operations
From ref [8]

Assembler for Hello World, With
// hello_world.c assembly with unwind directive
// sample taken from ref [8]
// page 1/3
        .file   "hello.c"
        .pred.safe_across_calls p1-p5, p16-p63
        .section .rdata, "a", "progbits"
        .align  8
.STRING1: stringz "Hello World!!!\n"
        .text
        .align  16
        .global hello#
        .proc   hello#
hello:
        .prologue
        .save   ar.pfs, r34

// hello_world.c assembly with unwind directive
// sample taken from ref [8]
// page 2/3
        alloc   r34 = ar.pfs, 0, 4, 1, 0
        .vframe r35
        mov     r35 = r12
        .save   rp, r33
        mov     r33 = b0        // load branch register into GR33
        .body
        add r36 gp ;;
        ld8     r36 = [r36]
        mov     r32 = r1
        br.call.sptk.many b0 = printf#    // b0!
        ;;

// hello_world.c assembly with unwind directive
// sample taken from ref [8]
// page 3/3
        mov     r1 = r32
        mov     ar.pfs = r34
        mov     b0 = r33        // restore branch register
        .restore sp
        mov     r12 = r35
        br.ret.sptk.many b0
        .endp   hello#
        .global printf#
        .type

Assembler for Hello World, Without
// hello_world.c assembly without unwind directive
// sample taken from ref [8]
// page 1/3
// The string is defined in the read only data section
        .section .rdata, "a", "progbits"
        .align  8
.STRING1: stringz "Hello World!!!\n"
// definition of function hello is in text section
// Registers to be saved in local registers:
// gp = r1  - loc0 = r32
// rp = b0  - loc1 = r33
// ar.pfs   - loc2 = r34
// sp = r12 - loc3 = r35

// hello_world.c assembly without unwind directive
// sample taken from ref [8]
// page 2/3
        .text
        .global hello
        .proc   hello
hello:
        alloc   loc2 = ar.pfs, 0, 4, 1, 0
        mov     loc3 = sp
        mov     loc1 = b0       // save branch register b0
        add out0 gp ;;
        ld8     out0 = [out0]   // group of 3 instructions
        mov     loc0 = gp
        br.call.sptk.many b0 = printf ;;

// hello_world.c assembly without unwind directive
// sample taken from ref [8]
// page 3/3
        mov     gp = loc0
        mov     ar.pfs = loc2
        mov     b0 = loc1
        mov     sp = loc3
        br.ret.sptk.many b0
        .endp   hello
        .global printf
        .type

Bibliography
1. Triebel, Walter: IA-64 Architecture for Software Developers, Intel Press 2000, 308 pages
attachment_ciid=c2d2e0aecd2b7110vgnvcm d6e10rc RD&ciid=ce1fd701521c7110VgnVCM d6e10RCRD
7. Donald Knuth: Interview with Donald Knuth
8. Intel Itanium Architecture Assembly Reference Guide, 2002, Intel order number , at

Definitions

Branch Elimination
Replacing object code that has conditional branches with code that has a straightforward execution path, lacking branches
The second version, with branches eliminated, must be semantically equivalent to the original code with branches
Everything else equal, the version without branches generally executes faster due to fewer cache misses

Bundle
Group of 3 instructions plus a template, which all fit into a 16-byte long, 16-byte aligned section of instruction memory on Itanium
Total number of bits = 3 * 41 + 5 = 128

Conditional Move
Move instruction that transfers bits from source to destination, but only if an associated condition is true
Otherwise the instruction operates like a noop
Such a move can serve as a special case of branch elimination. For example, the C source construct:
    if ( a > 0 ) x = 99;         -- HL source program
could be mapped into the conditional move:
    cmov x, #99, a, #0, gt       -- hypothetical asm
which has no branches. Source operand #99 is moved into memory location x only if the > condition holds between operand a and integer literal 0
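The idea of selecting a value without branching can be imitated with bit masks. The sketch below is a Python illustration of the hypothetical cmov above, not a model of any real instruction; `cmov` and `assign_if_positive` are invented names.

```python
# Branch-free selection in the spirit of the hypothetical cmov instruction.
# cmov and assign_if_positive are names made up for this sketch.

def cmov(dest: int, src: int, condition: bool) -> int:
    """Return src if condition holds, else the unchanged dest, without an if."""
    mask = -int(condition)            # all ones when true, zero when false
    return (src & mask) | (dest & ~mask)

def assign_if_positive(a: int, x: int) -> int:
    """Equivalent of: if ( a > 0 ) x = 99;"""
    return cmov(x, 99, a > 0)

print(assign_if_positive(5, 1))    # condition holds: x becomes 99
print(assign_if_positive(-5, 1))   # condition fails: x keeps its value
```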

Endian, Endianness
A convention that defines in which order the higher-valued bytes of a multi-byte data object are addressed
Can be programmed on Itanium with the be bit
If the byte at the higher address holds the more significant part of the value, we call this little-endian; typical on Intel x86 architecture
The other way around we call big-endian ordering; typical on IBM 370 architecture

EPIC
Explicitly Parallel Instruction Computing, with IPF being the first commercial architecture that implements EPIC
Note IPF's ability to also execute old Intel x86 and old HP PA object code

Epilogue
When the steady state of a software pipelined loop completes, there may be yet-to-be-used operands and operations to be computed that would not fit into the steady state
These last operands must be consumed, some even be generated during the epilogue, and ultimately the pipeline must be drained
This is accomplished in the object code after the steady state, and that portion of code is called the epilogue
See also prologue

Group
A sequence of instructions, each with an associated template and a defined stop
A group is composed of one bundle or more
The stop means the hardware cannot start executing any subsequent group until the current group has completed
Syntax notation for stop in Itanium assembler is the double-semicolon ;;

Parallel Comparison
A composite source program condition of the form:
    ( ( a > b ) && ( c <= d ) )
requires multiple steps to compute a boolean predicate
Generally, on a sequential architecture these multiple steps are combined via explicit instructions for anding and oring, or else the flow of control of execution selects a matching true label. All this takes time
The Itanium processor allows parallel evaluation of certain composite Boolean expressions in one single step
The result can be used as a predicate in subsequent instructions. Notice that such combined Boolean expressions must be side-effect free
This is not equivalent to C's short-circuit evaluation of complex boolean expressions!

For example, another complex boolean expression
    ( fun( j, k ) && ( i < MAX ) )
cannot be mapped into a parallel EPIC comparison
The reason: one operand is a function call fun( j, k ), with a possibly large number of parameters, which may have a side effect on one of the other operands, for example on i, which is yet to be compared
This type of boolean expression is mapped into sequential code

Predication
Is the association of a boolean condition with the execution of an instruction sequence. This allows the following:
Two instruction streams can be executed in parallel, clearly requiring multiple hardware modules; provided on EPIC
Both streams have a predicate associated with their operations. Only the stream with the true predicate is actually retired; the other will be aborted and ignored
Abort can happen as soon as the predicate is known. This means the computation of the predicate can proceed in parallel with the execution of the two code streams, but must complete by the time these 2 code streams wait to learn which one is the winner
An ISA with predication requires bits for the predicates to use, and for which direction (true? or false?) to select
Also, the discarded code path must not have caused any side effect, such as a write to memory!

Prologue
Before a software pipelined loop body can be initiated, hardware resources (e.g. registers) must be initialized; we say the loop must be primed
This is accomplished in the object code before the steady state, called the prologue
See also epilogue

Register File
The IPF has a rich set of registers
This includes 128 general purpose registers (for integer operations), 128 floating-point, 64 predicate, 8 branch, and 128 so-called application registers
Also a variety of special purpose registers is visible; visible means accessible by the assembly language program
Includes a user mask, stack marker (frame marker), ip, processor id, and performance monitoring registers

Speculation
If it is suspected, but not sure, that operand o will be used in the future, and this operand is not readily available (not yet in a high-speed register), and it takes long to fetch o, a processor may initiate the fetch well before it is actually used
Advantage: by the time o is needed, it is already available without delay
Disadvantage: if the flow of control never reaches the place where o was thought to be needed, then the speculative fetch was superfluous
It may still be meaningful, if a) no side effects occurred that are harmful to program correctness, and b) the hardware resource required to fetch o was idle anyway; then no loss!

Steady State
The software pipelined object code executed repeatedly, after the prologue has been initiated and before the epilogue becomes active, is called the steady state
Each iteration of the steady state makes some progress toward multiple iterations of the original source loop
See also prologue and epilogue

Syllable
Is the instruction-only portion of a bundle
A bundle always holds 3 instructions plus a template, the template specifying additional necessary information about an instruction
The instruction alone, without the needed template information, is a syllable
