Advanced Computer Architecture: s1/2005

1 Advanced Computer Architecture: s1/2005 Project Presentation. David Mirabito. Handling branches through context forking

2 Currently: beq Hypothetical instruction stream (operands removed)

3 Currently: beq What now? The operands of this branch won't be fetched and compared, and the result won't be known, until the end of the EX stage... 3 more cycles!

4 Currently: beq Currently handled by large, complex, power-consuming branch prediction logic. In test-printf this is found to be 91.9% accurate for direction prediction, and 90.3% accurate with target address.

5 Currently: beq Assume: Branch SHOULD be taken. Predicted correctly. At this stage (3 cycles later) we can be sure we predicted correctly.

6 Currently: beq Assume: Branch SHOULD be taken. Predicted correctly. At this stage (3 cycles later) we can be sure we predicted correctly. BUT, 10% of the time...

7 Currently: beq Assume: Branch SHOULD be taken. Predicted incorrectly.

8 Currently: 0x118 0x11c 0x120 beq ori sub sw Assume: Branch SHOULD be taken. Predicted incorrectly. Here we realise we were wrong. Have to nullify incorrect insts and start again. The number of nullified instructions will only increase as fetch, dispatch and execute widths grow. On this simplescalar model, this can be up to 4 instructions per cycle: over 3 cycles, 12 potential instructions wasted.

9 Currently: 0x118 0x11c 0x120 beq ori sub sw Assume: Branch SHOULD be taken. Predicted incorrectly. Here we realise we were wrong. Have to nullify incorrect insts and start again. The number of nullified instructions will only increase as fetch, dispatch and execute widths grow. On this simplescalar model, this can be up to 4 instructions per cycle: over 3 cycles, 12 potential instructions wasted.

10 Currently: 0x118 0x11c 0x120 beq ori sub sw Assume: Branch SHOULD be taken. Predicted incorrectly. These represent wasted fetch bandwidth and computation cycles, and instigate fetching unneeded data/insts from system memory. 8.1% x branches committed = mispredicted branches = wasted cycles = 4.9% of execution time.
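The arithmetic behind the last three slides can be sketched as follows. This is a rough model, not the author's calculation: it assumes every misprediction costs exactly the branch-resolution latency, and the function names and the inputs in the usage lines (1000 branches, 6000 cycles) are hypothetical values chosen only to show the shape of the computation.

```python
def wasted_slots(fetch_width: int, resolve_latency: int) -> int:
    """Instruction slots squashed by a single misprediction."""
    return fetch_width * resolve_latency


def wasted_cycle_fraction(branches: int, mispredict_rate: float,
                          resolve_latency: int, total_cycles: int) -> float:
    """Fraction of all cycles spent on wrong-path work.

    Assumes (as a simplification) that each mispredicted branch costs
    exactly resolve_latency cycles.
    """
    mispredicted = branches * mispredict_rate
    return mispredicted * resolve_latency / total_cycles


if __name__ == "__main__":
    # 4-wide fetch and a 3-cycle wait for the branch result, as on the slides.
    print(wasted_slots(fetch_width=4, resolve_latency=3))
    # Hypothetical branch and cycle counts, not measurements from test-printf.
    frac = wasted_cycle_fraction(branches=1000, mispredict_rate=0.081,
                                 resolve_latency=3, total_cycles=6000)
    print(f"{frac:.2%}")
```

With the slide's figures (4 instructions per cycle, 3 cycles), the first function reproduces the 12 wasted instruction slots.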

11 Elsewhere... Instruction streams from 2 independent threads. Lookahead window, split 50/50 for each thread. Multiple execute units in a superscalar arch. 0x118 0x11c 0x120 beq ori sub sw 0x400 0x404 0x408 0x40c 0x410 0x414 0x418 0x41c 0x420 sw sdd mov sll addi lui subu sub sb HyperThreading allows two threads to be run concurrently, with one using the execution units that the other doesn't need. The backend of the CPU is similar; instructions only need to write back to the correct register file.

12 Combining the two... Initially, things proceed as normal.

13 Combining the two... beq Initially, things proceed as normal. Until a branch is hit, in which case the single stream becomes two logical threads, one following each path of execution (taken / not taken) 0x118 ori add

14 Combining the two... beq Now, the fetch bandwidth is shared between each of the new 'forked contexts' (2 insts/cycle each, instead of 4). Beyond the frontend things remain similar, as in HT. Only we must ensure instructions only retire to the appropriate context 0x118 0x11c ori sub add

15 Combining the two... beq At this stage, the result of the comparison is made known. 0x118 0x11c 0x120 ori sub sw add

16 Combining the two... beq At this stage, the result of the comparison is made known. We can now take the correct context and merge any changes to its register file / memory back with the parent context
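The fork-then-merge sequence on slides 12 to 16 can be sketched in a few lines. This is a toy model under stated assumptions, not the sim-outorder.c implementation: the `Context` class, `fork` and `merge` are hypothetical names, only registers are modelled (memory is ignored), and the taken-path address 0x200 in the usage lines is made up because the slides do not give it.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """One speculative path: where it fetches from and what it has written."""
    pc: int
    writes: dict = field(default_factory=dict)  # register name -> speculative value

    def read(self, parent_regs: dict, reg: str) -> int:
        # Prefer this context's own speculative value, else fall back to the parent.
        return self.writes.get(reg, parent_regs[reg])

    def write(self, reg: str, value: int) -> None:
        # Speculative writes stay private until the branch resolves.
        self.writes[reg] = value


def fork(fallthrough_pc: int, target_pc: int) -> tuple:
    """Split into (not-taken, taken) child contexts at a branch."""
    return Context(fallthrough_pc), Context(target_pc)


def merge(parent_regs: dict, not_taken: Context, taken: Context,
          branch_taken: bool) -> Context:
    """Commit the correct child's writes to the parent and drop the other child."""
    winner = taken if branch_taken else not_taken
    parent_regs.update(winner.writes)
    return winner


if __name__ == "__main__":
    regs = {"r1": 5, "r2": 5, "r3": 0}
    nt, tk = fork(fallthrough_pc=0x118, target_pc=0x200)  # 0x200 is hypothetical
    nt.write("r3", nt.read(regs, "r1") | 1)                     # not-taken path: ori
    tk.write("r3", tk.read(regs, "r1") + tk.read(regs, "r2"))   # taken path: add
    winner = merge(regs, nt, tk, branch_taken=(regs["r1"] == regs["r2"]))
    print(hex(winner.pc), regs)
```

Keeping each child's writes in a private overlay means the losing path can be discarded without touching the parent's register file, which is the property the slides rely on when the comparison result arrives.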

17 Unfortunately... Implementing this functionality on top of sim-outorder.c within the simplescalar test suite was a much larger undertaking than originally anticipated.
Currently: Can fork context upon a branch instruction and split incoming instructions between these 50/50. When the branch reaches writeback the appropriate context is selected and the modified registers are written back to the parent.
But: Execution does not run to completion. Memory reads/writes across contexts are being corrupted, which leads to an incorrect address being loaded and an attempted read from 0x , crashing the app.
However: This is after 4043 cycles, 3626 instructions, so I will attempt to make what conclusions I can.

18 Stats...
Num branches encountered: 781
% cycles in forked state: 64.3% (2603 / 4043)
avg num insts in context[0]:
avg num insts in context[1]:
% time stalled context[0]:
% time stalled context[1]:
% time stalled context[2]:
avg number of registers / mem locations written back during context:

19 Observations... Some things I noticed whilst stepping through traces:
* This will only ever be worthwhile if we fork only at the times we mis-predict. Perhaps not necessary to do this at every branch.
* Still quite useful during compulsory misses in the branch predictor.
* Can aid performance by prematurely warming the cache for the exit code of a loop. We can brace against the cost of a tlb/cache miss on this code during the 2nd and later iterations of a loop.
* It might be beneficial to take advantage of known compiler quirks: eg: beq r0 r0 XXX should be considered a non-conditional branch and not be forked. It is advantageous that this isn't currently done for J insts.
* It is allowable in the PISA architecture to have 2 adjacent branch insts. Quite often one or both child contexts stall when they too come across a branch and cannot fork. This indicates that more contexts would allow increased performance (and troubles).
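The compiler-quirk observation above could be expressed as a small filter applied before forking. This is only a sketch: `should_fork` and `CONDITIONAL_BRANCHES` are hypothetical names, the opcode set is an illustrative subset rather than the full PISA list, and the `bne` case is an extension of the slide's `beq r0 r0` example (a register always equals itself, so such a `bne` is never taken).

```python
# Illustrative subset of conditional branch opcodes; not the full PISA list.
CONDITIONAL_BRANCHES = {"beq", "bne"}


def should_fork(opcode: str, rs: str, rt: str) -> bool:
    """Return True only for branches whose outcome is genuinely unknown."""
    if opcode not in CONDITIONAL_BRANCHES:
        return False  # jumps and non-branches have a single successor
    if rs == rt:
        # beq rX rX is always taken and bne rX rX is never taken,
        # so there is only one path worth fetching.
        return False
    return True


if __name__ == "__main__":
    print(should_fork("beq", "r0", "r0"))  # the quirk from the slide
    print(should_fork("beq", "r1", "r2"))
    print(should_fork("j", "", ""))
```

A filter like this does not address the first observation (forking only when the predictor is likely wrong); that would need a confidence estimate from the branch predictor.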

20 Wishlist... Other things to implement (in increasing order of need):
* Varying priorities for each context (eg: 25/75), based on the confidence level of the branch predictor.
* Support for more than 1 level of forking, so if a forked context encounters another branch it no longer needs to stall.
* Smarter handling of JAL / JR combinations. Currently can only be done in the root context, to avoid corruption of the return addr stack in the branch predictor.
* Better reporting / accounting.
* Complete program correctness.
Some of these can/will be achieved before the report is due.
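The first wishlist item, a confidence-weighted split of fetch bandwidth, might look something like the sketch below. Nothing here comes from the project code: `split_fetch` is a hypothetical helper, the confidence value is assumed to be a number between 0 and 1 supplied by the predictor, and reserving at least one slot per context is a design assumption made so that neither path starves.

```python
def split_fetch(fetch_width: int, confidence: float) -> tuple:
    """Divide fetch slots between the predicted path and the alternate path.

    Returns (slots_for_predicted_path, slots_for_alternate_path).
    """
    if fetch_width < 2:
        raise ValueError("need at least 2 fetch slots to feed two contexts")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    predicted = round(fetch_width * confidence)
    # Clamp so that each context always receives at least one slot.
    predicted = max(1, min(fetch_width - 1, predicted))
    return predicted, fetch_width - predicted


if __name__ == "__main__":
    print(split_fetch(4, 0.5))   # even split, as in the current 50/50 scheme
    print(split_fetch(4, 0.75))  # favour the predicted path
```

On a 4-wide machine the granularity is coarse: only 1/3, 2/2 and 3/1 splits are possible, so finer priorities would need to be applied over several cycles.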

21 Conclusions...
* In all likelihood, this idea is not worth implementing, considering the cost:benefit ratio.
* Have read other papers doing similar things that concluded the same thing.
* Implementing a new idea and seeing how it affects the program trace++
* Yet still immensely useful as a learning exercise: actually seeing register, control and data dependencies work themselves out in an out-of-order environment perfectly brings home ideas learned in class.
* Also skills involved in working on a large, foreign codebase built upon
