Data-Flow Prescheduling for Large Instruction Windows in Out-of-Order Processors


Pierre Michaud, André Seznec
IRISA/INRIA, Campus de Beaulieu, 35042 Rennes Cedex, France
{pmichaud, seznec}@irisa.fr

Abstract

The performance of out-of-order processors increases with the instruction window size. In conventional processors, the effective instruction window cannot be larger than the issue buffer. Determining which instructions from the issue buffer can be launched to the execution units is a time-critical operation whose complexity increases with the issue buffer size. We propose to relieve the issue stage by reordering instructions before they enter the issue buffer. This study introduces the general principle of data-flow prescheduling. Then we describe a possible implementation. Our preliminary results show that data-flow prescheduling makes it possible to enlarge the effective instruction window while keeping the issue buffer small.

1. Introduction

Processor performance is strongly correlated with the clock cycle. Shorter clock cycles have been allowed by both improvements in silicon technology and careful processor design. As a consequence of this evolution, the IPC (average number of instructions committed per clock cycle) of future processors may decrease rather than increase [1]. This IPC decay comes from the dispersion of instruction latencies. In particular, load latencies in CPU cycles tend to increase across technology generations, while ALU operation latency remains one cycle. A solution for overcoming the IPC decay is to enlarge the processor instruction window [6, 15], both physically (issue buffer, physical registers...) and logically, through better branch prediction accuracy or branches removed by predication. However, the instruction window should be enlarged without impairing the clock cycle. In particular, the issue buffer and issue logic are among the most serious obstacles to enlarging the physical instruction window [11].
In this paper, we study the addition of a preschedule stage before the issue stage to combine the benefits of a large instruction window and a short clock cycle. We introduce data-flow prescheduling: instructions are sent to the issue buffer in predicted data-flow order instead of sequential order, allowing a smaller issue buffer. The rationale of this proposal is to avoid using entries in the issue buffer for instructions whose operands are known to be not yet available. In our proposal, this reordering of instructions is accomplished through an array of schedule lines. Each schedule line corresponds to a different depth in the data-flow graph. The depth of each instruction in the data-flow graph is determined, and the instruction is inserted in the corresponding schedule line. Lines are consumed by the issue buffer sequentially. Section 2 briefly describes issue buffers and discusses related work. Section 3 describes our processor model and experimental set-up. Section 4 presents the general principle of prescheduling and introduces data-flow prescheduling. Section 5 describes a possible implementation for data-flow prescheduling. Section 6 analyses the efficiency of the proposed implementation based on experimental results. Finally, Section 7 gives some directions for future research.

2. Background and related work

The issue buffer is the hardware structure materializing the instruction window. Instructions wait in the issue buffer until they are ready to be launched to the execution units. Unlike the reorder buffer [14], instructions can be removed from the issue buffer soon after issuing, to make room for new instructions. The two main phases of instruction issue are the wake-up phase and the selection phase [11]. The wake-up phase determines which instructions have their data dependencies resolved. The selection phase resolves resource conflicts and determines which instructions can effectively issue. The delay of the wake-up and selection phases increases with the issue buffer size [11], which makes a large issue buffer hardly compatible with a short clock cycle.
In some processors like the Alpha 21264 [8], the issue buffer is collapsible in order to maintain instructions in sequential order and facilitate the insertion of new instructions. Maintaining the sequential order allows the selection logic to give priority to older instructions. In currently available processors, separate issue buffers are implemented for integer and floating-point instructions, typically two to four times smaller than the reorder buffer (in number of instructions). The integer issue buffer typically does not exceed 20 entries in current processors (20-entry integer queue in the Alpha 21264, 18-entry integer scheduler in the AMD Athlon, 20-entry reservation station in the Intel P6...). Both micro-architectural and circuit-level solutions have been proposed for enabling the use of a large instruction window. In [11, 12], it was proposed to distribute the issue logic among multiple clusters of execution units. This solution trades global communications for fast local communications. The trace processor [13] is an example of such a proposition. A characteristic of these propositions is that the instruction window size is proportional to the number of execution units. A circuit-level approach was proposed recently for tackling the window size problem specifically [5]: the reorder buffer and the issue buffer are merged, and parallel-prefix circuits are used for the wake-up and selection phases. The idea of prescheduling is not new. A dependence-based prescheduler was proposed in [11], which tries to form chains of dependent instructions in a set of FIFOs. This is further discussed in Section 4.1. An idea close to ours was proposed in [3], but with a different implementation.

Note on the issue buffer size. In some processors, instructions may have to be re-issued. For example, on the Alpha 21264 [7], when a load is predicted to hit in the data cache but actually misses, two issue cycles are annulled and the issue buffer state is restored. This requires that instructions remain valid in the issue buffer for a few cycles after they have been issued. These instructions constitute an invisible part of the issue buffer, whose size depends on the issue width and on the number of pipeline stages between the issue stage and the execution stage.
All issue buffer sizes reported in this study are for the visible part of the issue buffer.

3. Processor model and experimental set-up

The processor simulated in this paper is an out-of-order superscalar processor. The two processor configurations simulated, ideal and 8-way, are described in Tables 1 and 2 respectively. The branch predictor simulated is a 3x16k-entry e-gskew predictor [10]. The size of the reorder buffer, i.e., the number of physical registers, was fixed large enough so that it does not interfere with our study. Branch misprediction recovery is performed as soon as a mispredicted branch is executed. The cache latencies reported in Tables 1 and 2 are futuristic values anticipating smaller feature sizes [1].

Table 1. Ideal configuration.
instruction cache: perfect
branch predictor: 3x16k-entry e-gskew, global history: 14 branches
fetch bandwidth: unlimited
front-end stages: 1 (fetch/decode)
reorder buffer: 4096 instructions
issue buffer: variable, N entries
issue width: N
execution units: N universal
back-end stages: pipelined: issue, X execute, retire
min latencies: most int: X=1 cycle; (int) mul: X=7; div: X=20; load: X=1+2 (addr, cache); store: X=1+1 (addr, forward)
data cache: perfect
memory dependency predictor: perfect

Table 2. 8-way configuration; parameters not specified are identical to the ideal configuration.
fetch bandwidth: 8 instructions
front-end stages: 1 (+)
issue width: 8
execution units: 8 universal
L1 data cache: 8 Kbytes, direct mapped, 64-byte lines, unlimited bandwidth
L2 data cache: perfect, 15-cycle latency
store set predictor: SSIT: 16k entries (tagged); LFST: 128 entries (tagged)

We assume the issue buffer is distinct from the reorder buffer. It schedules all the instructions, except those that are executed in the pipeline front-end, like unconditional jumps. The issue buffer is collapsible. When instructions are competing for the issue bandwidth, instructions that entered the issue buffer first are given priority.
As we focus our attention on the visible part of the issue buffer, we did not simulate the impact of pipeline stages between the issue stage and the execution stage, which is a distinct problem.

Figure 1. The prescheduler sends instructions to the issue buffer in the order defined by data dependencies (fetch & decode, in sequential order; pre-scheduler; issue buffer, in data-flow order; execution).

Load/store dependencies. When considering large instruction windows, we must pay attention to dependencies between loads and stores. Previous studies have shown that memory dependencies can be predicted with high accuracy using past behavior. The memory dependency predictor used in this study for the 8-way configuration is the store set predictor [4]. The Store Set Identifier Table (SSIT) is a 16k-entry tagged table (4-way set-associative). When a load misses in the SSIT, it is predicted to carry a dependency with no in-flight store. A load is predicted to be dependent on the store encountered most recently in its store set. The dependency is enforced by the issue logic: the load will issue after the store, so that it can catch the correct value. As recommended in [4], dependencies are enforced between stores belonging to the same store set in order to reduce the number of memory order violations. In our simulations, the number of memory order violations did not exceed 2% of the number of branch mispredictions.

Benchmarks. All simulations are trace-driven simulations using the IBS traces [16]. The eight traces reflect the execution of sequential applications on a MIPS-based workstation, including system activity. With the L1 data cache simulated in the 8-way configuration, there is an average 5% cache miss ratio on our benchmarks, nroff having the lowest (2%) and verilog and video_play the highest (7-8%). With the branch predictor simulated, the average number of instructions between consecutive branch mispredictions lies between 1200 (real_gcc) and 3500 (video_play), and between 2000 and 2500 for the other benchmarks.

4. Prescheduling

In today's processors, instructions are pushed into the issue buffer in sequential order; therefore instructions depending on a long dependency chain occupy the issue buffer for a long time. All the issue buffer entries are checked on every cycle. This process is time consuming, and the delay increases with the number of entries in the issue buffer. The general idea behind prescheduling is to allow only instructions which are likely to become fireable in the very next cycles to enter the issue buffer.
Information on data dependencies and instruction latencies is known before the issue stage and can be used for prescheduling. The principle of prescheduling is depicted in Figure 1. Instead of being sent to the issue buffer in sequential order, instructions are reordered by a prescheduler so that they enter the issue buffer in data-flow order, i.e., the order of execution assuming unlimited execution resources, taking into account only data dependencies. The instructions wait in a preschedule buffer until they can enter the issue buffer. If the predicted data-flow order is close enough to an optimal issue order, then the issue buffer can be very small, as it is relieved from the task of buffering instructions not yet fireable. In fact, the issue buffer size should be closer to the issue width than to the effective instruction window size. The job of the hardware prescheduler is somewhat similar to that of a compiler scheduling instructions within basic blocks. However, a hardware prescheduler works on large traces of several tens or hundreds of instructions discovered at run time and whose length is not known a priori.

Deadlocks. To prevent deadlocks, the prescheduler must ensure that if instruction B is dependent on instruction A, A enters the issue buffer before B.

4.1. Dependence-based prescheduling

The dependence-based prescheduler presented in [11] is an example of a prescheduling scheme. The preschedule buffer consists of several FIFOs. The issue buffer is the set of all FIFO heads, hence the issue buffer size is equal to the number of FIFOs. The prescheduling logic forms chains of dependent instructions in FIFOs: an instruction is steered to a FIFO such that it depends on the last instruction in the FIFO. If it is not possible to append an instruction to an existing chain, the instruction is steered to an empty FIFO. When this is not possible, the steering logic stalls until one FIFO gets empty. We verified that, experimentally, a dependence-based prescheduler with N FIFOs is roughly equivalent to an issue buffer of N instructions.
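As an illustration, the FIFO steering heuristic described above can be sketched in a few lines of software. This is our own minimal model, not the hardware of [11]; the tuple encoding of instructions, the register names and the FIFO count are illustrative assumptions.

```python
# Dependence-based prescheduling sketch: steer each instruction to a
# FIFO whose tail instruction produces one of its source operands;
# otherwise use an empty FIFO; otherwise stall.
# An instruction is modeled as (dest_regs, source_regs).

def steer(fifos, instr):
    """Return the index of the FIFO chosen for instr, or None on a stall."""
    dests, sources = instr
    # 1. Try to append to an existing chain: the last instruction of
    #    the FIFO must produce one of our source operands.
    for i, fifo in enumerate(fifos):
        if fifo:
            last_dests, _ = fifo[-1]
            if set(last_dests) & set(sources):
                fifo.append(instr)
                return i
    # 2. Otherwise, steer to an empty FIFO.
    for i, fifo in enumerate(fifos):
        if not fifo:
            fifo.append(instr)
            return i
    # 3. No empty FIFO: the steering logic stalls.
    return None

fifos = [[] for _ in range(4)]
steer(fifos, (["r2"], ["r1"]))   # load r2 <- (r1): goes to empty FIFO 0
steer(fifos, (["r2"], ["r2"]))   # add r2 <- r2, 1: chains on FIFO 0
steer(fifos, (["r1"], ["r1"]))   # add r1 <- r1, 1: goes to empty FIFO 1
```

Note how the second add cannot chain on FIFO 0 and consumes a new FIFO, which is exactly the kind of chain forking discussed next.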
A first limitation comes from the complexity of the data-flow structure of programs. There are many very short chains ending on a branch or a store, some chains are merged because of dyadic instructions, and several chains are forked when the same register value is used several times. There is another limitation: the optimal distribution into FIFOs would require enqueuing instructions out of program order and taking instruction latencies into account. Trying to find the optimal distribution on simple examples convinced us that this is a hard problem, and that it would be difficult to improve on the published heuristic. To overcome these limitations, the data-flow prescheduler proposed in this paper takes a different approach. First,

it defines a global data-flow order instead of a partial one. Second, it takes into account instruction latencies, in particular load latencies.

Figure 2. Data-flow prescheduling example: two loop iterations (instructions A-F and A1-F1: load, add, store, add, sub, bltz) placed into schedule lines 1 to 10, with the active line feeding the issue buffer; schedule_line = max({source_use_line}, active_line) and use_line = schedule_line + execution_latency.

4.2. Data-flow prescheduling

Ideally, one would like to send instructions to the issue buffer only when they become fireable. We try to approach this ideal through real hardware. First, we assume unlimited execution resources. The depth of each instruction in the data-flow graph is computed, taking into account data dependencies and instruction latencies (for simplicity, we assume all loads hit in the L1 cache). The data-flow depth of an instruction corresponds to its ideal issue cycle, assuming unlimited execution resources. The reordering of instructions is done through a preschedule buffer implemented as an array of schedule lines. Each schedule line is associated with an issue cycle. An instruction is inserted in the schedule line corresponding to its ideal issue cycle. The issue buffer consumes the lines sequentially. Hence, assuming unlimited execution resources and perfect prescheduling, instructions spend a single cycle in the issue buffer, and instructions in the same line are issued simultaneously. The principle of data-flow prescheduling is illustrated on an example in Figure 2. The active line is the line which is currently feeding the issue buffer. The schedule line number of an instruction is always higher than the current active line number.
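The sequential prescheduling rule of Figure 2 can be modeled in a few lines of software. This is a sketch under assumed latencies (two-cycle loads, one-cycle ALU operations); the register names in the loop body are illustrative, not the paper's exact example.

```python
# Sequential data-flow prescheduling:
#   schedule_line = max({source_use_line}, active_line)
#   use_line     = schedule_line + execution_latency

def preschedule(instrs, active_line=0):
    """instrs: list of (dest, sources, latency). Returns schedule lines."""
    use_line = {}    # register -> line where its value is usable
    lines = []
    for dest, sources, latency in instrs:
        sched = max([use_line.get(r, active_line) for r in sources]
                    + [active_line])
        if dest is not None:
            use_line[dest] = sched + latency
        lines.append(sched)
    return lines

# A six-instruction loop body in the spirit of Figure 2.
loop = [("r2", ["r1"], 2),        # load r2 <- (r1), 2-cycle latency
        ("r2", ["r2"], 1),        # add  r2 <- r2, 1
        (None, ["r2", "r1"], 1),  # store r2, (r1)
        ("r1", ["r1"], 1),        # add  r1 <- r1, 1
        ("r4", ["r1"], 1),        # sub
        (None, ["r4"], 1)]        # bltz r4, loop
print(preschedule(loop))          # prints [0, 2, 3, 0, 1, 2]
```

Note how the address increment (schedule line 0) is pulled ahead of the store (line 3), even though it comes later in sequential order: that is precisely the reordering the prescheduler performs.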
The schedule line is determined with the following sequential prescheduling algorithm:

schedule_line = max({source_use_line}, active_line)
use_line = schedule_line + execution_latency

For each instruction, we define the use line as the line where its result is available as a source operand for dependent instructions. Real hardware implementations will require further trade-offs, as shown in the next section.

5. A possible implementation for data-flow prescheduling

5.1. The preschedule buffer

The preschedule buffer is an array of schedule lines. Each line is associated with a line counter indicating how many instructions are currently stored in the line. We define the line width as the maximum line counter value, that is, the maximum number of instructions that we allow in the same line. The line counter is incremented each time an instruction is written into the line. If the line counter value is already equal to the line width, this is a line overflow. In each cycle, as slots are freed in the issue buffer, instructions are taken from the current active line to fill these slots. Once all the instructions in the current active line have been consumed, the active line number is incremented. We assume the active line number is incremented at most once per cycle, and only after the current active line is totally consumed. Note that the active line number keeps increasing monotonically. In practice, however, the number of physical lines, which we define as the preschedule window, is limited. The active line, schedule line and use line numbers manipulated are virtual line numbers, which we map onto physical lines circularly. When the current active line is consumed, the physical line is recycled and its line counter is reset. In this study, we have chosen the following policies for preschedule window overflows and line overflows:

Preschedule window overflow.
If the schedule line of an instruction is greater than or equal to the sum of the active line and the preschedule window, prescheduling is blocked, waiting for the active line to proceed and physical lines to be recycled.

Line overflow. Similarly, if the targeted schedule line is full, prescheduling is blocked, waiting for the active line to proceed. The schedule line of the blocked instruction is simply recomputed with the new active line number, as many times as necessary, until the instruction can be written into the preschedule buffer.

Note on the preschedule buffer implementation. In this study, it is implicitly assumed that the preschedule buffer is implemented as a direct-mapped two-dimensional array: one dimension is the line number, and the other dimension is the line counter value. We did not focus on optimizing the size of the preschedule buffer, as access to the preschedule buffer can be pipelined without impairing the performance excessively. It should be noted that this is not the only possible implementation. In particular, it may be interesting to introduce some associativity by using line numbers and/or line counter values as tags.
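The buffer organization and both overflow policies above can be sketched as follows. The sizes are illustrative, and the drain/retry machinery is abstracted into a boolean return value; this is a behavioral model, not the direct-mapped array itself.

```python
# Preschedule buffer sketch: virtual lines mapped circularly onto a
# fixed number of physical lines, a per-line counter bounded by the
# line width, and blocking on line or preschedule-window overflow.

class PrescheduleBuffer:
    def __init__(self, n_physical_lines=8, line_width=4):
        self.window = n_physical_lines        # preschedule window
        self.width = line_width               # max instructions per line
        self.count = [0] * n_physical_lines   # line counters
        self.active = 0                       # virtual active line number

    def try_insert(self, schedule_line):
        """Insert on a virtual line (>= active); False means blocked."""
        if schedule_line >= self.active + self.window:
            return False                      # preschedule window overflow
        phys = schedule_line % self.window    # circular virtual->physical map
        if self.count[phys] == self.width:
            return False                      # line overflow
        self.count[phys] += 1
        return True

    def consume_active_line(self):
        """Active line fully consumed: recycle the physical line."""
        self.count[self.active % self.window] = 0
        self.active += 1

buf = PrescheduleBuffer(n_physical_lines=2, line_width=1)
assert buf.try_insert(0)          # fills line 0
assert not buf.try_insert(0)      # line overflow: blocked
assert not buf.try_insert(2)      # window overflow: blocked
buf.consume_active_line()         # line 0 recycled, active line -> 1
assert buf.try_insert(2)          # line 2 now falls within the window
```

When `try_insert` returns False, a real prescheduler would stall and recompute the schedule line with the new active line number, as described above.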

5.2. Schedule line computation

This section describes the hardware support used for implementing the prescheduling algorithm introduced in Section 4.2.

5.2.1 Register dependencies

The register use line numbers are stored in a Register Use Line Table (RULT) similar to a register rename table. Each RULT entry is associated with a logical register. For each instruction, we must 1. read the RULT entries corresponding to its source register operands, 2. compute the schedule line as the maximum of the current active line number and the two source registers' use line numbers, 3. add the instruction execution latency to the schedule line number to determine the destination register use line.

5.2.2 Load/store dependencies

We slightly modified the store set predictor to be able to preschedule a load on a line after all the stores in its store set. When a store set ID (SSID) is obtained from the SSIT for a load or a store, this SSID is used to index the Last Fetched Store Table (LFST) [4]. Each LFST entry holds the inum of the most recent store in that store set. The inum is used to enforce load-store and store-store dependencies. We modified the LFST entry by adding a field indicating the store set maximum use line number (SSMUL). After prescheduling a store, we compare the schedule line number of the store with the SSMUL obtained from the LFST (in case there was a hit). Then we take the maximum of the two line numbers, and we write the result in the LFST entry. We assume the instruction set architecture does not allow indexed addressing, so that loads have a single register dependency (e.g., MIPS, Alpha). When prescheduling a load, the first use line number is a register use line read from the RULT, and the second use line number is the SSMUL read from the LFST, so that the load is scheduled on a line after all the stores in its store set.
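A minimal sketch of the RULT and SSMUL bookkeeping described above, under assumptions: tables are plain dictionaries rather than sized hardware structures, latencies are invented, and we take the SSMUL to hold a use line (schedule line plus store latency), which is one reading of the definition in the text.

```python
# Schedule line computation sketch: a Register Use Line Table (RULT)
# for register dependencies, and a per-store-set maximum use line
# (SSMUL, kept in the LFST entry) for load/store dependencies.

active_line = 0
rult = {}      # logical register -> use line number
ssmul = {}     # store set ID -> max use line of the stores in that set
STORE_LATENCY = 1

def preschedule_store(src_reg, ssid):
    sched = max(rult.get(src_reg, active_line), active_line)
    # update the store set maximum use line in the LFST entry:
    # max of the old SSMUL and this store's use line
    ssmul[ssid] = max(ssmul.get(ssid, active_line), sched + STORE_LATENCY)
    return sched

def preschedule_load(addr_reg, dest_reg, ssid, load_latency=2):
    # one register use line (no indexed addressing) plus the SSMUL,
    # so the load lands on a line after all the stores in its set
    sched = max(rult.get(addr_reg, active_line),
                ssmul.get(ssid, active_line),
                active_line)
    rult[dest_reg] = sched + load_latency
    return sched

rult["r1"] = 1                 # r1 becomes usable at line 1
preschedule_store("r1", 7)     # store scheduled at line 1, SSMUL[7] = 2
preschedule_load("r3", "r2", 7)  # load waits for the store set: line 2
```

The dictionary updates mirror steps 1 to 3 of Section 5.2.1: read the source use lines, take the max with the active line, then add the latency to produce the destination use line.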
Although store is forced to be dependent on previous stores in its store set, the SSMUL is not used for prescheduling stores becuse stores re dydic instructions nd this would require n extr input in the schedule line computtion. 5..3 The preschedule pipeline stge Dt-flow prescheduling requires to dd few extr pipeline stges. In prticulr, preschedule stge is necessry for computing the schedule line numbers. This preschedule b c Figure 3. Prescheduling opertor computing mx(, b, c)+x stge is criticl for performnce, s prescheduling is bsiclly sequentil tsk. Nevertheless, we show how it cn be prllelized. The bsic opertion involved in dt-flow prescheduling computes mx(, b, c) +x, with x being smll constnt vlue depending on the instruction opcode. Figure 3 shows possible implementtion. For shortening the dely, the +x opertion cn be performed in prllel with comprisons, s shown on Figure 3. First, we show tht the prescheduling logic my operte on smll virtul line numbers, typiclly 1-bit wide. Therefore, the opertor depicted on Figure 3 should hve propgtion time shorter thn full 6-bit ALU. Second, we show tht dependent mx(, b, c)+x opertions cn be chined without incresing the circuit depth. Mximum virtul line number. In Section 5.1, we did not consider the limittion of virtul line numbers. In prctice however, virtul line numbers re coded with limited number of bits. When computing the use line of the result of n instruction, if the mximum virtul line number is exceeded, prescheduling is blocked until the processor instruction window gets completely drined. Then, the ctive line number nd ll RULT nd LFST entries re reset to, nd prescheduling resumes from the blocked instruction. This method ensures tht the content of the RULT is lwys coherent, which is importnt for voiding dedlocks. We found tht virtul line numbers cn be coded with 1 bits with no significnt performnce loss. 
For example, with 12 bits, if 4 instructions are issued per cycle, a control-flow break is necessary every 16k instructions, which is an order of magnitude larger than the average distance between branch mispredictions. Though this solution is simple, other solutions are possible for keeping the RULT content coherent. For example, we could invalidate a RULT entry when the last instruction which wrote to it leaves the preschedule buffer (this would require changing the definition of the max operation in the

schedule line computation).

Figure 4. Parallel computing of the use lines u_n and schedule lines s_n of a group of instructions: RULT and input multiplexers (with bypass from the previous group's u1..u4), one max + x operator per instruction, an output multiplexer selecting s_n among o_n and the u_m, and a "verify dyadic" check.

Figure 5. Example of parallel prescheduling for the group: load r2 <- (r1); load r3 <- (r7); add r3 <- r2, r3; store r3, (r7). Loads and stores have a latency of 2 cycles. We neglect source r3 of the add and source r7 of the store.

Parallel prescheduling. The preschedule stage is principally constituted of N max(a, b, c) + x operators, N being the pipeline width. Previous pipeline stages participate in the preschedule task, performing intra-group dependency analysis and determining instruction latencies in order to set the inputs of all max(a, b, c) + x operators. However, this preliminary work is not the core of the problem (the analysis of intra-group dependencies is also necessary for register renaming). The main issue is to perform several chained max(a, b, c) + x operations in the same cycle. We present here a possible solution to break dependency chains and allow the implementation of data-flow prescheduling. Figure 4 shows the circuit for computing the schedule line and use line numbers {s_n} and {u_n} of a group of instructions, based on the operation max(a, b, c) + x. One entry, a, is the active line number. The two other entries, i_n and i'_n, are the source operand use cycles. The settings for the i_n and i'_n sources and the command for the output multiplexer depend on intra-group dependencies determined in previous pipeline stages. If instruction n does not depend on previous instructions in the group, then i_n and i'_n are taken from the RULT or the LFST, and the increment x_n is equal to the instruction latency l_n. The schedule line number s_n, in this case, is read at the output o_n of the max operator.
Now let us suppose that instruction n depends on a previous instruction in the group. If instruction n is monadic and depends on instruction m, then the n-th operator is configured as follows: i_n = i_m, i'_n = i'_m, x_n = x_m + l_n, and s_n = u_m. The difficulty comes from dyadic instructions dependent on previous instructions in the group. We propose to treat dyadic instructions like monadic instructions by neglecting one source operand, that is, predicting which of the two source use line numbers is not the maximum of the three input use line numbers. The not-the-max predictor we simulated is a 2-bit saturating counter stored along with the instruction. The most significant bit of the counter indicates which source operand to neglect. To check the prediction, we verify that the schedule line is greater than or equal to the neglected source use line: we compare the s_n value with the source use line j_n neglected in the computation of s_n. If the prediction is correct, the 2-bit counter is strengthened, else it is weakened. Upon a misprediction, the group is split: the mispredicted dyadic instruction and the following instructions will be prescheduled in the next cycle. An example is given in Figure 5. From our experiments, we found an average of one not-the-max misprediction every 30 instructions. When fetching 8 instructions per cycle, not-the-max mispredictions decrease the fetch rate by 5-10%.

5.3. Deadlocks

Keeping the RULT coherent and stalling upon preschedule window overflow ensures that if an instruction B is register-dependent on an instruction A, then B cannot enter the issue buffer before A. So the data-flow prescheduler described previously cannot experience deadlocks because of register dependencies. Load-store dependencies cannot generate deadlocks: a load is always scheduled on a line after the store it is predicted to depend on. Note that if the prescheduler failed to detect a load-store dependency, the load cannot be blocked in the issue buffer by the store.
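Returning to the not-the-max predictor of Section 5.2.3, its 2-bit saturating counter can be sketched in software. This is an illustrative model, not the circuit, and the exact strengthen/weaken policy shown is one plausible reading of the text.

```python
# Not-the-max predictor sketch: a 2-bit saturating counter (0..3) per
# dyadic instruction; the MSB selects which of the two source use
# lines to neglect in the schedule line computation.

def predict_neglect(counter):
    """Return 0 or 1: the source operand to neglect (MSB of counter)."""
    return counter >> 1

def check(schedule_line, neglected_use_line):
    """Prediction is correct if the neglected use line was not the max."""
    return schedule_line >= neglected_use_line

def update(counter, correct):
    """Strengthen toward the predicted side on a correct check,
    move toward (and eventually flip) the other side on a miss."""
    msb = counter >> 1
    if correct:
        return min(3, counter + 1) if msb else max(0, counter - 1)
    return counter - 1 if msb else counter + 1
```

On a failed check, the caller would split the fetch group and re-preschedule the dyadic instruction the next cycle, as described above.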
Nevertheless, artificial dependencies between stores in the same store set can cause deadlocks, because they are not taken into account in the schedule line computation of a store. However, such deadlocks are very rare. Most of
our simulations experienced no deadlock at all. Only a few simulations experienced deadlocks, but never with less than 1 million instructions per deadlock. Deadlocks can be detected and solved easily: when no instructions have been issued for a certain number of cycles, we release all stores in the issue buffer by clearing their artificial dependencies. These dependencies are not necessary for correct execution; they were introduced only to reduce the number of memory order traps.

6. Experimental evaluation

6.1. Line size trade-off

The line size is an important parameter of the data-flow prescheduler. If the line is chosen too small, prescheduling will stall too often, limiting the effective instruction window. On the other hand, if the line is chosen too large, many wrong-path instructions will enter the issue buffer before correct-path instructions and may delay the correct path. The effective instruction window grows proportionally to the square of the line size. As the line size is increased, prescheduling stalls less frequently, and more instructions can enter the prescheduler. So there is a direct relation between the line size and the effective instruction window. We simulated an ideal configuration, replacing the e-gskew branch predictor with a perfect branch predictor. In these conditions, the instruction fetch rate is limited only by line overflows. In this experiment and all subsequent ones, the preschedule window is fixed to 128 physical lines so that it is not a performance bottleneck. Figure 6 shows the IPC with and without a prescheduler. For the configuration with a prescheduler, we keep the issue buffer size fixed to 32, and we vary the line size. For the configuration with no prescheduler, we vary the issue buffer size. The issue width is kept equal to the issue buffer size. This experiment shows the relation between the line size and the effective instruction window size.
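The square-law relation stated above can be made concrete with a toy model. The calibration constant is our assumption, chosen to match the data point reported for this experiment, where a line size of 16 behaves like a 128-entry issue buffer.

```python
def effective_window(line_size, k=0.5):
    """Toy model: the effective instruction window grows with the square of
    the line size; k = 0.5 is calibrated so that line size 16 -> 128."""
    return k * line_size ** 2

# With this calibration, a 12-instruction line would emulate a window of
# about 72 instructions, in the neighborhood of the values discussed later.
```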
For example, a prescheduler with a line size of 16 instructions gives the same IPC as an issue buffer of 128 instructions (for the instruction latencies simulated, and with unlimited execution resources). We observed that the average number of instructions waiting in the preschedule buffer is roughly proportional to the square of the line size, which is coherent with the square-root law observed in [9]. It should be noted that, in Figure 6, the issue width is larger than the line size. However, when execution resources are limited, a data-flow prescheduler is not exactly equivalent to a large issue buffer, because the data-flow order differs from the optimal issue order. In case of a resource conflict, a large issue buffer should give priority to older instructions. A data-flow prescheduler does not have this degree of freedom. In particular, it is possible for wrong-path instructions to delay the execution of correct-path instructions if the line size is larger than the issue width.

Sampling method. Our simulator is trace driven; it is not able to simulate instructions on the wrong path. However, we have simulated the impact of wrong-path instructions, starting from the observation that, from the point of view of the data-flow structure, it is very hard to distinguish the wrong path from the correct path (otherwise, this would provide a way to detect mispredicted branches). This observation led us to a sampling method, using correct-path instructions to simulate the wrong path. A similar technique was used in [2]. The whole instruction trace is injected into the simulator, as usual, so that its internal structures (branch predictor, cache, store sets, ...) are kept warm. However, we collect statistics only for one slice every 10 on average. We define a slice as a piece of instruction trace delimited by two consecutive branch mispredictions. For simulating the wrong path, we inject into the simulator the correct-path instructions which follow the slice currently sampled.
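A minimal sketch of the slice-sampling procedure above (the data layout and the sampling code are our assumptions; only the ideas come from the text: slices delimited by branch mispredictions, a roughly one-in-ten sampling rate, and a fixed random seed):

```python
import random

def sample_slices(mispredict_positions, rate=0.1, seed=0):
    """mispredict_positions: trace indices of branch mispredictions.
    Returns the (start, end) bounds of the sampled slices; a slice is the
    trace segment between two consecutive mispredictions."""
    rng = random.Random(seed)  # fixed seed: the same slices on every run
    slices = zip(mispredict_positions, mispredict_positions[1:])
    return [bounds for bounds in slices if rng.random() < rate]
```

The correct-path instructions that follow a sampled slice are then replayed as stand-ins for the wrong path.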
The time counter starts counting when the first instruction in the slice is fetched, and the counting stops when the mispredicted branch ending the slice is executed. The sample IPC is the total number of instructions in all slices divided by the time cumulated over all samples. To verify the validity of the method, we have simulated a large issue buffer giving priority to older instructions, so that instructions on the wrong path have no effect. We also ran simulations without sampling so as to obtain the oracle IPC, that is, the IPC obtained when instruction fetching stalls after each mispredicted branch. The differences between the sample IPC and the oracle IPC measured on the IBS benchmarks are within ±2% for 5 of the 8 IBS benchmarks, the three others being 2.4% (gs), +2.7% (sdet) and 3.8% (nroff). It should be noted that our sampler uses a random number generator which is always initialized with the same seed. Hence the sequence of slices that are sampled is fixed for a given benchmark, which makes comparisons safer. In the remainder, the sample IPC is used as the performance metric.

Impact of wrong-path instructions. Figure 7 shows the sample and oracle IPC measured on an ideal configuration as a function of the line size, for an issue width of 4 and 8. To save space, we show only the harmonic mean over all benchmarks. The issue buffer size was fixed to 32, so that it is not a performance bottleneck. The difference between the oracle and sample IPC values quantifies the performance loss associated with potentially issuing wrong-path instructions before correct-path instructions. We observe that when the line size is equal to the issue width, the instructions on the wrong path have no impact on performance, which is coherent. As the line size increases, so does the effective instruction window, and this increases the sample IPC. However, after a certain line size, wrong-path instructions begin to
Figure 6. IPC of an ideal configuration with perfect branch prediction. On the left graph, there is a prescheduler and we vary the line size. On the right graph, there is no prescheduler and we vary the issue buffer size.

Figure 7. Data-flow prescheduling on an ideal configuration. Harmonic mean over all benchmarks of the sample and oracle IPC as a function of the line size, for an issue width of 4 and 8.

Figure 8. IPC on an 8-way configuration with an 8-entry issue buffer with a data-flow prescheduler (12-instruction lines), a dependence-based prescheduler (8 FIFOs), and with no prescheduler.

consume too much issue bandwidth, and the sample IPC falls. With the instruction latencies simulated, the optimal line size is approximately 50% larger than the issue width. For example, for an issue width of 8, we should take a line size of 12. In this case, in our simulations, wrong-path instructions generate a 5% performance loss on average.

6.2. Data-flow prescheduling effectiveness

In this section, we compare three 8-way configurations with the same issue buffer size: one uses a data-flow prescheduler, another uses a dependence-based prescheduler (the issue buffer size is the number of FIFOs), and the last has no prescheduler. For the data-flow prescheduler, the line size is set to 12 instructions (following the conclusion of Section 6.1) and we take into account specific implementation constraints: virtual line numbers are coded on 12 bits, the preschedule stage uses not-the-max predictions, and the pipeline front-end features 13 stages instead of 10 for the other two configurations (i.e., assuming data-flow prescheduling requires 3 extra pipeline stages).

Figure 9. IPC on an 8-way configuration with a 16-entry issue buffer with a data-flow prescheduler (12-instruction lines), a dependence-based prescheduler (16 FIFOs), and with no prescheduler.

Figure 10. Harmonic mean of the IPC over all benchmarks as we vary the issue buffer size.

Figure 11. Harmonic mean of the IPC when the L1 cache is removed and predicted load latencies correspond to an L2 cache access.

Figures 8 and 9 show the IPC for issue buffer sizes 8 and 16, respectively. First, we observe that the data-flow prescheduler, on average, outperforms the dependence-based prescheduler for these issue buffer sizes. With an 8-entry issue buffer, a data-flow prescheduler is on average clearly more performant than a dependence-based prescheduler and 50% more performant than with no prescheduler. With a 16-entry issue buffer, a data-flow prescheduler is still 7% more performant than a dependence-based prescheduler and 33% more performant than with no prescheduler.

Analysis. Figure 10 shows the harmonic mean of the IPC over all benchmarks as we vary the issue buffer size. With a data-flow prescheduler, it is beneficial for the issue buffer to be larger than the line size: the IPC with an issue buffer of 16 is higher than with an issue buffer of 8. The main reason is that the data-flow order is not the optimal issue order because of the limited issue width. An issue buffer larger than the line size gives more opportunities to the issue logic to correct the preschedule order and get closer to an optimal issue order. We can observe that increasing the issue buffer size from 16 to 32 brings on average a slight performance gain with a data-flow prescheduler. Actually, looking at benchmarks individually, this gain is correlated with the frequency of data cache misses. Benchmarks with a high data cache miss rate (e.g., verilog, video_play) benefit from an issue buffer of 32, whereas benchmarks with few cache misses (e.g., nroff) do not.
This is because cache misses degrade the accuracy of the predicted data-flow order. From these curves, it appears that an effective instruction window of 128 instructions is sufficient (actually, real_gcc requires a window of only 64 instructions, because the distance between branch mispredictions is half that of the other benchmarks). Without a prescheduler, we would need a very large issue buffer to implement such a large window. With a dependence-based prescheduler, the difficulty is halved: 16 FIFOs emulate an effective window of about 32 instructions. On the other hand, according to Figure 6, a data-flow prescheduler emulates a window of about 64 instructions with a line size of 12. In practice, part of this potential is consumed by the impact of wrong-path instructions, by the extra pipeline stages, and by cache misses degrading the accuracy of the predicted data-flow order. To better demonstrate the potential of data-flow prescheduling and give hints for future work, we have performed a simple experiment whose results are shown in Figure 11. In this experiment, we remove the L1 data cache so that all loads access the L2 cache directly, and the prescheduler predicts that the load latency corresponds to an L2 cache access. We can observe that this emphasizes the importance of a large issue buffer: a larger instruction window is needed to saturate the execution units. We can also observe that we get the full potential of the data-flow prescheduler with an issue buffer of 16. This is because the data-flow order is now very accurate, as there are no longer any L1 cache misses. More interestingly, the no-prescheduler curve now crosses the data-flow-prescheduler curve at an issue buffer size of 64, despite the impact of wrong-path instructions and the extra pipeline stages. By predicting longer load latencies, we decrease the frequency of line overflows and allow more instructions to enter the prescheduler. In other words, predicting longer latencies enlarges the effective instruction window. Now, with the same 12-instruction line size, we are emulating an effective window larger than 64 instructions.
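The effect described above, where longer predicted load latencies spread dependent instructions over more lines, can be sketched with assumed latencies (the 2-cycle and 8-cycle values below are illustrative, not the paper's):

```python
def chain_schedule_lines(latencies, active_line=0):
    """Schedule lines of a chain of dependent instructions: each instruction
    waits for the result of the previous one."""
    lines, avail = [], active_line
    for lat in latencies:
        lines.append(avail)  # schedule on the line where the input is ready
        avail += lat         # the result is ready 'lat' lines later
    return lines

l1_hit = chain_schedule_lines([2, 1, 1])  # load predicted as an L1 hit (assumed)
l2_hit = chain_schedule_lines([8, 1, 1])  # load predicted as an L2 access (assumed)
```

With the longer prediction, the same chain spans lines 0 to 9 instead of 0 to 3, so each line receives fewer instructions, line overflows become rarer, and the effective window grows.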
7. Conclusion and future work

The issue buffer is one of the critical pipeline stages in modern out-of-order processors. The traversal time of the issue stage increases with the issue buffer size. This may
prevent the implementation of large issue buffers. Data-flow prescheduling makes it possible to reorder instructions dynamically. The goal is to push instructions into the issue buffer in the data-flow order rather than in sequential order. This allows the same IPC to be reached with a smaller issue buffer. The implementation proposed in this study is only a point in the design space. In particular, we did not explore the possibility of introducing associativity in the preschedule buffer. Associativity might be useful for smoothing the prescheduler behavior. This concerns both the utilization of the preschedule buffer space and the definition of line overflows. Prescheduling (or other techniques tackling the same problem) should be viewed as a way to tolerate long instruction latencies. The main situation requiring a large instruction window is when there is not enough instruction parallelism to saturate the execution units. This is often the case in code sections experiencing frequent data cache misses. The data-flow prescheduler we simulated predicts that all load latencies correspond to an L1 data cache hit. However, for applications with frequent cache misses, we would like the prescheduler to predict longer load latencies. As part of future work, it would be interesting to study how the memory hierarchy design could take advantage of the latency tolerance afforded by a prescheduler. In particular, hit-miss prediction techniques [17, 7] should be considered as part of the problem. Future work should also focus on the problem of bypass latencies. In this study, we assumed a centralized instruction window feeding a compact pool of execution units. However, clustered architectures are appearing, with restricted bypass networks. Data-flow prescheduling might also be interesting for those architectures.

References

[1] V. Agarwal, M.S. Hrishikesh, S.W. Keckler, and D. Burger. Clock rate versus IPC: the end of the road for conventional microarchitectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000.
[2] M. Butler and Y. Patt.
An investigation of the performance of various dynamic scheduling techniques. In Proceedings of the 25th International Symposium on Microarchitecture, 1992.
[3] R. Canal and A. González. A low-complexity issue logic. In Proceedings of the 14th International Conference on Supercomputing, 2000.
[4] G. Chrysos and J. Emer. Memory dependence prediction using store sets. In Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998.
[5] D.S. Henry, B.C. Kuszmul, G.H. Loh, and R. Sami. Circuits for wide-window superscalar processors. In Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000.
[6] N.P. Jouppi and P. Ranganathan. The relative importance of memory latency, bandwidth, and branch limits to performance. Workshop on Mixing Logic and DRAM (ISCA '97). http://iram.cs.berkeley.edu/isca97-workshop/.
[7] R.E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, March 1999.
[8] D. Leibholz and R. Razdan. The Alpha 21264: a 500 MHz out-of-order execution microprocessor. In Proceedings of IEEE COMPCON, 1997.
[9] P. Michaud, A. Seznec, and S. Jourdan. Exploring instruction-fetch bandwidth requirement in wide-issue superscalar processors. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 1999.
[10] P. Michaud, A. Seznec, and R. Uhlig. Trading conflict and capacity aliasing in conditional branch predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997.
[11] S. Palacharla, N. Jouppi, and J.E. Smith. Complexity-effective superscalar processors. In Proceedings of the 24th International Symposium on Computer Architecture, 1997.
[12] N. Ranganathan and M. Franklin. An empirical study of decentralized ILP execution models. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, 1998.
[13] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith. Trace processors. In Proceedings of the 30th International Symposium on Microarchitecture, 1997.
[14] J.E. Smith and A.R. Pleszkun.
Implementation of precise interrupts in pipelined processors. In Proceedings of the 12th Annual International Symposium on Computer Architecture, 1985.
[15] S.T. Srinivasan and A.R. Lebeck. Load latency tolerance in dynamically scheduled processors. In Proceedings of the 31st Annual International Symposium on Microarchitecture, 1998.
[16] R. Uhlig, D. Nagle, T. Mudge, S. Sechrest, and J. Emer. Instruction fetching: coping with code bloat. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.
[17] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan. Speculation techniques for improving load related scheduling. In Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999.