Allocating Rotating Registers by Scheduling


Hongbo Rong, Hyunchul Park, Cheng Wang, Youfeng Wu
Programming Systems Lab, Intel Labs

ABSTRACT
A rotating alias register file is a scalable hardware support to detect memory aliases at run-time. It has been shown that it can enable instruction-level parallelism to be effectively exploited from sequential code. Yet it is unknown how to apply it to loops. This paper presents an elegant and efficient solution that allocates rotating alias registers for a software-pipelined schedule of a loop. We show that, surprisingly, this specific register allocation problem can be reduced to another software pipelining problem, for which numerous efficient algorithms are available. This is interesting in both theory and practice. We propose an algorithmic framework to solve the problem. We also present a simple software pipelining algorithm that specially targets register allocation. Comparison with a few other algorithms shows that it usually achieves the best allocation at the least time cost. Finally, we generalize the approach to allocate general-purpose (integer/floating-point/predicate) rotating registers by showing that it is also a software pipelining problem.

Categories and Subject Descriptors
D.3.4 [PROGRAMMING LANGUAGES]: Processors - Compilers, Optimization

General Terms
Algorithms, Experimentation, Languages, Theory

Keywords
Register Allocation, Alias, Scheduling, Software Pipelining

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. MICRO 46, December 7-11, 2013, Davis, CA, USA. Copyright 2013 ACM /13/12...$15.00.

1. INTRODUCTION
Memory disambiguation is a fundamental component in optimizing compilers. It discovers unaliased memory operations, i.e., loads or stores that visit different memory locations. These operations may be scheduled to run out of order for better instruction-level parallelism. Conversely, memory operations that visit the same memory locations are aliased, and cannot be scheduled out of order.

Memory disambiguation is usually done in a compiler by alias analysis. Accurate alias analysis, however, is expensive in terms of compile time. In this paper, we are interested only in memory operations between which the alias relationship is hard to determine by alias analysis. We call them may-alias operations for brevity.

A dynamic compiler optimizes code at the time the code runs. So the compile time is part of the execution time of the code, and has to be short. Under this tight compile-time constraint, a dynamic compiler can perform only simple alias analysis. Besides, a dynamic compiler often works on binary code without high-level source information, and has to be conservative in the analysis. Therefore, for effective memory disambiguation, hardware support is usually needed by dynamic compilers. For may-alias operations, the compiler optimistically assumes they never alias with each other, so as to speculatively schedule them to run out of order; the compiler sets up the hardware to detect aliases, if any, when the optimized code runs. When an alias is detected, an exception is thrown and some recovery code is then triggered to cancel any effects of the failed speculation. Such a speculation failure, of course, is expected to be rare.

There are several kinds of hardware support, including the Advanced Load Address Table (ALAT) in Itanium [1], the static alias register file in Transmeta processors [7], and more recently, the rotating alias register file (previously called alias register queue in [19]), and DeAliaser [2].
This paper targets the rotating alias register file. Compared with ALAT and the static alias register file, it has shown advantages in terms of smaller space requirement for instruction encoding, better scalability, and/or fewer false-positives [19]. Compared with DeAliaser, it allows the checking of a subset of, instead of all, speculative stores, and thus it is more accurate in detecting aliases. A false-positive is an unnecessary discovery of an alias by the hardware, as explained in Section 2. The less-scalable static alias register file can be used as a secondary mechanism, and in some corner cases, a static alias register can be used in place of a rotating alias register. This is called a spill, whose purpose will become clear later in Section 4.3.

The rotating alias register file can be scaled to a large size to allow aggressive memory speculation in a large piece of code. It has enabled acyclic code to be optimized effectively [19]. How to apply it to loops, however, is a new and open problem. This paper presents an efficient solution that applies rotating alias registers to a software-pipelined schedule of a loop.

Software pipelining [1, 13, 18] exploits instruction-level parallelism from a loop by overlapping the execution of successive iterations. This optimization has been studied extensively in the past 3 decades, and is widely acknowledged as one of the most efficient optimizations for wide-issue architectures, benefiting both VLIW [1, 13] and superscalar [18] machines.

So far, software pipelining is seen only in static compilers. However, as dynamic compilers become increasingly important today, it is desirable to extend it to dynamic compilers. The benefit is that software pipelining broadens the optimization scope of a dynamic compiler. Dynamic compilation usually optimizes a small piece of code due to the time constraint at run-time [7]. For a loop, the scope is usually a loop iteration. Software pipelining enlarges the scope to an entire loop, including all its iterations. This potentially permits more aggressive speculation to expose more parallelism.

The problem to be solved can be stated as follows: Given a rotating alias register file and a software-pipelined schedule of a loop, how to allocate the rotating alias registers to the memory operations in the schedule, such that the allocation can detect all aliases between the memory operations when the schedule runs, without any false positive, with the minimal number of spills, and with the minimal number of registers allocated?

This paper makes the following contributions:

1. Problem formulation. It shows that, surprisingly, this specific register allocation problem can be reduced to another software pipelining problem, and therefore, any software pipelining algorithm can be used to solve it.
This is not only interesting in theory, but also useful in practice: it clearly exposes the nature of the problem, and enables the use of the numerous efficient scheduling algorithms. Traditionally, scheduling and register allocation are solved from different perspectives. The techniques are fundamentally dissimilar. Scheduling techniques are based on the dependences between operations. Dependences are directional, e.g., operation a is dependent on operation b, but not vice versa. In contrast, register allocation techniques are based on the interferences between the lifetimes of variables. Interferences are not directional, e.g., lifetime a interferes with lifetime b, and vice versa. However, due to the specific way the rotating alias register file works for alias detection, rotating alias register allocation can be naturally formulated as a scheduling problem, as we will see later. It is also interesting that in this software pipelining problem, the dependence and resource constraints can be unified under a single function.

2. Algorithms. Based on the problem formulation, the paper proposes an algorithmic framework. In this framework, the requirements of alias detection are transformed into dependences. Then based on the dependences, software pipelining is performed. Afterwards, some lifetimes are spilled to static alias registers to address some rare corner cases. Remember the problem has 4 requirements: (1) all aliases can be detected, (2) there is no false positive, (3) spilling is minimized, and (4) register usage is minimized. The framework ensures that the first two requirements are met. Software pipelining is the key to meeting the other two requirements. Any software pipelining approach can be applied to correctly solve the allocation problem, but they may differ in how much spilling and register usage can be minimized.
We propose a simple software pipelining algorithm, called LCP (Local Compaction followed by Pacing), that specially targets this allocation problem with two heuristics, the earliest-start and the max R heuristic.

3. Evaluation. The algorithmic framework, the LCP algorithm, along with a few other software pipelining algorithms, have been implemented in Transmeta Code Morphing Software [7]. We have compared the effectiveness of the algorithms in allocating registers, and shown that LCP usually achieves the best results and runs the fastest. With LCP, most loops are allocated the minimal number of rotating alias registers, and spills are minimized very well with the earliest-start heuristic. We also show that rotating alias registers are important for performance. They enable the maximal instruction-level parallelism to be exploited from loops.

4. Generalization. Now that allocating rotating alias registers is formulated as a software pipelining problem, one cannot help wondering about the generality of this formulation. Is allocating rotating general (integer/floating-point/predicate) registers also a software pipelining problem? The answer is yes. Thus the problem may be solved in a similar way. We show that from this formulation, we can derive the bin-packing approach of Rau et al. [14].

Below we first introduce some background knowledge in Section 2. Then we motivate our rotating alias register allocation approach by an example in Section 3. Subsequently, we generalize the example to a formal solution in Section 4 and show experimental results in Section 5. Then we extend the solution to allocate rotating general registers in Section 6. Finally, we discuss related work and reach a conclusion.

2. BASIC CONCEPTS
In this section, we briefly introduce software pipelining and alias registers.

2.1 Software Pipelining
Software pipelining overlaps the execution of the iterations of a loop under dependence and resource constraints. Modulo scheduling might be the most common approach of software pipelining. In this paper, we focus only on modulo scheduling, and use the two terms, modulo scheduling and software pipelining, interchangeably. We assume a loop body is a hyperblock [11], where branches, if any, have been converted either to predicated code [3], or to asserts [12].

Formally, let a(i) be operation a in iteration i of the loop before software pipelining, and σ(a, i) be its schedule time after software pipelining. A modulo schedule must satisfy the following constraints:

Modulo property: Each iteration of the loop has the same schedule, and successive iterations are initiated at a constant period called the Initiation Interval (II). That is,

$$\sigma(a, i+1) = \sigma(a, i) + II, \quad \forall i. \tag{1}$$

Dependence constraints: The schedule must respect every dependence (a → b, δ, d), where δ is the latency and d is the distance. The dependence is a local dependence if the distance d = 0, and a loop-carried dependence otherwise. By respecting the dependence, the schedule ensures that a(i) starts at least δ time steps earlier than b(i + d). That is,

$$\sigma(a, i) + \delta \le \sigma(b, i+d), \quad \forall i. \tag{2}$$

Resource constraints: No hardware resource is used at the same time by two operations.

[Figure 1: Illustrating alias registers. (a) An alias register is set by operation b, and then checked by operation a. (b) With a rotating alias register file, a set of alias registers are checked. Operations b and d execute, and they set registers $r_1$ and $r_0$, respectively (Steps 1 and 2). When operations a and c execute, they check both registers, starting from $r_0$ (Steps 3 and 4). (c) One way to avoid the false positive in Fig. 1b is using a static register sr. Note that a checks both the static register sr and the rotating register $r_0$ in Step 3. (d) Another way to avoid the false positive in Fig. 1b is to find a better allocation.]

2.2 Alias Registers
Software pipelining may reorder operations. For example, assume there is a pair of may-alias operations, a and b.
They may be from the same iteration or different iterations of the loop. Let us say their sequential execution order is a, b, but after software pipelining, their execution order becomes b, a. Alias registers are used to detect the alias between them, if any, when the code runs. In Fig. 1a, first, operation b runs. It sets an alias register $r_x$, where x is the index of the register. By setting the register, it records the memory range it accesses into the register. Then operation a runs, and it checks $r_x$ to see if its own memory range overlaps with b's. If so, an alias is detected. We say a checks b or a is a checker of b. An operation that only checks other operations but is not checked by any other operation is a pure checker.

Usually, a memory operation sets only one alias register, which is enough to record its own memory range. The index of the alias register it sets is encoded in it directly. But it may need to check multiple alias registers, because it may alias with multiple memory operations. There are two cases: the static alias registers to check are encoded in a bit-mask, where every bit equal to 1 corresponds to a static register; the rotating alias registers to check are specified by one and only one index x, which means to check rotating registers $r_x, r_{x+1}, r_{x+2}, \ldots, r_n$, where $r_n$ is the highest-indexed rotating register. It is important to see that the checking of rotating alias registers is unidirectional, i.e., from a lower-indexed register up toward the highest-indexed register. This range of registers must include all the registers the operation intends to check, but it may also include other registers that the operation does not intend to check, which may trigger false positives.

A static alias register file cannot be too big, due to the limited encoding space to contain a bit-mask in an operation. A rotating alias register file avoids this limitation, but it can cause false-positives. In Fig. 1b, assume 4 operations that may alias with each other whose sequential execution order is a, b, c, d, but whose execution order after software pipelining becomes b, d, a, c. Suppose there are only 2 rotating alias registers. The figure shows step by step how aliases may be detected. For example, operation a has been reordered with both b and d, and thus when it executes, it needs to check the registers set by b and d, i.e., $r_1$ and $r_0$. It does so by specifying only $r_0$, and the hardware will automatically check both registers, starting from $r_0$, due to its unidirectional checking feature (Step 3). Similarly, operation c has been reordered with d, and thus when it executes, it needs to check the register set by d, i.e., $r_0$ (Step 4). However, due to the unidirectional checking feature, the hardware will also check $r_1$, which is set by b. This check is unnecessary, since c and b have not been reordered at all. This unnecessary check can lead to an unnecessary exception, i.e., a false-positive.

There are two ways to avoid this false-positive. One is spilling: let b set a static, instead of a rotating, register sr. Unless it is explicitly specified in the bit-mask of an operation, the static register won't be checked. See Fig. 1c. The other way is to have a better allocation like that in Fig. 1d. One can verify that this allocation does not introduce any false-positive.

The rotating register file is organized as a circular buffer.
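A minimal Python sketch of such a file may help make the unidirectional check and the Fig. 1b false positive concrete. It is an illustration under stated assumptions, not the hardware's actual interface: the class `RotatingAliasFile`, its method names, the memory ranges in `RANGES`, and the concrete Fig. 1d allocation (b in $r_0$, d in $r_1$, c checking from $r_1$) are all hypothetical choices made for this example. The `rotate` method follows the rotation action described in the next paragraph.

```python
class RotatingAliasFile:
    """Toy model of a rotating alias register file (illustrative only)."""

    def __init__(self, size):
        self.size = size
        self.base = 0                    # physical slot holding logical r0
        self.slots = [None] * size       # each slot: (owner, lo, hi) or None

    def _phys(self, index):
        # Map a logical register index to a slot of the circular buffer.
        return (self.base + index) % self.size

    def rotate(self, x):
        """Clean logical r0..r(x-1), then make logical rx the new base."""
        for i in range(x):
            self.slots[self._phys(i)] = None
        self.base = self._phys(x)

    def set(self, index, owner, lo, hi):
        """Record the memory range [lo, hi] accessed by `owner`."""
        self.slots[self._phys(index)] = (owner, lo, hi)

    def check(self, index, lo, hi):
        """Unidirectional check of r_index up to the highest register.

        Returns the owners whose recorded range overlaps [lo, hi].
        """
        hits = []
        for i in range(index, self.size):
            entry = self.slots[self._phys(i)]
            if entry is not None:
                owner, e_lo, e_hi = entry
                if lo <= e_hi and e_lo <= hi:
                    hits.append(owner)
        return hits


# Hypothetical memory ranges: b and c overlap, which is harmless because
# b and c were never reordered with respect to each other.
RANGES = {"a": (0, 7), "b": (100, 107), "c": (100, 107), "d": (200, 207)}


def run_fig1(set_index, check_index):
    """Run the reordered sequence b, d, a, c and collect reported aliases."""
    rf = RotatingAliasFile(2)
    reports = {}
    for op in ["b", "d", "a", "c"]:
        lo, hi = RANGES[op]
        if op in check_index:            # checking happens before setting
            reports[op] = rf.check(check_index[op], lo, hi)
        if op in set_index:
            rf.set(set_index[op], op, lo, hi)
    return reports


# Fig. 1b: b sets r1, d sets r0; a and c both check starting from r0.
fig1b = run_fig1({"b": 1, "d": 0}, {"a": 0, "c": 0})
# Assumed reading of Fig. 1d: b sets r0, d sets r1; c checks from r1 only.
fig1d = run_fig1({"b": 0, "d": 1}, {"a": 0, "c": 1})
```

Under these assumed ranges, `fig1b["c"]` reports b although c only needs to check d, which is the false positive discussed above, while `fig1d` reports nothing for either checker.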

Starting from one of the registers called base, the registers are indexed as 0, 1, ..., n. A rotation action, rotate x, cleans registers 0, 1, ..., x − 1, and sets register x as the new base. Starting from the new base, the registers are re-indexed as 0, 1, ..., n.

A memory operation can specify only one rotating alias register with it. It can specify to check it (which will also check the higher-indexed registers), or set it, or both check and set it. The operation leaves the remaining space to encode the rotation action. In executing the operation, a rotation action, if specified, is performed first; then checking of alias registers is done, if specified; and finally, setting of an alias register is done, if specified. The result of a setting is "sticky": once a register is set, its content does not change until the register is set again by another operation, or cleaned by a rotation action.

2.3 Terminology
Before Section 6, the paper is on alias registers for memory operations. Other kinds of registers and operations are irrelevant to our problem. Thus to be short, from now on until Section 6, by register and operation, we will refer to alias register and memory operation by default, unless stated otherwise.

3. A MOTIVATING EXAMPLE
In this section, we motivate our register allocation approach with an example. It is extremely simplified, but is still sufficient to convey the core information. Fig. 2a shows a loop containing a few operations that may alias with each other, a, b, and c. A software-pipelined schedule for the loop is illustrated in Fig. 2b with the first few iterations. We ignore the irrelevant details of how the schedule was generated. Each iteration has the same schedule. The iterations are initiated at an interval of 3 time steps (II = 3). Note the reordering of the operations after software pipelining: in Fig. 2b, b(i) is scheduled after both c(i) and c(i + 1) for any i; also, a(i) is scheduled after b(i), c(i) and c(i + 1). These are different from the sequential execution order of the original loop in Fig. 2a.
Such reordering happens because the compiler optimistically assumes the operations never alias. However, if they do alias (occasionally during execution of the software-pipelined schedule), the execution results of the schedule would be wrong, and some recovery code must be performed. Rotating alias registers are used to guard the operations for alias detection at execution time.

Each operation produces a lifetime. A rotating register is allocated to the lifetime. When the operation starts execution, it sets the register. At that time, the lifetime starts. The lifetime is live until the register has been checked by all the checkers of the operation. At that time, the lifetime ends. We call the lifetime produced by any operation o lifetime o for short. A pure checker does not set any register, and we assume it produces a lifetime whose length is 0 in each iteration.

All the lifetimes need to be placed into registers in a certain order. For example, let x be the register index of lifetime b(i). In order to detect alias between operations b(i) and c(i), and between operations b(i) and c(i + 1), x must not be higher than the register indices of lifetimes c(i) and c(i + 1). Then when operation b(i) starts execution, it checks registers, starting from $r_x$. The unidirectional checking feature of the hardware will guarantee that the registers of lifetimes c(i) and c(i + 1) are checked, and thus detect any alias with them.

Such ordering requirements can be expressed as dependences between the lifetimes, if we view each register index as a time step. For example, to detect alias between operations b(i) and c(i), we can build a dependence (b → c, 0, 0), which requires that lifetime b(i) starts at least 0 time steps earlier than lifetime c(i) in scheduling terms, i.e., the register index of lifetime b(i) is not higher than the register index of lifetime c(i). Similarly, we can build a dependence (b → c, 0, 1) to detect alias between operations b(i) and c(i + 1). All such ordering requirements compose a dependence graph, as shown in Fig. 2c.
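The construction of these dependences can be sketched in Python as follows. This is an illustration under assumptions rather than the paper's implementation: the function name `build_dependences`, the bound `max_distance`, and the rule that "scheduled after" means a strictly later time step are hypothetical. The start times σ(c, 0) = 0 and σ(b, 0) = 4 are the lifetime start times quoted later in Example 1, while σ(a, 0) = 6 is assumed from the lifetime end times given there. Every pair of operations is assumed to be may-alias.

```python
from itertools import product


def build_dependences(seq_order, sigma, ii, max_distance):
    """Return edges (x, y, latency, distance), one per reordered pair.

    x(i) precedes y(i + d) in sequential order (d >= 0) but runs after it
    in the pipelined schedule, so x must check y, which requires
    register(x, i) <= register(y, i + d).
    """
    edges = []
    for x, y in product(seq_order, repeat=2):
        for d in range(max_distance + 1):
            # Within one iteration (d == 0), x must come first in the body.
            if d == 0 and seq_order.index(x) >= seq_order.index(y):
                continue
            # x(i) runs at sigma[x] + i*II, y(i + d) at sigma[y] + (i + d)*II.
            if sigma[x] > sigma[y] + d * ii:
                edges.append((x, y, 0, d))   # latency is always 0 here
    return edges


# Assumed schedule times of iteration 0 for the loop of Fig. 2 (II = 3).
sigma = {"a": 6, "b": 4, "c": 0}
edges = build_dependences(["a", "b", "c"], sigma, ii=3, max_distance=3)

# An operation with no incoming dependence is never checked: a pure checker.
pure_checkers = [o for o in ["a", "b", "c"] if all(e[1] != o for e in edges)]
```

With these assumed times the sketch yields five edges, three of distance 0 and two of distance 1, and no edge into a, which agrees with the description of the dependence graph that follows.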
Here every node represents a lifetime. Each dependence edge is annotated with a dependence distance. The latency of every dependence is 0 and is not shown. Note that there is no dependence to a, which means operation a is a pure checker and the length of lifetime a(i) is 0 for any i.

Based on the dependence graph, we can schedule the lifetimes to registers, and get the allocation in Fig. 2d. In this diagram, the horizontal axis is time, and the vertical axis is register index. We assume there is an unlimited number of registers, and do not consider rotation: that is a renaming issue that can be addressed afterwards. In the allocation, the bars represent the lifetimes. We have marked the lifetimes with the corresponding operations. Operation a is a pure checker, and its lifetime shares a register with lifetime b in the same loop iteration.

We can make the following observations from the allocation in Fig. 2d:

1. The lifetimes produced by the same operation from successive loop iterations appear along the axis of the registers at a constant period R. For example, lifetimes b in iterations 0, 1, and 2 appear at registers 0, 1, 2, respectively. The period R is equal to 1. If we view each register index as a time step, then these lifetimes are initiated at a constant time interval R. This is analogous to the modulo property of modulo scheduling described in Section 2.1.

2. The allocation respects all the dependences, i.e., the ordering requirements between the lifetimes. This is analogous to the dependence constraints of modulo scheduling described in Section 2.1. Lifetimes a(i) and b(i) are allocated register i, and lifetime c(i) is allocated register i + 1. We can verify that all the ordering requirements are respected. For example, as required by the dependences (b → c, 0, 0) and (b → c, 0, 1), the register index of lifetime b(i), i, is not higher than the register indices of lifetimes c(i) and c(i + 1), which equal i + 1 and i + 2, respectively. To be clearer, we have illustrated the two dependences in Fig. 2d.

3. In the allocation, when two lifetimes are allocated the same register, they cannot overlap in time. If every time step is viewed as a resource, then this means no resource is over-committed. This is analogous to the resource constraints of modulo scheduling described in Section 2.1. For example, register 1 is allocated to lifetimes c(0), b(1), and a(1). We can see c(0) has ended before b(1) starts. Visually, b(1) seems to overlap with a(1) in time. This is not real since a(1) has a length of 0, i.e., it does not consume time at all.

In short, the allocation respects 3 constraints analogous to those of modulo scheduling. It is a modulo schedule of the lifetimes.

[Figure 2: A Motivating Example. (a) A loop: for (i = 0; i < n; i++) { a b c }. (b) A software-pipelined schedule (II = 3). (c) The dependence graph of the lifetimes. (d) A register allocation (R = 1). (e) Register assignment, where a marker denotes a rotate R action and $o_i$ means lifetime o is assigned register $r_i$.]

4. SOLUTION
In this section, we generalize our solution from the motivating example and formally formulate the rotating register allocation problem as a modulo scheduling problem. Based on this formulation, we present an algorithmic framework to solve it, and also propose a software pipelining algorithm, called LCP, specifically for this allocation problem.

4.1 Problem Formulation
Let us recall modulo scheduling. As we introduced before, for every operation o, modulo scheduling schedules o from successive loop iterations to time at a constant period (II) and assigns o resources, respecting all dependence constraints and resource constraints. In register allocation, if we view lifetimes as operations, registers as time, and time as resources, then we can repeat the above statement as follows: for every operation (lifetime) o, the allocation schedules o from successive loop iterations to time (registers) at a constant period (R) and assigns o resources (time steps), respecting all dependence constraints (the ordering requirements of the lifetimes) and resource constraints (two lifetimes in the same register never overlap in time). Therefore, the register allocation is a modulo schedule of the lifetimes.

Formally, let r(a, i) be the rotating register allocated to lifetime a(i). The register allocation respects the following constraints:

Modulo property:

$$r(a, i+1) = r(a, i) + R, \quad \forall i. \tag{3}$$

Here R is a constant to be determined during the allocation process. That is, the lifetimes of an operation from successive loop iterations appear at a constant period in registers.

Dependence constraints: Suppose a(i) and b(i + d), where d ≥ 0, may alias with each other. Suppose a(i) is before b(i + d) in the sequential execution order of the loop, but is after it in the software-pipelined schedule. To make sure any alias between them is detected, we must let

$$r(a, i) \le r(b, i+d), \quad \forall i. \tag{4}$$

That is, lifetime a(i) needs to be placed in the same register as b(i + d) or in a lower-indexed register than b(i + d). This constraint can be modeled by a dependence (a → b, 0, d).

Resource constraints: If two lifetimes are allocated the same register, they cannot overlap in time.

4.2 A Unified Expression of Dependence and Resource Constraints
Interestingly, the dependence and resource constraints in the above formulation can be enforced in a unified way. Let DIST(a, b) be a function returning the set of legal values of r(b, i) − r(a, i). When r(b, i) − r(a, i) equals any value in this set, all the lifetimes of a and all the lifetimes of b are allocated registers without violating any dependence or resource constraints. This function can be used in Step 2 of the algorithmic framework, to be introduced in Section 4.3. Formally,

$$DIST(a, b) = DIST_{dep}(a, b) \cap DIST_{res}(a, b), \tag{5}$$

where $DIST_{dep}(a, b)$ and $DIST_{res}(a, b)$ are the sets of legal values required by dependence and resource constraints, respectively.

To enforce any dependence (a → b, δ, d) (see footnote 1), we need

$$r(a, i) + \delta \le r(b, i+d) \ \text{(Inequality 2)} = r(b, i) + d \cdot R \ \text{(Equation 3)}.$$

Thus

$$\delta - d \cdot R \le r(b, i) - r(a, i).$$

Therefore

$$DIST_{dep}(a, b) = \bigcap_{\text{dependence } (a \to b,\, \delta,\, d)} [\delta - d \cdot R, +\infty) \ \cap \bigcap_{\text{dependence } (b \to a,\, \delta,\, d)} (-\infty, -\delta + d \cdot R]. \tag{6}$$

Footnote 1: In this paper, δ is always 0. Here we use δ to be general.

Now consider resource constraints. There are three cases. We denote the legal sets under them as $DIST_{res}^{i}$, i = 1, 2, 3, respectively.

First, if operation a or b is a pure checker, there are no resource constraints at all. A pure checker's lifetime from any loop iteration has a length of 0, i.e., it does not really consume time. Thus no lifetimes of the two operations can overlap in time. In this case, r(b, i) − r(a, i) can be arbitrary, so

$$DIST_{res}^{1}(a, b) = (-\infty, +\infty). \tag{7}$$

Second, if a(i) and b(i + d), ∀d, are not allocated the same register, there are no resource constraints, either. This is equivalent to

$$r(a, i) \ne r(b, i+d), \ \forall d \ \Rightarrow\ r(b, i) - r(a, i) \ne -d \cdot R, \ \forall d \ \text{(Equation 3)}.$$

In other words, r(b, i) − r(a, i) is not a multiple of R. So

$$DIST_{res}^{2}(a, b) = (-\infty, +\infty) \setminus R \cdot (-\infty, +\infty), \tag{8}$$

where \ is the set difference operation, and $R \cdot (-\infty, +\infty)$ is the set of R's multiples. The equation means r(b, i) − r(a, i) can be any number except a multiple of R.

Third, if a(i) and b(i + d), for some d, are allocated the same register, i.e.,

$$r(b, i) - r(a, i) = -d \cdot R, \ \text{for some } d, \tag{9}$$

then to avoid overlapping, either lifetime a(i) starts after lifetime b(i + d) ends, or lifetime a(i) ends before lifetime b(i + d) starts. That is,

$$start(a) + i \cdot II \ge end(b) + (i + d) \cdot II \quad \text{or} \quad end(a) + i \cdot II \le start(b) + (i + d) \cdot II,$$

where start(o) and end(o) are the start and end time of lifetime o(0), for o = a, b. Therefore,

$$start(a) - end(b) \ge d \cdot II \quad \text{or} \quad end(a) - start(b) \le d \cdot II.$$

Thus

$$-\left\lfloor \frac{start(a) - end(b)}{II} \right\rfloor \cdot R \le -d \cdot R \tag{10}$$

or

$$-\left\lceil \frac{end(a) - start(b)}{II} \right\rceil \cdot R \ge -d \cdot R. \tag{11}$$

Summarizing Formulas (9), (10) and (11), we have

$$DIST_{res}^{3}(a, b) = R \cdot (-\infty, +\infty) \cap \left\{ \left[ -\left\lfloor \frac{start(a) - end(b)}{II} \right\rfloor \cdot R, +\infty \right) \cup \left( -\infty, -\left\lceil \frac{end(a) - start(b)}{II} \right\rceil \cdot R \right] \right\}. \tag{12}$$

In short, by Equations (7), (8), and (12),

$$DIST_{res}(a, b) = \begin{cases} DIST_{res}^{1}(a, b) & \text{if } a \text{ or } b \text{ is a pure checker} \\ DIST_{res}^{2}(a, b) \cup DIST_{res}^{3}(a, b) & \text{otherwise} \end{cases} \tag{13}$$

Example 1. We briefly explain the formula with the example in Fig. 2. Let us show how to calculate DIST(b, c). There are two dependences between b and c, i.e., (b → c, 0, 0) and (b → c, 0, 1) (see Fig. 2c). They restrict $DIST_{dep}(b, c)$ to be [0, +∞) ∩ [−R, +∞) = [0, +∞), according to Equation (6). Now consider the resource constraints. Since neither b nor c is a pure checker, we have $DIST_{res}(b, c) = DIST_{res}^{2}(b, c) \cup DIST_{res}^{3}(b, c)$, according to Equation (13). $DIST_{res}^{2}(b, c)$ includes all numbers except R's multiples, according to Equation (8). $DIST_{res}^{3}(b, c)$ includes all R's multiples that are also within the set [R, +∞) ∪ (−∞, −2R], according to Equation (12), given II = 3 (see Fig. 2b), start(b) = 4, end(b) = 6, start(c) = 0, end(c) = 6 (see Fig. 2d). Together, the dependence and resource constraints require that r(c, i) − r(b, i) must be either within $DIST_{dep}(b, c) \cap DIST_{res}^{2}(b, c)$, or within $DIST_{dep}(b, c) \cap DIST_{res}^{3}(b, c)$. In the allocation in Fig. 2d, we can see r(c, i) − r(b, i) = 1, which is a value from $DIST_{dep}(b, c) \cap DIST_{res}^{3}(b, c)$: it is a multiple of R = 1 here.

Example 2. For the example in Fig. 2, a less efficient but still valid allocation is shown in Fig. 3. Here R = 2. We can see r(c, i) − r(b, i) = 1 as well, but this time, it is a value from $DIST_{dep}(b, c) \cap DIST_{res}^{2}(b, c)$: it is not a multiple of R = 2 here.

[Figure 3: A less efficient allocation with a bigger R (R = 2) for the example in Fig. 2.]

4.3 Algorithmic Framework
Based on the problem formulation in Section 4.1, we can find a register allocation by the following steps:

1. Dependence building.

In this step, we build the dependence graph. The graph is ensured to never have a local circuit in it. By local circuit, we refer to a circuit in which the distance of every dependence edge equals 0.

For every pair of may-alias operations, we build a dependence according to the dependence constraints in Section 4.1. The dependence is added to the dependence graph. Adding such a dependence will never cause a local circuit to be formed. According to the dependence constraints (Section 4.1), a(i) is after b(i + d) in the pipelined schedule, and the dependence is from a to b. If d = 0, it means that in the pipelined schedule, the dependence follows the reverse execution order of a single loop iteration. All the dependences made out of the constraints follow this same direction. Thus it is not possible for them to form a local circuit. For example, in Fig. 2c, all the dependences whose distance equals 0 are in the downward direction and cannot form a circuit.

Besides the dependences made from our dependence constraints in Section 4.1, there are some other dependences: before a loop is software pipelined, some other optimizations might have been performed. Just like software pipelining, these optimizations can require a certain ordering between lifetimes in order to detect aliases. These requirements have been passed down from previous compiler phases to be handled here together. Such a requirement asks the compiler to allocate lifetime a(i) to the same register as b(i + d), or to a lower-indexed register than b(i + d). Similar to our dependence constraints in Section 4.1, it is transformed into a dependence (a → b, 0, d). Usually, adding it to the dependence graph will not form a local circuit, with only one exception: when d = 0, and a(i) is before b(i + d) in the pipelined schedule (see footnote 2). This direction is exactly the opposite of the direction of the dependences made out of our dependence constraints. This exceptional case may cause a local circuit to be formed.
When a local circuit exists in the dependence graph, no allocation can respect all the dependences with rotating registers alone. Since this case is very rare, we temporarily ignore such local dependences and do not add them to the dependence graph (see footnote 3). We will use static registers to help respect them later. We call those ignored local dependences missing local dependences.

Footnote 2: This kind of restriction is called an anti-constraint in [19]. It is caused by load/store elimination before software pipelining. Unlike a normal dependence, it does not really mean that a(i) should check b(i + d). Instead, it just wants to make sure that b(i + d) does not check a(i), to avoid a false positive. For example, to avoid the false-positive in Fig. 1b, discussed in Section 2, a dependence b → c can be added, and that would lead to the allocation without any false-positive shown in Fig. 1d. As our solution misses the local dependences of this kind, we may have to use static registers to help avoid a false-positive, like that shown in Fig. 1c.

Footnote 3: An alternative solution is to allow such dependences to be added if they do not really cause any local circuit to form.

2. Modulo scheduling.
In this step, we can apply any modulo scheduling algorithm to schedule the lifetimes to the rotating registers, based on the dependence graph, and considering the modulo property and resource constraints described in Section 4.1. Modulo scheduling commonly searches for a feasible initiation interval within a range. For each initiation interval R under consideration, it would schedule lifetimes. For each lifetime, it ensures that all dependence and resource constraints between this lifetime and all the already scheduled lifetimes are respected. To ensure that, one can use the legal distance calculated by Equations (5), (6), and (13).

3. Removing potential false positives.
For each missing local dependence a → b, check if it is respected by the allocation.
If not, spill lifetime b(i), for all i, to a static alias register, i.e., deallocate the rotating register of b(i) and allocate b(i) a static register instead (see footnote 4). Another kind of false positive is introduced by register reuse during modulo scheduling. When an operation a(i) starts, it may accidentally check another operation b(j) that it does not intend to check, where i and j may be arbitrary: lifetime b(j) may be dead, but its register might not be cleaned yet, and thus a(i) will check b(j) in effect. In this case, we also spill lifetime b(j) to a static alias register. There are no other known sources of false positives so far. Now that all the ordering requirements have been respected, the resulting schedule, when it executes, will detect all aliases without any false positive.

4. Register assignment. The register allocation we have found assumes an infinite number of rotating registers, and it does not rotate the register file. In reality, the number of rotating registers is limited, and we have to rotate the register file periodically in order to clean up some dead lifetimes in the registers, and free the registers for other lifetimes that newly start. We achieve this purpose by inserting a rotation action, rotate R, into the software-pipelined schedule of the loop. When the schedule runs, once every initiation interval, R dead lifetimes are cleaned up and their registers are freed. Fig. 2e shows the register assignment for the allocation in Fig. 2d. Once every initiation interval, a rotate R action, where R = 1 in this case, is executed first before any operation. After a rotation action, the lifetimes are simply mapped to the registers according to their relative positions in the allocation. In Fig. 2e, we have annotated each lifetime with its corresponding register index.

Footnote 4: At this moment, for all i, the same static register is allocated to lifetime b(i). Later in the code generation phase (see an example compile flow in Fig. 5), that register may be renamed to more than one register, so that if lifetimes b(i) and b(i+1) overlap in time, b(i)'s register is different from b(i+1)'s. This is called Modulo Variable Expansion (MVE) [1]. The register needs to be expanded to at least len registers, where len is the length of any lifetime b(i). The general-purpose (integer/floating-point/predicate) lifetimes are handled exactly the same way in MVE during code generation, if they do not have rotating register file support. Otherwise, they can also be handled as a software pipelining problem (Section 6).

[Figure 4: Illustrating the earliest-start heuristic. (a) Without earliest-start heuristic. (b) With earliest-start heuristic.]

4.4 A Modulo Scheduling Algorithm Targeting Alias Register Allocation

In Step 2 of the algorithmic framework (Section 4.3), any modulo scheduling algorithm can be applied to allocate the rotating registers. The results are always correct. The same algorithm can be used to address both the traditional software pipelining problem and the new rotating register allocation problem, because both cases have a similar modulo property, and similar dependence and resource constraints.

However, the two problems do have one important difference: the optimization objectives. For the allocation problem, the major optimization objective is to minimize spills: spilling needs static alias registers, which are limited due to the encoding space constraint, as discussed in Section 2.2. Besides, one static alias register used in the schedule may be expanded to multiple ones during the next compiler phase (code generation) via Modulo Variable Expansion [1]. Our experience is that in Transmeta processors, there are usually few static alias registers left by the time software pipelining happens, after all other optimizations have been done, and they can be quickly used up. Thus we should try using the more scalable rotating alias registers whenever possible.

The next optimization objective of allocation is to minimize register usage. However, we can sacrifice this objective for the major objective, as long as the number of registers allocated does not exceed the number of available registers. That means we do not have to minimize R (the initiation interval). For our motivating example in Fig.
2, if we do not minimize R, we can have a different but still valid allocation, shown in Fig. 3.

For the traditional software pipelining problem, the optimization objective is exactly the opposite: to minimize the initiation interval. The smaller the initiation interval is, the faster the schedule runs. In fact, most of the existing studies on software pipelining, if not all, target how to minimize it.

With that difference in mind, here we propose a simple modulo scheduling algorithm specifically targeting rotating register allocation. It fits into Step 2 of the algorithmic framework in Section 4.3. It has two steps:

1. Local compaction. Schedule the lifetimes based on the resource constraints shown in Section 4.1, and the local dependences in the dependence graph built in Step 1 of the algorithmic framework. Loop-carried dependences in the graph are ignored in this step. This produces a schedule for the lifetimes in a single loop iteration.

In this step, any local scheduling method can be used. Say we use list scheduling, a well-adopted local scheduling approach. It prioritizes the ready lifetimes into a list. A lifetime becomes ready when all its predecessors, i.e., the lifetimes on which it depends locally, have been scheduled. For register i, where i starts from register 0 towards +∞, list scheduling picks up the lifetime with the top priority from the list, and allocates the register to the lifetime. It can continue to pick up the other, less prioritized lifetimes for this register as well, as long as they do not overlap in time. When there is no lifetime that can fit into this register, it proceeds to the next register.

How to prioritize the lifetimes is key in this process. We propose an earliest-start heuristic, which prioritizes the lifetime with the earliest start time. This heuristic helps minimize spills. In Fig. 4, assume there are 2 operations whose execution order in the pipelined schedule is a, b, and there is a missing local dependence, which requires that b should not check a. The lifetimes are shown as bars in the figure.
Lifetime b starts later than a. If we prioritize b, b will be allocated register 0, and a register 1. Then when operation b starts, it checks registers, starting from its own register. In that way, it will check register 1, which is for lifetime a. This is shown by the arrow in Fig. 4a. This check violates our assumption that b should not check a. However, if we adopt the earliest-start heuristic, lifetime a will be allocated first to register 0, and lifetime b is forced to be allocated to register 1. Then b cannot check a. See Fig. 4b.

With this heuristic, although a missing local dependence is not added into the dependence graph, it is still potentially handled by the scheduling. This reduces the chance of spilling later during Step 3 of the algorithmic framework. The experiments later will show that this heuristic does reduce spills. In the case of a tie, when two lifetimes have the same start time, the one with the shorter length is given higher priority, in the hope of packing as many lifetimes as possible into the current register.

2. Calculate R. Imagine every loop iteration has the same schedule for its lifetimes. Overlap the schedules of two successive loop iterations at a constant pace R, considering the ignored loop-carried dependences, as well as the collective resource usage of the overlapping iterations. This produces a more compact schedule that can reuse registers between iterations. Register reuse, however, may cause false positives, as we discussed in Step 3 of the algorithmic framework. To avoid this situation, we propose a max R heuristic: as long as the number of allocated registers does not exceed the number of available registers, use the maximum possible R. This does not affect the performance of the loop, as the rotating register file has uniform access latency for any register.

We call this approach Local Compaction followed by Pacing, or LCP for short.

[Figure 5: The compile flow. Loop scheduling: build dependences between operations; modulo scheduling of operations (with JITSP). Rotating alias register allocation: build dependences between lifetimes; modulo scheduling of lifetimes (with LCP, RS2, DESP, or JITSP); remove potential false positives; register assignment. Code generation.]

4.5 Complexity

The algorithmic framework in Section 4.3 has 4 steps. Let us analyze the complexity step by step.

In step 1, let N be the total number of memory operations in the software-pipelined schedule of a loop. Between a pair of may-alias operations, we may build at most C dependence edges, where C is the total number of loop iterations overlapped in the schedule, a constant. Therefore, we have at most C·N^2 edges, which take O(N^2) time to build.

In step 2, there can be numerous modulo scheduling algorithms with various complexities. The LCP algorithm we propose performs list scheduling, and then calculates R. List scheduling is also a class of algorithms with variable complexities. Let E_l and E_c be the number of local and loop-carried dependence edges, respectively, and M the number of memory units in a processor. One list scheduling method proposed by Rădulescu and van Gemund [17] takes O(N log(M) + E_l) time in our case. We scan the loop-carried dependence edges to compute R in O(E_c) time.

In step 3, we scan the missing local dependences. Empirically, it takes virtually no time, as there are usually no or few such dependences, as will be shown in Section 5. We also scan each pair of lifetimes to see if they inadvertently check each other. That takes O(N^2) time.

The last step takes constant time.

In summary, the time complexity of the algorithm is about O(N^2 + E_c), plus that of list scheduling (if we use the LCP algorithm in Step 2), which may take an additional O(N log(M) + E_l) time.
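The local compaction step of LCP (Section 4.4), with the earliest-start heuristic and its shorter-length tie-break, can be sketched as follows. This is a simplified illustration, not the implementation evaluated in the paper: the function name `local_compaction`, the half-open `(start, end)` encoding of a lifetime, and the treatment of a local dependence as a pure readiness condition are all assumptions, and the pacing step that calculates R is not covered.

```python
def local_compaction(lifetimes, local_deps=()):
    """Assign each lifetime of one loop iteration to a register index.

    lifetimes:  dict name -> (start, end), read as the half-open
                interval [start, end) (an assumption of this sketch).
    local_deps: iterable of (pred, succ) pairs; succ becomes ready only
                after pred has been allocated (simplifying assumption).
    """
    preds = {name: set() for name in lifetimes}
    for pred, succ in local_deps:
        preds[succ].add(pred)

    def priority(name):
        start, end = lifetimes[name]
        # Earliest start first; on a tie, the shorter lifetime wins.
        return (start, end - start, name)

    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    allocation = {}
    register = 0
    while len(allocation) < len(lifetimes):
        in_register = []  # intervals already packed into this register
        while True:
            ready = sorted(
                (n for n in lifetimes
                 if n not in allocation
                 and all(p in allocation for p in preds[n])),
                key=priority)
            pick = next(
                (n for n in ready
                 if not any(overlaps(lifetimes[n], iv) for iv in in_register)),
                None)
            if pick is None:
                break  # nothing else fits: move on to the next register
            allocation[pick] = register
            in_register.append(lifetimes[pick])
        if not in_register:
            raise ValueError("no ready lifetime: local dependences form a circuit")
        register += 1
    return allocation
```

For two overlapping lifetimes in the situation of Fig. 4, where a starts before b, the sketch places a in register 0 and b in register 1, which is the outcome the earliest-start heuristic is meant to produce.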
5. EXPERIMENTS

The algorithmic framework (Section 4.3) for rotating register allocation, the LCP algorithm (Section 4.4), along with a few other general-purpose software pipelining algorithms including RS2 (a variant of Rotation Scheduling [6, 15]), DESP (Decomposed Software Pipelining [4, 9, 2]), and JITSP (Just-In-Time Software Pipelining, an in-house method), have been implemented in Transmeta Code Morphing Software (CMS) [7] as part of research on software pipelining within dynamic compilers. RS2 rotates operations around the loop back edge until a tight schedule is formed. DESP divides the scheduling process into two steps to lower the scheduling complexity: the first step finds a schedule respecting dependence constraints only, and the second step respects only local dependences and resource constraints. JITSP is an in-house method that improves both RS2 and DESP in many aspects to make the scheduling process converge to optimal solutions quickly, and thus makes it feasible for dynamic compilers.

[Table 1: Characteristics of the dependence graphs of lifetimes. Columns: min, median, mean, max. Rows: # nodes, # dependences, # local dependences, # loop-carried dependences, # missing local dependences, maxlive.]

[Figure 6: Cumulative distribution of the number of rotating alias registers. Horizontal axis: # rotating alias registers. Vertical axis: % loops. Curves: Ideal, LCP, DESP, JITSP, RS2.]

[Table 2: Number of spilled lifetimes per loop iteration. Columns: min, median, mean, max. Rows: RS2, DESP, JITSP, LCP, LCP without max R heuristic, LCP without max R and earliest-start heuristics.]

[Table 3: Additional statistics on LCP. (a) The distance of LCP from an ideal allocator: # registers allocated - maxlive, % total loops. (b) Spilling in LCP per loop iteration: # lifetimes spilled, % total loops.]

The target architecture is a VLIW processor similar to Transmeta Efficeon [12], but besides the 14 static alias registers, it also has a variable-sized rotating alias register file. The rotating alias register file has not been implemented in any commercial processor yet, and thus a product-quality functional simulator is used in the experiments here. The processor uses the CMS inside it to translate x86 instructions into its internal VLIW instructions, and then runs them. We perform experiments with the loops in SPEC2000 benchmarks.

The overall compile flow is shown in Fig. 5. Our software pipelining module has 3 phases: scheduling, rotating alias register allocation, and code generation. It first schedules the loop to expose parallelism. Then it allocates the rotating registers for the software-pipelined schedule. Finally, it generates code.

In the experiments, the first phase uses JITSP to generate optimal or near-optimal software-pipelined schedules rapidly. In the second phase, for every pipelined schedule, we apply to it each of the above algorithms, LCP, RS2, DESP and JITSP, in order to compare their effectiveness for allocating the rotating registers (see footnote 5). Note that JITSP is used in both phases, and the majority of its implementation is shared by them.
There are only minor necessary differences between them: mainly, in the allocation phase, we need to encapsulate a lifetime as an operation, a register as a time step, and a time step as a resource, in order to reuse the scheduling algorithm.

It should be emphasized here that the scheduling phase and the alias register allocation phase are independent: the allocation phase processes any schedule generated by any software pipelining algorithm, without favoring any of them. Although our allocation solution is proposed and tested in a specific context, it is essentially independent of any scheduling algorithm, and is feasible for any compiler, static or dynamic.

The effectiveness of an algorithm is evaluated in the following aspects: given the same number of rotating registers, how many loops can be successfully allocated all the registers they need? How many lifetimes, in a single loop iteration, are spilled afterwards during Step 3 of the algorithmic framework (Section 4.3)? And how fast is the algorithm?

To evaluate how good a register allocation is, we assume there is an ideal allocator, which allocates exactly maxlive registers. MaxLive is the maximum number of lifetimes live simultaneously in a software-pipelined schedule. It is a lower bound of the number of registers to be allocated for the schedule. This is an ideal bound and may or may not be achievable. Any allocation, even if it is optimal, needs at least maxlive registers. The closer the number of registers allocated is to maxlive, the better the allocation is.

Footnote 5: The settings of RS2 are δ = 8 and ρ = 1, and at most 5 rotations are allowed.

The experiments have two major results:

1. First, the experiments have validated the correctness of our problem formulation and solution. For a piece of x86 code, our simulator can run the x86 code and the VLIW instructions generated by CMS from it, and periodically compare the memory and register contents between them. This is known as co-simulation. Any discrepancy will be reported.
We have not seen any discrepancy so far.

2. Second, the experiments show that LCP is usually the most effective among all the algorithms. It generates ideal or near-ideal allocations for most loops. It minimizes spills very well, and it is faster than all the other algorithms.

In addition, we also show that rotating alias registers are important to enable instruction-level parallelism to be exploited from loops.

There are 11,825 software-pipelined loops for which we perform rotating register allocation. Table 1 characterizes the dependence graphs of the lifetimes. A graph has 2 to 8 nodes. The median and mean are 6 and 9.64, respectively. Since the median is less than the mean, the number of nodes skews towards small numbers: the graph tends to have a small number of nodes. This skewed distribution appears in the other characteristics as well. For example, a graph can have 1 to 3933 dependence edges, while the median and mean are 7 and 79.2, which indicates that the graph tends to have a small number of dependences. Among the dependences, most are loop-carried: on average, there are only 11.2 local dependences, but 68.0 loop-carried ones. Their maxima are even more strikingly apart: at most, there are 369 local dependences, but there can be 3917 loop-carried ones. So in general, there are far fewer local dependences than loop-carried dependences.

There are usually few missing local dependences. Although a graph can have up to 96 missing local dependences, the median is 0, and the mean is 0.42. These indicate that it is extremely rare to have any missing local dependences.

Table 1 also characterizes the distribution of maxlive. It ranges from 1 to 849, with a median of 3. This suggests that the number of lifetimes live simultaneously tends to be small.

Fig. 6 shows the cumulative distribution of the number of rotating registers. For a given number of rotating registers, it shows the percentage of the software-pipelined loops whose requirement of rotating registers can be met.
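The two quantities behind this evaluation, maxlive and the per-register-count coverage plotted in Fig. 6, can be sketched as follows. This is only an illustration under stated assumptions: the names `max_live` and `coverage`, the half-open `(start, end)` encoding of a lifetime, and the parameter `ii` (the initiation interval of the operation schedule, in discrete time steps) are hypothetical and not taken from the paper.

```python
def max_live(lifetimes, ii):
    """Maximum number of lifetimes live at once in the steady state.

    lifetimes: list of (start, end) pairs from one loop iteration, read
               as half-open intervals [start, end) (an assumption).
    ii:        initiation interval of the operation schedule; successive
               iterations are assumed to start ii time steps apart.
    """
    if ii <= 0:
        raise ValueError("ii must be positive")
    live = [0] * ii
    for start, end in lifetimes:
        # A lifetime longer than ii overlaps its own copies from other
        # iterations, so it is counted once per time step it covers.
        for t in range(start, end):
            live[t % ii] += 1
    return max(live)


def coverage(registers_needed, available):
    """Percentage of loops whose register requirement fits in `available`."""
    if not registers_needed:
        return 0.0
    met = sum(1 for needed in registers_needed if needed <= available)
    return 100.0 * met / len(registers_needed)
```

Evaluating `coverage` for increasing values of `available` yields one curve of the kind shown in Fig. 6; feeding it the `max_live` value of every loop corresponds to the ideal allocator.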
All the algorithms, except RS2, have close-to-ideal results. For example, given 64 rotating registers, the percentages of the loops covered by an ideal allocator, LCP, DESP, and JITSP are 97.0%, 95.9%, 94.1%, and 94.0%, respectively. The differences are small, but LCP is noticeably better than the others, except the ideal allocator. In contrast, RS2 covers only 77.5% of the loops. The reason why RS2 performs significantly worse is that it aggressively schedules lifetimes in order to minimize R, the initiation interval. That is the main target of general-purpose software pipelining. As discussed in Section 4.4, this is not necessarily good for register allocation. We observe from the experiments that it seems to be a general phenomenon: for this specific register allocation
