Object and Native Code Thread Mobility Among Heterogeneous Computers


Bjarne Steensgaard, Microsoft Research, One Microsoft Way, Redmond, WA, USA
Eric Jul, DIKU (Dept. of Computer Science), University of Copenhagen, Universitetsparken 1, DK-2100 København Ø, Denmark

Abstract

We present a technique for moving objects and threads among heterogeneous computers at the native code level. To enable mobility of threads running native code, we convert thread states among machine-dependent and machine-independent formats. We introduce the concept of bus stops, which are machine-independent representations of program points as represented by program counter values. The concept of bus stops can be used also for other purposes, e.g., to aid inspecting and debugging optimized code, garbage collection etc. We also discuss techniques for thread mobility among processors executing differently optimized codes. We demonstrate the viability of our ideas by providing a prototype implementation of object and thread mobility among heterogeneous computers. The prototype uses the Emerald distributed programming language without modification; we have merely extended the Emerald runtime system and the code generator of the Emerald compiler. Our extensions allow object and thread mobility among VAX, Sun-3, HP9000/300, and Sun SPARC workstations. The excellent intra-node performance of the original homogeneous Emerald is retained: migrated threads run at native code speed before and after migration; the same speed as on homogeneous Emerald and close to C code performance. Our implementation of mobility has not been optimized: thread mobility and trans-architecture invocations take about 60% longer than in the homogeneous implementation. We believe this is the first implementation of full object and thread mobility among heterogeneous computers with threads executing native code.
1 Introduction

A trend in distributed operating systems has been to either support communication and remote procedure call [BN84] among heterogeneous computers [BCL+87, Gib87] or to support object and thread/process mobility among homogeneous computers [Jul89, Dou87]. We have combined the two by extending the Emerald system [BHJL86, BHJ+87, Hut87, HRB+87, JLHB88, Jul89, RTL+91] to support object and native code thread mobility among heterogeneous computers connected in a local network as shown in Figure 1.

[Figure 1: A network of heterogeneous nodes. Sample configuration of a local network with heterogeneous workstations among which we are able to move both objects and native code threads in our prototype implementation. Nodes shown: Sun3 workstation, HP9000/300 workstation, SPARC laptop, SPARC workstation, and VAX workstation, connected by Ethernet.]

In this paper, we describe the problems encountered when enhancing a homogeneous object system with mobility to support heterogeneous architectures. We present the concrete techniques used in our implementation and explain how these techniques are special cases of more general methods for mapping program states among machine-dependent and machine-independent formats.

Author note: Work done while at DIKU, University of Copenhagen, Denmark.

Copyright (c) 1995 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that new copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc, Fax +1 (212) , or <permissions@acm.org>.
We provide performance numbers for homogeneous and heterogeneous thread migration between four different architectures.

By object mobility we mean that an object in an object based programming system is able to physically change location within a set of processor nodes (in our case: workstations). Mobility is fine-grained in the sense that individual objects, regardless of size, can move independent of other objects residing on the same processor node. Mobility is not restricted to mobility of entire address spaces as in, e.g., the Sprite operating system [Dou87] and the DEMOS/MP operating system [PM83].

By a thread we mean a light-weight thread of control running (pseudo-) concurrently with other threads within a single address space. By thread mobility in an object system we mean that a thread is able to move among processor nodes. In the absence of object mobility, thread mobility is nothing but the ability to perform remote procedure calls. In the presence of object mobility, an active thread may be executing an operation in an object that is being moved. When an active thread is inside a moving object, the part of the thread state (activation/call stack) describing the thread inside the object must be moved with the object.

Example 1. Consider an object X residing on node A invoking an operation in an object Y residing on node B, the effect of the operation being that X is moved to node C. A remote procedure call is performed to invoke the operation in Y. When the thread returns from executing the operation in Y, execution has to resume on node C where X is now residing. The system has to move part of the call stack of the existing thread from node A to node C.

In our model of thread and object mobility, threads follow objects around as the objects are moved. The reason for this is that Emerald was designed for robust distributed computing: node crashes are considered normal, expected events. We want to minimize residual dependencies [PM83], e.g., by co-locating threads with the objects within which they are executing. Our model differs from the model used in Oblique [Car95] where the objects are moved to where the threads are executing.

Object and thread mobility among heterogeneous computers is straightforward, if a system executes machine-independent byte codes and operates on machine-independent data. However, the price of this painless migration is execution inefficiency due to interpretation. Our goal, which we have achieved, is to offer object and thread mobility while retaining the local efficiency of programs that comes from executing native code and operating on machine-dependent data.
Thus a thread should run no slower after migration than before and no slower than a comparable thread in a comparable homogeneous system. Furthermore, we provide heterogeneous mobility without any language modification.

Object and native code thread mobility among heterogeneous computers is non-trivial because code on heterogeneous computers may differ in the use of registers, number and type of available registers, temporary values in registers or on the stack, instruction sets, program counter values, data formats, and different levels of optimizations.

The problems caused by differences in use of registers, number and type of registers, temporary values in registers or on the stack, and data formats can all be solved by meticulously keeping track of where different values are placed in object data areas and activation record data areas on different platforms. Such meticulous tracking requires extensive compiler and runtime support. However, this tracking is fundamentally no different than that which is required for supporting homogeneous object and thread mobility. Even in the homogeneous case, the compiler must produce extensive information concerning the location and type of variables that must be converted during the move [Hut87, Jul89]. The advantage of such extensive compiler support is that node-local operations are very efficient (as efficient as comparable C programs) because the runtime overhead is restricted to actual migration operations while non-migration operations are not affected at all.

The problems caused by differences in instruction sets, program counter values, and levels of optimizations are non-trivial because there is no immediate way of translating from code on one architecture to code on another architecture.
Two important observations point at a possible solution: 1) object and thread mobility is trivial if we execute machine-independent code and work on machine-independent data, and 2) when moving ordinary data such as numbers and strings among heterogeneous computers we can convert the data to a machine-independent format at the originating node and then translate from the machine-independent format to the machine-dependent format at the receiving node. If we can convert the native code and corresponding program counter values from the machine-dependent format used on a given architecture to a machine-independent format and vice versa then object and thread mobility among heterogeneous computers becomes possible. We introduce the concept of bus stops to represent program counter values in a machine-independent manner.

To demonstrate our ideas we have taken an existing object based system with migration, the Emerald system (see [JLHB88], footnote 1), and enhanced it with support for heterogeneous migration. The Emerald language includes constructs for specifying object mobility (and thereby also thread mobility). The language can be used without modification. Our enhanced prototype supports object and native code thread mobility among VAX (footnote 2), Sun-3, HP9000/300, and Sun SPARC workstations. The implementation is meant to demonstrate the viability of the concept of object and native code mobility among heterogeneous computers; we have made no attempt to optimize inter-node performance. However, while providing native code mobility we retain the performance advantage of executing native code; intra-node performance on a given processor is independent of whether the thread was created on the processor or migrated to the processor, and is the same as on the original Emerald system, which supports only homogeneous mobility.
We have not attempted to further justify the need for heterogeneous mobility; it should be obvious that any homogeneous migration system can take advantage of transparently becoming a heterogeneous migration system.

The rest of this paper consists of three parts. In Section 2 we discuss general issues related to object and thread mobility among heterogeneous processors. In Section 3 we describe our prototype implementation and present performance numbers. In Section 4 we suggest future work on mobility among heterogeneous processors. Finally, in Sections 5 and 6 we discuss related work and present our conclusions.

2 Mobility Issues for Heterogeneous Systems

When discussing mobility of data and threads among processors, it is important to specify the characteristics of the data and the threads to be moved. It is, for example, easy to implement data and thread mobility among heterogeneous processors, if both data and thread states always are represented in machine-independent format. Mobility is much harder, if data or thread states are represented in a format tailored to a specific processor (e.g., native code) as is the case for most efficient systems. In this section, we describe important characteristics of machine-dependent data and thread states and in general discuss how to implement mobility given a certain set of characteristics.

2.1 Migrating Data

For performance reasons, most systems use machine-dependent formats for ordinary data such as numbers, structures, strings, and vectors. For example, most systems use the processor's native representation for integers (little or big endian) and floating point numbers (IEEE or non-IEEE). A notable exception is Tcl [Ous94], which represents all types of data as strings.
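The difference between little- and big-endian integer representations, and the usual remedy of converting to an agreed-upon network byte order, can be sketched in Python as follows. This is only an illustration using the standard `struct` module; the function names `to_wire` and `from_wire` and the record layout (one 16-bit and one 32-bit unsigned integer) are assumptions made for the example and are not taken from Emerald.

```python
import struct
import sys

# "!" selects network (big-endian) byte order with standard sizes:
# H = unsigned 16-bit integer, I = unsigned 32-bit integer.
WIRE_FORMAT = "!HI"


def to_wire(short_value, long_value):
    """Convert host integers to the agreed-upon network representation."""
    return struct.pack(WIRE_FORMAT, short_value, long_value)


def from_wire(payload):
    """Convert the network representation back to host integers."""
    return struct.unpack(WIRE_FORMAT, payload)


if __name__ == "__main__":
    payload = to_wire(0x1234, 0xDEADBEEF)
    # The payload bytes are the same on every host, whatever sys.byteorder is.
    print(sys.byteorder, payload.hex(), from_wire(payload))
```

With a common wire format, each of n machine formats needs only one routine to encode and one to decode (2n in total), whereas direct conversion between every ordered pair of formats needs n(n-1) routines; for four formats that is 8 versus 12.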
If data is represented in different machine-dependent formats on two different processors, mobility of data among the processors is typically done by converting the data representation to and from a commonly agreed upon format. For example, in UNIX implementations, it is common to convert 16 and 32 bit integers to network byte order before sending them over the network (e.g., when performing a remote procedure call) and converting them back to host byte order at the receiving end, e.g., using the htons(3), ntohs(3), htonl(3), and ntohl(3) library functions [Sun88]. Heterogeneous RPC implementations usually also support conversion of more complicated data types [BCL+87, Gib87, Sun84, IOS94].

It is also possible for each processor type to use its own machine-dependent format and then convert data between machine-dependent formats as required by each data transfer [SC88]. Unfortunately, the number of conversion routines required is quadratic in the number of data formats. Furthermore, supporting a new data format requires modifying existing systems by adding the necessary data conversion routines.

Footnote 1: Originally presented at SOSP '87.
Footnote 2: Unfortunately, our last VAX died during this project so our performance numbers are incomplete for the VAX case.

2.2 Migrating Threads

Moving a thread really amounts to moving a thread state. The thread state is essentially composed of a data component representing the values of local variables in the activation records on the call stack and a code dependent component consisting of the thread's executable code and pointers into this code (program counter values). Note that in an object thread system, file descriptors and similar operating system data is usually represented merely as references to other objects and so is not part of the thread state. The interesting part is the code dependent component because the data component of an activation record is really no different from normal data which can be moved as described in the previous section.

There are two important characteristics of the code component. The first characteristic concerns the executable code: is it machine-dependent or is it machine-independent? Native machine code is a typical example of machine-dependent code, while source code and byte codes are typical examples of machine-independent code.
The second important characteristic is whether or not the code at the originating processor has been subjected to the same transformations and optimizations as the code on the destination processor.

Migrating Machine-dependent Code using Bus Stops

In this section, we discuss the problem of migrating machine-dependent code and present the concept of bus stops as an implementation technique.

If the code for a migrating object is machine-independent, e.g., byte code, the same code can be executed at both the originating node and the destination node. The issues related to mobility among heterogeneous processors are then no different than for mobility among homogeneous processors, and the move can be performed as described in [JLHB88].

However, if the code is machine-dependent, e.g., native code, we cannot execute the same code on heterogeneous processors unless we implement interpreters of the various other machine-dependent formats on each type of processor, which typically is very inefficient. We must therefore have different versions of the code for execution on different types of processors. For example, if we have a thread running on a VAX processor and want to move it to a SPARC processor, we need machine specific versions of the code for both types of processors. How we obtain these machine specific versions is irrelevant in this context.

We see three levels of thread states as illustrated in Figure 2. The top level consists of thread states resulting from interpretation of source code. The middle level represents lower-level machine-independent thread states resulting from execution of, for example, byte code representations of programs. The bottom level represents machine-dependent thread states resulting from execution of, for example, native code. Program execution lower in the hierarchy is typically faster than program execution higher up.

[Figure 2: The thread state specialization hierarchy. The MI forms are machine-independent forms, and the MD forms are machine-dependent forms. The solid arrows illustrate how compilation can statically specialize thread states. The dotted arrows indicate our dynamic transformations of thread states. Level labels in the figure: source code level, intermediate code level, byte code level, native code level. Node labels include Source, MI 1, MD 1, and MD 2.]

The solid arrows illustrate how compiling can specialize program code for efficiency purposes. This transformation is performed statically before the program is executed. The dotted arrows indicate how we implement thread state mobility by transforming a machine-dependent thread state to a machine-independent thread state and specializing the result to a different but semantically identical machine-dependent thread state. Such thread state transformations are performed during program execution when threads are moved.

If the machine-dependent code is native machine code, likely differences between the codes include: non-isomorphic sets of registers, different use of registers, different activation record layout, different object code (different instructions), and program points mapping to different program counter values.

The differences in available registers, use of registers, and layout of activation records are essentially only differences in data representation. If we have sufficient information about how registers and temporary variables in activation records are being used at each program point, we can convert the call stack with all the relevant activation records to and from a machine-independent format.

The differences in program counter values for the same program point are slightly more troublesome. To move program counter values, we must compute program counter values on the destination processor that correspond to the program counter values on the origin processor. However, there may be program counter values for one type of machine-dependent code that do not have corresponding program counter values in a different type of machine-dependent code.
Even given the assumption that the same transformations and optimizations have been performed on the different types of code, non-correspondence may happen when certain operations are non-atomic on some processors. For example, unlinking an element from a doubly linked list is atomic on the VAX processor but requires multiple instructions on the SPARC processor.

One way to avoid this problem is to simply prevent the mobility layer of the runtime system from ever seeing such program counter values. We say that the critical program counter values are made invisible and that the remaining values are visible. Such restriction of visibility can be achieved in multiple ways.

The Trellis/Owl system [SCB+86] permitted transfer of control to the runtime system at any time (e.g., by interrupts), but if the transfer of control happened in a critical region, the top layer of the runtime system would execute by interpretation the necessary number of instructions to exit the critical region before calling the lower layers of the runtime system that manipulated thread states.

Thus Trellis/Owl simply avoids having to deal with seeing program counters inside critical sections. We may consider any program counter value to point into a critical region, if the program counter value does not have corresponding program counter values for (at least) one different architecture. By always interpreting instructions until reaching a safe program counter value, the thread state manipulating parts of the runtime system (e.g., the parts implementing thread mobility) will never see a program counter value that does not have corresponding program counter values in all other types of machine-dependent code.

The Emerald system relies on its compiler to generate code that transfers control to the runtime system instead of having the runtime system preempt the threads [Jul89]. Control can only be transferred to the runtime system at suitably chosen points: at system calls, operation invocation entry (procedure/function calls), and at the bottom of loops. If the same transformations and optimizations have been performed on all types of machine-dependent code then choosing these points ensures that the runtime system only sees program counter values that have corresponding program counter values for different types of machine-dependent code. In Emerald, this technique is also used to provide the garbage collector with well-defined states for easy pointer identification [JLHB88, JJ92, Juu93].

We can enumerate all the program counter values that have corresponding program counter values in different types of machine-dependent code. The number of such program counter values will be the same on all processor types. We can perform this enumeration such that it is consistent across the different types of processors. The assigned numbers then uniquely specify program points independent of the type of code being executed. These numbers can therefore be used as machine-independent specifications of program points. We use the metaphor bus stops to describe the enumerated program points.
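The enumeration described above can be pictured as a pair of lookup tables per architecture. The following Python sketch is only an illustration of that idea: the table layout, the function names, and the addresses are invented for the example and do not describe the data structures of the Emerald runtime.

```python
# Hypothetical per-architecture tables: bus stop number -> native program
# counter value. All addresses are invented for illustration only.
BUS_STOP_TO_PC = {
    "vax":   {0: 0x1000, 1: 0x1012, 2: 0x1030},
    "sparc": {0: 0x4000, 1: 0x4028, 2: 0x4064},
}

# Inverse tables, built once: native program counter -> bus stop number.
PC_TO_BUS_STOP = {
    arch: {pc: stop for stop, pc in table.items()}
    for arch, table in BUS_STOP_TO_PC.items()
}


def pc_to_bus_stop(arch, pc):
    """Machine-dependent program counter -> machine-independent bus stop."""
    try:
        return PC_TO_BUS_STOP[arch][pc]
    except KeyError:
        # A program counter that is not a bus stop must never reach the
        # mobility layer; report it instead of guessing.
        raise ValueError(f"pc {pc:#x} on {arch} is not a bus stop") from None


def bus_stop_to_pc(arch, stop):
    """Machine-independent bus stop -> program counter on the given arch."""
    return BUS_STOP_TO_PC[arch][stop]


def translate_pc(src_arch, dst_arch, pc):
    """Translate a visible program counter between two architectures."""
    return bus_stop_to_pc(dst_arch, pc_to_bus_stop(src_arch, pc))
```

For instance, `translate_pc("vax", "sparc", 0x1012)` goes through bus stop 1 and yields the SPARC address registered for that stop. Because every architecture numbers the same program points identically, only the bus stop number has to cross the network.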
There may be many different sequences of basic operations that can be performed between bus stops; we do not really care about all these different ways, as we only ever stop at the bus stops. A compiler is free to reorder and optimize between bus stops.

Bus stops can be considered safe migration points where any native code generator must ensure that the thread state can be translated to and from a machine-independent form. Given a set of bus stops, the code generator is free to optimize code between bus stops in any way, as the optimization transformations are not visible to the runtime system. In this respect, bus stops are related to the synchronization points of ANSI C [Ame89].

Compiler support is necessary to generate both the information needed to describe machine-dependent use of registers and temporaries in activation records and the bus stop information. No change to the generated code is necessary.

A considerable amount of data conversion has to be performed by the runtime system when moving machine-dependent thread state. Mobility of machine-dependent thread state among heterogeneous processors is likely to be more expensive than mobility of machine-independent thread states. However, the advantage of converting between machine-dependent and machine-independent formats is that native code performance can be achieved while thread states are not being moved. The solution therefore appears acceptable when intra-node runtime performance is more important than thread mobility performance. Even when thread mobility performance is important, our unoptimized implementation of heterogeneous thread mobility is acceptable in many cases: a thread move takes only about 60% longer than on the version supporting homogeneous mobility.

In this section, we have presented our concept of bus stops and described how this technique can be used to achieve heterogeneous thread mobility while allowing for compiler optimizations between bus stops. In the next section, we discuss allowing compiler optimization across bus stops.
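The compiler-generated description of where values live at a bus stop can be thought of as a per-architecture template. The Python sketch below is a simplified illustration under stated assumptions: the template format, the register names, and the variables `i` and `total` are hypothetical, values are plain integers, and data format conversion (byte order, floating point) is left out.

```python
# Hypothetical compiler-generated templates: for each (architecture, bus stop)
# pair, where every live variable of the activation record is kept.
# A location is ("reg", register_name) or ("stack", frame_offset).
TEMPLATES = {
    ("vax", 1):   {"i": ("reg", "r6"), "total": ("stack", 8)},
    ("sparc", 1): {"i": ("reg", "l0"), "total": ("reg", "l1")},
}


def to_machine_independent(arch, stop, registers, stack):
    """Collect the live values of one activation record, keyed by variable."""
    values = {}
    for var, (kind, where) in TEMPLATES[(arch, stop)].items():
        values[var] = registers[where] if kind == "reg" else stack[where]
    return {"bus_stop": stop, "values": values}


def to_machine_dependent(arch, record):
    """Place the values where the destination architecture expects them."""
    registers, stack = {}, {}
    template = TEMPLATES[(arch, record["bus_stop"])]
    for var, (kind, where) in template.items():
        target = registers if kind == "reg" else stack
        target[where] = record["values"][var]
    return registers, stack
```

A move from the VAX to the SPARC would call `to_machine_independent("vax", ...)` at the origin and `to_machine_dependent("sparc", ...)` at the destination. The generated code itself is untouched; only the templates differ between architectures.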
[Figure 3: Bridging Code Example: Example of how a machine-independent code sequence (abstract) may be optimized in two different ways by code motion (code1 and code2).

  abstract: o1; o2; o3; switch(); o4; o5; o6;
  code1:    o1; switch(); o2; o3; o4; o5; o6;
  code2:    o2; o5; switch(); o4; o1; o3; o6;]

Differences in Optimization

Orthogonal to the issue of machine-dependent vs. independent code, is whether or not the code at the source processor is transformed and optimized in exactly the same way as the code at the destination processor. Even given homogeneous machines, possible differences in transformations include code motion to change lifetimes of values, strength reduction, etc.

Thread mobility is fairly easy using bus stops, if no visible program counter value points into the code in question. So one way to allow mobility among differently optimized codes is to only permit code transformations between visible program points. However, this is likely too restrictive, allowing only small peephole optimizations. Modern compiler techniques often result in more general code reorganization.

In this section we describe how to enable thread mobility in the presence of general code transformations between the source and destination codes. In contrast to the techniques described in the previous section, the techniques described in this section are not backed up by a prototype implementation demonstrating the validity of the techniques. However, the issues are worth considering, and we believe our suggested techniques to be applicable.

Many different types of program transformations and optimizations exist. For now, we will only consider various types of code motion transformations, data changing optimizations such as strength reduction in loops, and RISCifying transformations, replacing a complex operation with several simpler ones (footnote 3). Generalizations are possible but are outside the scope of this paper.

By code motion we mean reordering of instructions that may occur on a given path through the program.
From the perspective of thread state mobility, code motion may have the effect that instructions are moved around a potentially visible program point. The instructions may have side effects, so it is important that the instructions are executed exactly once. One way to overcome code motion differences between different compiled instances of the code is to build bridging code between the origin and destination instances of the code. The different instances of code can be viewed as the super-blocks of Trace Scheduling [Fis81]; the bridging code is then equivalent to the entry paths to and the exit paths from the super-block, and the bridging code can be constructed using similar techniques.

Footnote 3: RISCification is common in, e.g., compilers for the Pentium processor [Int94] where only a subset of simple instructions may be executed simultaneously with other instructions in the processor's other execution pipeline.

Example 2. Consider the code sequences shown in Figure 3. The leftmost code sequence is the unoptimized code sequence handed to the backend of the compiler. The two other code sequences are examples of how the original code sequence can be modified by code motion transformations.

Assume that code1 is part of the code for an object to be moved, and the program counter value corresponding to the switch() operation is visible. The program counter may be visible, if switch() is either a procedure call or a system call. The object is to be moved to a processor where code2 is to be used instead of code1. Because of the code motion transformations, there is no direct correspondent in code2 to the visible program point in code1 (the program point is not a bus stop). Therefore, we must generate bridging code to overcome the differences.

[Figure 4: Example of bridging code necessary to change from using the code sequence code1 to using the code sequence code2 at the switch() in code1.

  code1:  o1; switch(); o2; o3; o4; o5; o6;
  bridge: o2; o4; o5;
  code2:  o2; o5; switch(); o4; o1; o3; o6;

Arrows in the figure lead from the switch() in code1 to the bridge and from the end of the bridge into code2.]

Figure 4 illustrates the bridging code necessary to overcome the differences between the code sequences code1 and code2. Operation o1 has already been executed at the time control reaches the switch() operation in code1. There is a bus stop at operation o6 in both codes, at which point we can start executing the instructions from code2. Before doing so, we have to ensure that operations o2, o3, o4, and o5 are executed exactly once. Operation o3 can be executed in code2. To execute the remaining operations, we generate a new code fragment containing o2, o4, and o5. After o5, the code fragment jumps to o3 in code2. The program counter value at switch() in code1 is translated to the program counter value indicating o2 in the new code fragment.

Code motion can be implemented by a very small set of primitive operations on control flow graphs. Assume that the optimization phase of the compiler is given an initial control flow graph. We can then duplicate the control flow graph and create links between identical program points in the two versions of the graph. If the optimization phase of the compiler optimizes one of the two versions of the control flow graph by the primitive code motion operations, each such operation can automatically generate the bridging code in both directions between the original and the optimized control flow graph.
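The bridge of Figure 4 can be reproduced with a deliberately small Python sketch. It is not the control-flow-graph technique outlined above: it assumes straight-line code in which every operation occurs exactly once, treats the stopping operation itself as completed, and ignores data dependencies between operations. The function name `build_bridge` is hypothetical.

```python
SWITCH = "switch()"

ABSTRACT = ["o1", "o2", "o3", SWITCH, "o4", "o5", "o6"]
CODE1 = ["o1", SWITCH, "o2", "o3", "o4", "o5", "o6"]
CODE2 = ["o2", "o5", SWITCH, "o4", "o1", "o3", "o6"]


def build_bridge(src, dst, stop=SWITCH):
    """Return (bridge, entry): the operations to run in a fresh fragment and
    the operation in dst at which the fragment jumps into dst.

    Assumes straight-line code, one occurrence per operation, and that the
    thread is halted at `stop` in src with `stop` counted as executed.
    """
    done = set(src[: src.index(stop) + 1])

    # Walk dst backwards to find its longest tail that consists only of
    # operations the thread has not executed yet; that tail can be reused.
    entry = len(dst)
    while entry > 0 and dst[entry - 1] not in done:
        entry -= 1
    tail = set(dst[entry:])

    # Everything still pending that the reused tail does not cover goes into
    # the bridge, in the order of the source code.
    bridge = [op for op in src if op not in done and op not in tail]
    entry_op = dst[entry] if entry < len(dst) else None
    return bridge, entry_op


if __name__ == "__main__":
    print(build_bridge(CODE1, CODE2))
    print(build_bridge(CODE1, ABSTRACT))
```

Under these assumptions, bridging from code1 to code2 yields the fragment o2; o4; o5 followed by a jump to o3 in code2, which matches Figure 4, so each of o2 through o6 runs exactly once.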
If the primitive code motion operations are all reversible, reversing the sequence of code motion operations and performing the reverse of each code motion operation on the optimized version of the control flow graph will yield the original control flow graph.

Given the optimized code, the original code, the bridging code between the two, and a specification of how to construct the bridging code from the original code (in terms of primitive code editing operations), it is possible to implement thread state mobility among processors executing code that has been subject to different code motion transformations.

Assume that a thread has been temporarily halted at a certain program point in the optimized code on the originating processor. The program point can be specified by two components: 1) how to create the bridging code to the original, unoptimized code, and 2) the point in the original code reached by the bridging code. At the destination processor, the bridging code from the visible program point (at the source node) to the original code can be constructed using the set of primitive editing operations from 1). We then append to that the bridging code from the reached program point in the original, unoptimized code to the optimized code on the destination processor. The result is bridging code from the optimized code on the origin processor to the optimized code on the destination processor. By making the thread start executing the bridging code on the destination processor, we ensure that each operation is executed exactly once (as it should be) and that the thread eventually will execute optimized code on the destination processor if it is not migrated while still executing the bridging code.

Example 3. The bridging code shown in Figure 4 could be generated by first generating bridging code from code1 to abstract shown in Figure 3 and then generating bridging code from abstract to code2. The bridging code from code1 to abstract consists of operations o2 and o3.
Bridging from abstract to code2 removes o3 and inserts o4 and o5 in the bridging code.

The thread state may, of course, be moved once more before it has finished executing the bridging code. This is not a problem, if we either avoid bus stops in bridging code or, more generally, if we retain the description of how the bridging code was constructed (in terms of primitive editing operations). Bridging code from bridging code can be constructed the same way as the bridging code between the original and the optimized control flow graphs.

Strength reduction in loops is an optimization that not only requires transformation of code but also requires transformation of data in the thread state. If the compiler provides a complete description of the transformation, we can convert the thread state data as necessary while constructing bridging code between the different types of optimized code.

Instruction selection is a very fundamental operation during code generation. Assume the compiler backend is given a control flow graph representation of the program. Some of the operations in the control flow graph may perhaps be implementable by single machine code instructions on the processor we generate code for. Other operations (e.g., unlinking an element from a doubly linked list) may not be implementable by a single machine instruction. These more complex operations may be replaced with a sequence of other operations, which may be implemented by single machine instructions. Replacing a complex instruction with several simpler instructions may also be desirable for RISCification purposes. In the context of thread state mobility, it is problematic if a visible program counter value indicates a program point where some, but not all, of the instructions resulting from the instruction selection for a single operation have been executed.
If the operation that results in multiple instructions on the originating processor only results in a single instruction (or in multiple different instructions) on the destination processor, there is no direct correspondence between operations in the different codes. Again, a possible solution is to generate bridging code based on instruction selection information generated by the compiler.

As mentioned in the beginning of this section, the issue of mobility of threads between (processors with) code optimized in different ways is orthogonal to the issue of mobility between heterogeneous processors. They are independent system dimensions.

2.3 System Dimensions and Compiler Support

There are three system dimensions that are important when considering object and thread mobility among heterogeneous processors:

1. Machine-dependent vs. machine-independent data,
2. Machine-dependent vs. machine-independent code,
3. Existence or non-existence of codes that visibly have been transformed or optimized differently.

If data is represented in a machine-dependent form at any time on any type of processor, compiler support is required to enable object mobility. The type of information required is typically limited to structure layout and the types of the values kept in the structure slots. Such information is usually also required by symbolic debuggers. Typically, only a small amount of extra information is required to support mobility.
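As a rough illustration of the compiler support needed for the first dimension, the sketch below converts a machine-dependent record into a machine-independent byte sequence using nothing but a layout description (slot names and slot types). The layout table, the type names `int32` and `float64`, and the choice of big-endian as the network byte order are assumptions made for this example; they are not the Emerald template format.

```python
import struct

# Hypothetical layout description: (slot name, slot type) pairs.
# A compiler could emit something of this kind alongside debugging information.
LAYOUT = [("count", "int32"), ("weight", "float64"), ("flags", "int32")]

# struct format codes for the slot types assumed in this example.
CODES = {"int32": "i", "float64": "d"}


def _format(layout, byte_order):
    """Build a struct format string; byte_order is '<' (little) or '>' (big)."""
    return byte_order + "".join(CODES[slot_type] for _, slot_type in layout)


def unpack_native(raw, layout, byte_order):
    """Decode a machine-dependent record into a name -> value dictionary."""
    values = struct.unpack(_format(layout, byte_order), raw)
    return {name: value for (name, _), value in zip(layout, values)}


def pack_network(record, layout):
    """Encode a record in the assumed network format (big-endian, no padding)."""
    values = [record[name] for name, _ in layout]
    return struct.pack(_format(layout, ">"), *values)


def to_network(raw, layout, byte_order):
    """Convert a machine-dependent record to network format in one step."""
    return pack_network(unpack_native(raw, layout, byte_order), layout)
```

The same layout table drives both directions, which mirrors the observation above that structure layout plus slot types is essentially all the extra information the runtime system needs for data.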

To enable thread mobility when threads are executing machine-dependent code, additional compiler support is required. The usual debugging information will typically be sufficient to describe most of the data component of the thread state. Extra compiler support may be necessary to describe the use of temporary values at each bus stop. Compiler support is also necessary to associate program counter values with bus stop numbers. The necessary additional compiler support is similar in extent to the usual debugging information.

To enable thread mobility when the executed code at origin and destination processors may be optimized in different ways, extensive compiler support is required. The backend of the compiler must generate information completely describing the transformations performed during code generation. Also, the backend of the compiler must be tied closely to the runtime system for the purpose of dynamically generating the necessary bridging code.

Whereas the first two system dimensions only require compiler support to enable the runtime system to transform the machine-dependent format into a machine-independent format, the possible existence of codes optimized in different ways requires the runtime system to be able to invoke parts of the compiler at runtime.

3 Implementing Heterogeneous Mobility in the Emerald Prototype

We have an Emerald prototype implementation of our ideas which shows that object and thread mobility is possible among heterogeneous processors, even if the processors operate on machine-dependent data and have machine-dependent thread state. The executed code on all types of processors has, however, been subject to exactly the same optimizations, so the prototype only demonstrates the solution techniques for the first two of the three system dimensions identified in the previous section. The prototype is an extension of the Emerald programming system, which originally supported object and thread mobility among homogeneous processors [JLHB88, Jul89].
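The association between program counter values and bus stop numbers can be pictured as one table per architecture, with the bus stop number acting as the machine-independent name of a program point. The sketch below is only an illustration: the addresses, the architecture names used as keys, and the `BusStopTable` class are invented for this example and do not reflect the tables the Emerald compiler actually emits.

```python
class BusStopTable:
    """Bidirectional map between program counter values and bus stop numbers."""

    def __init__(self, pc_by_stop):
        self.pc_by_stop = dict(pc_by_stop)
        self.stop_by_pc = {pc: stop for stop, pc in self.pc_by_stop.items()}
        if len(self.stop_by_pc) != len(self.pc_by_stop):
            raise ValueError("two bus stops share one program counter value")

    def to_bus_stop(self, pc):
        # Raises KeyError when pc is not a visible program point.
        return self.stop_by_pc[pc]

    def to_pc(self, stop):
        return self.pc_by_stop[stop]


# Hypothetical tables for the same method compiled for two architectures.
# The bus stop numbers agree; only the program counter values differ.
TABLES = {
    "vax": BusStopTable({0: 0x1000, 1: 0x1014, 2: 0x1032}),
    "sparc": BusStopTable({0: 0x4000, 1: 0x4028, 2: 0x4060}),
}


def translate_pc(pc, source, destination):
    """Map a program counter value on one architecture to the other."""
    stop = TABLES[source].to_bus_stop(pc)
    return TABLES[destination].to_pc(stop)
```

For instance, `translate_pc(0x1014, "vax", "sparc")` looks up bus stop 1 in the first table and returns the address recorded for bus stop 1 in the second. A fuller version would also return the per-bus-stop description of live temporaries in the same lookup.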
In the following subsections, we will discuss the goals for our prototype implementation (Section 3.1), the features of the original Emerald implementation that are relevant for this paper (Section 3.2), what changes we had to make to the Emerald compiler (Section 3.3), changes to the compilation process (Section 3.4), and what changes we had to make to the Emerald runtime system (Section 3.5). In the final subsection (Section 3.6) we describe our experience with the prototype implementation.

3.1 Design Goals for Our Implementation

The original goal of the Emerald project was to demonstrate that object and thread mobility was possible without sacrificing the runtime performance obtained by executing native machine code. This goal was achieved. The goal for our prototype is the same, with the addition that object and thread mobility must be possible among heterogeneous processors as well as among homogeneous processors.

For the prototype implementation, we did not want to focus on the performance of the runtime system when performing object and thread mobility. The purpose of the prototype was to prove that native code thread mobility among heterogeneous processors is possible. Also, we did not find it important to retain the existing performance of mobility among homogeneous processors. Previous work has shown that multiple protocols can be used for RPC in a heterogeneous environment to avoid the extra overhead of converting data to a machine-independent format (network format) when performing RPC between homogeneous processors [SC88]. The extra effort to do this was considered unimportant for demonstrating our points.

3.2 Features of the Original Emerald System

The original Emerald system supports both object and thread mobility among homogeneous processors not using distributed shared memory [JLHB88]. The processors are workstations connected in a local network. Fine-grained objects can be extracted from the address space of one processor and moved to another processor.
All activation records describing invocations of methods in the moved objects are moved along with the objects, thereby implementing thread mobility. The original Emerald system supports object and thread mobility on networks of homogeneous workstations of one of the following four types: VAXen running BSD-Unix or Ultrix, Sun-3s running SunOS, HP9000/300s running HP-UX, and Sun SPARC workstations running SunOS.

All data in Emerald consists of objects. Objects may refer to objects on other workstations. It is transparent to the programmer (modulo performance) whether or not a given object resides on the same processor as an object containing a reference to the object. In the implementation, references are object identifiers, OIDs, uniquely identifying objects regardless of their location. The only way threads can share data is by having references to the same objects. Since references are network transparent, threads may move independently of each other.

The Emerald compiler generates so-called templates which describe objects and activation records in sufficient detail to enable the runtime system to perform the necessary pointer swizzling and to update the distributed synchronization data structures when objects are moved from one processor to another. The templates do not distinguish between different forms of simple data, i.e., integers, floating point values, strings, etc. The Emerald calling conventions include callee-saved registers, and the templates for activation records include sufficient information to distinguish registers holding pointer values from registers holding simple data and to find pointers in the callee-saved register area. Only one template is used to describe activation records for invocations of a particular method.
Both registers and slots in the activation record structure may be used to hold values of different types over the lifetime of the activation record, but the compiler ensures that a given slot will only hold either simple data or pointers throughout the lifetime of the activation record. The initial design of the Emerald system allowed for multiple templates for each activation record, each template being valid for a certain range of program counter values. Initial experiments found that multiple templates could be avoided by a combination of careful compiler design and the bus stops technique [Jul89].

Apart from the template information necessary for the runtime system to support mobility among homogeneous processors, the Emerald compiler also generates debugging information for use by a symbolic debugger. The debugging information identifies the exact locations and types of both global object variables and local variables.

Object code is encapsulated in code objects identified by OIDs. Code objects are immutable objects and can therefore be moved to another processor by duplication. Localization and mobility of code objects are performed by the same mechanisms performing localization and mobility of all other types of objects.

The Emerald runtime system only ever sees a restricted set of program counter values. From the runtime system's perspective, the object/user code is responsible for transferring control to the runtime system by system calls. The compiler is responsible for generating code that transfers control to the runtime system when necessary (see footnote 4). Transfer of control is performed by a system call. The only program counter values visible to the runtime system are therefore at method invocations (the program counter values being return addresses stored in activation records), at the bottom of loops, and at system calls in the user code. Thus the original Emerald used a simple version of the bus stop technique.

Footnote 4: An interrupt handler can reset the stack limit pointer to indicate to the user code that control must be transferred to the runtime system. Checks for available stack space are performed by the user code at procedure calls and at the bottom of loops. The code sequence for method invocations must check for stack space availability anyway, so most of the user code polls are free. Roughly the same method is used to implement signals in Standard ML of New Jersey [Rep90].

3.3 Changes to the Emerald Compiler

To enable thread mobility among heterogeneous processors, the compiler must generate information about bus stops, activation record layout, object layout, etc. It is not necessary to make any change to the generated machine code. The visible program counter values in the original Emerald system fulfill all the requirements of bus stops.

A bidirectional mapping between program counter values and bus stop numbers is needed by the runtime system to convert program counter values to bus stop numbers and vice versa. To generate the bus stop mapping, we changed the backend procedures for generating the procedure call sequences and system call sequences to add entries to the mapping.

While the debugging information generated by the Emerald compiler is sufficient to identify the location of all local variables, it does not specify which variables are dead or alive at a given program point. Consequently, it does not specify which of potentially many variables are currently stored in a register or activation record slot shared by multiple variables. Also, the template information does not indicate the number and types of temporary variables live at a given program point. The template information must therefore be augmented, for each bus stop, with information on which variables currently own shared locations and on the number and types of temporary variables in use. The code for adding an entry to the bus stop mapping captures and saves information about the number and types of live temporary variables and which local variables own shared registers or slots in the activation record at that program point.

With one exception, these were the only changes necessary to the Emerald compiler! The VAX processor can perform unlinking of a doubly linked list as an atomic operation. The Motorola processors (used in Sun-3 and HP9000/300 workstations) and SPARC processors cannot perform the unlinking as an atomic operation. As unlinking is used to implement monitors [Hoa74] in Emerald, a system call is required on these processors to ensure the atomicity of the unlinking operation. The bus stops on all types of processors must be isomorphic to each other. We therefore have to add an entry to the bus stop mapping for each unlink instruction on the VAX processor, even though no system call is performed at that point in the code. This bus stop is an exit-only program point, meaning that conversion from the bus stop number to the program counter value may be necessary, but not the other way around. Again, no changes are made to the code to be executed; we only need to generate template information on the side describing the program point.

3.4 Changes to the Emerald Compilation Process

Aside from the changes to the compiler described in Section 3.3, we also made two crude changes to the compilation environment: short-cuts which in a production system would be replaced by more suitable compiler modifications. The problem is that we need to have several compiled versions of a program: one for each architecture. For our prototype, we chose a primitive solution: the programmer simply compiles the program once on each architecture. However, for the implementation to function correctly, it is necessary that the unique object identifiers, OIDs, are the same for all versions of the program.

In a homogeneous processor environment, the same object code can be used on all workstations, and it makes sense to assign object code OIDs.
In a heterogeneous processor environment, only object code compiled for a specific type of workstation can be used. It is therefore no longer sufficient to describe object code with an OID and expect the mobility subsystem to fetch the correct code. It is, however, desirable to retain the OIDs to describe the semantic content of a code object. We want the OIDs to be the same for semantically equivalent code objects generated for different types of workstations. We therefore need to augment OIDs with another mechanism to distinguish code objects. We did not add such a mechanism to our prototype.

In the prototype implementation, the programmer has to manually set the OID counter to ensure synchronization. While this is clearly impractical, automatic synchronization is not necessary to prove that mobility among heterogeneous processors is possible. In a fully functional implementation, the compiler has to maintain synchronization between OIDs for isomorphic object codes for different processors. One possible method for doing so is to use a program database. When a file is compiled, the program is stored in the program database. When it is subsequently compiled on another processor, information from the program database is used to ensure that exactly the same OIDs are used. Storing programs in a program database would also allow the runtime system to automatically invoke the compiler to generate processor-specific object codes in case the programmer had not manually compiled the program for all the used types of workstation.

In our prototype, the programmer manually has to ensure the availability of object codes for all the types of workstations that an object may possibly move to, i.e., compile the program once on each type of workstation. When a node receives an object for which it does not have any code, it searches for the code, first by checking on its disk, thereafter by searching the network.
We use NFS (Sun Network File System [SGK+85]) to create the illusion that the object code always resides in the local disk repository. When the kernel needs object code with a specific OID, it thus gets the correct version from disk instead of from another (potentially heterogeneous) Emerald kernel.

3.5 Changes to the Emerald Runtime Kernel

The changes to the Emerald runtime kernel fall in two different categories: addition of procedures to convert to and from network format, and changes to the marshalling and unmarshalling code.

Conversion of ordinary data structures to and from network format is performed by a set of hand-written conversion routines. The code is not optimized for speed but for ease of maintenance. Composite data structures are converted by recursive descent of the structure. Depending on the processor type, 2 to 3 procedure calls are performed to convert a simple integer value to or from network format.

Conversion of program counter values to and from bus stop numbers is performed using the bus stop tables generated by the compiler. New table lookup routines were necessary to perform the conversion, as we wanted to extract the associated information about local variables and temporaries at the same time.

Marshalling of data structures for transport over the network already existed in the original Emerald system. The marshalling code was instrumented to convert the data structures to and from network format as part of the marshalling process.

An additional layer of marshalling was necessary to convert activation records to and from a machine-independent format. We invented a new activation record format and used that as the machine-independent format. The new activation record format stores all local variables in the activation record rather than in registers. The compiler generates sufficient template information to enable the runtime kernel to convert the machine-dependent activation records to and from the machine-independent activation records. While conceptually simple, this transformation requires a fair amount of work and is easy to get wrong.

A machine-dependent activation record may require more or fewer bytes than the corresponding machine-independent activation record, depending on the nature of the procedure (method) in question. This created an extra problem at the destination processor, as the template and debugging information required us to translate the youngest activation records first. Since we could not know beforehand the size of the machine-dependent activation record stack (thread fragment), we could not perform an initially correct placement of the activation records in the allocated thread stack space. We therefore had to do a relocation of all activation records within the allocated stack space after the conversion to the machine-dependent format.

3.6 Experience with and Performance of the Prototype

The prototype has been implemented. It can move objects and threads among all four types of workstations. The prototype has been tested on a small number of test cases. It is cumbersome to ensure code object OID synchronization manually (as described in Section 3.4). In a production system, this problem is readily solved by a program database.

Our additions to the Emerald system have not affected the code generated by the compiler. Intra-node computations will therefore execute exactly the same instructions (and in the same order) on the original Emerald system and our enhanced Emerald system. The intra-node runtime performance (comparable to that provided by C compilers [Hut87] and on the SPARC architectures at times even better [Mar92]) should therefore in theory be exactly the same on both systems. Measurements on both systems verify this trivially. Thus we have achieved one of our major goals: to provide heterogeneous thread mobility while preserving node-local performance.
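The placement problem described in Section 3.5 can be illustrated with a two-pass sketch: activation records are converted youngest first, and addresses are assigned only once every converted size is known. In this sketch the second pass plays the role of the relocation step. The `convert` callback, the record representation, and the downward-growing stack are assumptions made for the example, not details of the Emerald kernel.

```python
def place_records(independent_records, convert, stack_top):
    """Convert activation records youngest first, then assign final addresses.

    independent_records: machine-independent records, youngest first.
    convert: function mapping one record to its machine-dependent bytes
             (hypothetical; its output size is unknown until it has run).
    stack_top: highest address of the allocated stack space; the stack is
               assumed to grow towards lower addresses.
    Returns a list of (address, data) pairs, youngest first.
    """
    # Pass 1: conversion in the required order; sizes become known here.
    converted = [convert(record) for record in independent_records]

    # Pass 2: the oldest record ends at stack_top, younger ones lie below it.
    placed = []
    address = stack_top
    for data in reversed(converted):
        address -= len(data)
        placed.append((address, data))
    placed.reverse()
    return placed
```

With three records whose converted sizes are 8, 24, and 16 bytes (youngest first) and `stack_top` set to 1000, the oldest record occupies addresses 984 to 999, the middle one 960 to 983, and the youngest 952 to 959.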
In contrast, there is a rather large difference in performance for inter-node computations between the original Emerald system and the enhanced Emerald system. Table 1 shows the relative costs of moving a small thread (13 variables in the fragment being moved) among hosts in the original and the enhanced system. The cost is given for moving a thread from a machine of one architecture to a machine of a different architecture and back, for a total of two thread moves. The time cost for moving a thread X → Y → X is the same as the cost for moving a thread Y → X → Y.

The SPARC machines are SparcStation SLC workstations with 20MHz SPARC processors and 16MB RAM running SunOS. The Sun-3 machine is a Sun3/100 workstation with 16MB RAM running SunOS. We only have one Sun-3 machine left, so we cannot include timings for Sun-3 ↔ Sun-3 thread mobility. We no longer have two identical HP9000/300 workstations, so we have instead used the two we had that were most similar in terms of performance. HP9000/300 1 is an HP Apollo 9000 Series 400 Model 433s with 72MB RAM. It is based on a 33MHz MC68040 processor. HP9000/300 2 is an HP Apollo 9000 Series 300 Model 385 with 64MB RAM. It is based on a 25MHz MC68030 processor. Both workstations are running HP-UX 9.0. We unfortunately lost our last VAX workstation shortly after getting the enhanced Emerald system up and running (see footnote 5). The performance figure for mobility among VAX workstations on the original Emerald system was obtained on VAXstation 2000 workstations running Ultrix 2.1. The workstations are all connected by a 10Mbit/s Ethernet network.

Systems                          Original   Enhanced   Overhead
SPARC ↔ SPARC                    40 ms      63 ms      57%
SPARC ↔ Sun-3                    N/A        122 ms
SPARC ↔ HP9000/300 1             N/A        52 ms
SPARC ↔ HP9000/300 2             N/A        57 ms
SPARC ↔ VAX                      N/A        N/A
Sun-3 ↔ Sun-3                    65 ms      N/A
Sun-3 ↔ HP9000/300 1             N/A        109 ms
Sun-3 ↔ HP9000/300 2             N/A        113 ms
Sun-3 ↔ VAX                      N/A        N/A
HP9000/300 1 ↔ HP9000/300 2      … ms       44 ms      57%
HP9000/300 1 ↔ VAX               N/A        N/A
HP9000/300 1 ↔ VAX               N/A        N/A
VAX ↔ VAX                        79 ms      N/A

VAX ↔ VAX                        48 ms      81 ms      68%

Table 1: Thread Mobility Timings: Time costs of moving a small thread (13 local variables in the fragment being moved) among various types of machines on the original Emerald system and the enhanced Emerald system. Each cost is for moving the thread from one machine to another machine and back for a total of two thread moves. Results marked N/A are not available because our last VAX has died and we have only one Sun-3 left. The last line showing the cost for VAX ↔ VAX thread migration is set apart, as the numbers are for migration of a smaller thread from a different program between more modern VAXen than VAXstation 2000 workstations running an earlier version of our prototype.

We attribute the greater part of the difference in performance to our inefficient implementation of the routines to convert simple data structures between machine and network format. An average of 1 to 2 calls of conversion procedures are performed for each byte being transferred over the network, even when ignoring the cost of converting activation records to and from a machine-independent format. We believe the performance penalty of the enhanced system would be much less if we used more efficient routines for conversion of simple data structures. We have not performed any experiments to clarify how much of the performance penalty is caused by the way the conversion routines are implemented. Therefore, we can only guess that we could reduce the performance penalty by 50% by using more efficient routines.

In summary, we do pay for the ability to do cross-architecture invocations and thread and object mobility, but we retain the full advantage of native code compilation, namely that applications run at full speed locally.
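The percentages in Table 1 can be reproduced from the two timing columns. The short check below assumes that the last column is the relative increase of the enhanced time over the original time, truncated to a whole percent; the truncation is inferred from the numbers and is not stated in the text.

```python
def overhead_percent(original_ms, enhanced_ms):
    """Relative increase of enhanced over original, truncated to whole percent."""
    return (enhanced_ms - original_ms) * 100 // original_ms


# Rows of Table 1 for which both timings are present in the text.
rows = {
    "SPARC <-> SPARC": (40, 63),
    "VAX <-> VAX (row set apart)": (48, 81),
}

for name, (original_ms, enhanced_ms) in rows.items():
    print(f"{name}: {overhead_percent(original_ms, enhanced_ms)}%")
```

Under that assumption the two rows give 57% and 68%, matching the table and the figure of roughly 60% quoted for the cost of heterogeneous thread mobility.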
3.7 Source Code

The source code for our prototype implementation is available on the SOSP '95 CD and via either of the addresses shown in Figure 5. Note that there are two implementations of Emerald: one as described here, which we now call UC-Emerald, and a newer but non-distributed byte-coded version called BC-Emerald, which has a compiler written in Emerald. For further information refer to the addresses listed in Figure 5.

4 Future Work

Our prototype implementation demonstrates that object and thread mobility among heterogeneous processors is possible. Our implementation is, however, inefficient and impractical to use. An obvious next step is of course to perform an optimized implementation, so the true overhead of converting object and thread states to and from network formats can be measured.

Footnote 5: We have unfortunately preserved only preliminary performance figures from when the enhanced system was running on VAX workstations.
