ACE: And/Or-parallel Copying-based Execution of Logic Programs


Gopal Gupta†, Manuel Hermenegildo*, Enrico Pontelli†, and Vítor Santos Costa‡

† Laboratory for Logic, Database, and Advanced Programming, Dept. of Computer Science, New Mexico State University, Las Cruces, NM, USA.
* Facultad de Informática, U. Madrid (UPM), Madrid, Spain.
‡ Dept. of Computer Science, University of Bristol, Bristol, UK.

Abstract

In this paper we present a novel execution model for parallel implementation of logic programs which is capable of exploiting both independent and-parallelism and or-parallelism in an efficient way. This model extends the stack copying approach, which has been successfully applied in the Muse system to implement or-parallelism, by integrating it with proven techniques used to support independent and-parallelism. We show how all solutions to non-deterministic and-parallel goals are found without repetitions. This is done through recomputation as in Prolog (and in various and-parallel systems, like &-Prolog and DDAS), i.e., solutions of and-parallel goals are not shared. We propose a scheme for the efficient management of the address space in a way that is compatible with the apparently incompatible requirements of both and- and or-parallelism. We also show how the full Prolog language, with all its extra-logical features, can be supported in our and-or parallel system so that its sequential semantics is preserved. The resulting system retains the advantages of both purely or-parallel systems as well as purely and-parallel systems. The stack copying scheme together with our proposed memory management scheme can also be used to implement models that combine dependent and-parallelism and or-parallelism, such as Andorra and Prometheus.

1 Introduction

Recently, stack copying has been demonstrated to be a very successful alternative for representing multiple environments in or-parallel execution of logic programs. In this approach, stack frames are explicitly copied from the stack(s) of one processor (footnote 1) to that of another whenever the latter processor needs to share a branch of the or-parallel tree of the former. In practice, by having an identical logical address space for all processors and allocating the stack(s) of each processor in identical locations of this address space, the copying of stack frames can be reduced to copying large contiguous blocks of memory from the address space of one processor to that of the other, an operation which most current architectures perform quite efficiently, without requiring any sort of pointer relocation. The chief advantage of the stack copying approach is that program execution in a single processor is exactly the same as in a sequential system. This considerably simplifies the building of parallel systems from existing sequential systems, as was shown by MUSE [2, 1], which was built using the sequential SICStus Prolog system.

Footnote 1: Throughout the paper we will often refer to the "stack" of a "processor", meaning the memory areas that a computing agent is using.

Similar arguments can also be made for the design of independent and-parallel systems based on program annotation (i.e. using Conditional Graph Expressions) and recomputation of subgoals [7] (i.e. nondeterministic and-parallel subgoals are recomputed and not shared), as proven by the experiences of &-Prolog [16] and DDAS [23]. Briefly, a program annotated for independent and-parallelism contains expressions of the form (<conditions> => literal_1 & ... & literal_n), where literal_1, ..., literal_n will be executed in (and-) parallel only if the <conditions> are satisfied.

A long standing goal of parallel logic programming systems designers has been to obtain more general systems by combining different forms of parallelism into a single framework. In particular, one would expect that independent and-parallelism and or-parallelism, that have been exploited so well in Prolog, could naturally be exploited together. In fact, this is a hard problem, as the difficulties (e.g. supporting full Prolog) faced by several previous proposals [14, 26, 22] show. Recently an abstract model, called the Composition Tree (C-tree) [12], has been designed that allows efficient realization of systems that combine both forms of parallelism while supporting full Prolog. In this paper we design a novel model, a realization of the C-tree, exploiting or- and independent and-parallelism, which subsumes both the stack copying approach (for or-parallelism) and the subgoal recomputation approach (for and-parallelism). The resulting and-or parallel system, called ACE, is in the same category as PEPSys [26], ROPM [22] and the AO-WAM system [14]. However, our system is arguably better than the above systems in many respects, the chief ones being ease of implementation, sequential efficiency, and better support for the full Prolog language, in particular being able to incorporate side-effects in a more elegant way.

These advantages are due to several factors. One of them is that ACE recomputes independent and-parallel goals, rather than sharing their solutions (solution sharing was adopted in all the previously proposed models [14, 26, 22]). Recomputation means that, given a goal a(x) & b(y), where the two subgoals a and b are independent, the solutions for the subgoal b will be recomputed for every solution found for subgoal a. Recomputation has important advantages (they are discussed at length in [12, 25]), and was fundamental in the design of the C-tree model.

In [12] we presented the C-tree, along with a few preliminary ideas on how to realize the C-tree using an environment representation technique based on stack copying as well as binding arrays. In this paper we show how the complete independent and- and or-parallel system based on the C-tree can be constructed using stack copying. ACE subsumes both MUSE [2] and &-Prolog [16] in terms of execution behaviour. One of our aims is to have ACE subsume the performance characteristics of MUSE and &-Prolog as well, namely, their low parallel overhead, their considerable speedups for interesting applications, and their support for the full Prolog language. To accomplish this we need to carefully address the many issues that arise in combining both forms of parallelism. These issues are:

Synchronization between independent and-work and or-work: that is, deciding when the alternatives created by goals working in independent and-parallel should be made available for or-parallel processing. In ACE we lay down a set of sharing requirements that a choice point should satisfy before a processor can pick an or-parallel alternative from it.

Memory management: for or-parallel execution in MUSE the stacks of one processor should not be visible to the other processors (except during copying), while in independent and-parallel execution in &-Prolog the stacks of one processor should be visible to all other processors. In ACE processors are organized into teams to get around these conflicting requirements for or- and independent and-parallelism respectively.

Scheduling: ACE can use the existing schedulers of MUSE and &-Prolog for scheduling or- and independent and-parallelism respectively. However, in addition, it should also balance the amount of resources allocated for exploiting or- and independent and-parallelism.

Efficient implementation of copying: while MUSE copies stacks of a single processor at a time, ACE needs to copy stacks of multiple processors. Therefore, developing optimized copying techniques is even more fundamental for ACE.

Implementation of Prolog's extra-logical features (such as cuts and side-effects): both MUSE and &-Prolog have developed techniques for supporting full Prolog. In ACE we need to extend these techniques to support sequential Prolog semantics in the presence of both or- and independent and-parallelism. Here we can benefit from the principles designed for the C-tree abstraction [10].

In designing the solutions to these problems, our aim is to obtain a full parallel Prolog system that will have low sequential overheads and good parallel speedups. Also, we try to follow the techniques that have been used for &-Prolog and MUSE as much as possible, as they have proven to be effective in practice.

Our perspective so far of ACE is as concretizing the C-tree framework for combining independent and- and or-parallelism using stack copying. Once ACE is fully described, it will be apparent that ACE can be seen in quite a different perspective.
In this new perspective, ACE generalizes the principle of copying, from the copying used in MUSE to obtain or-parallelism between sequential computations, to copying to obtain or-parallelism between and-parallel computations. In the paper we show that this principle, Generalized Copying, not only gives a way to understand ACE, but it applies, and should be useful, to combine or-parallelism with many forms of and-parallelism, such as parallelism between determinate and-goals as exploited in Basic Andorra [6], or with dependent and independent and-parallelism as exploited in DDAS [23].

The paper is organized as follows. We first present the ACE model. Although the C-tree abstraction is implicit to our reasoning, it is not needed for understanding the rest of the paper. We use the stack copying approach to give a more intuitive feel for the model. We then enter the more specific problems of memory management, and how copying can be implemented between stack sets. We give a brief overview of the new scheduling problems that arise, and present and discuss two schemes for the optimization of copying in ACE. We also propose a scheme to support cut and side-effects in ACE. We finally discuss the effectiveness of ACE and show how our scheme can be generalized to dependent and-parallel systems.

Throughout the paper, we assume some familiarity with the implementation of Prolog, &-Prolog, and MUSE. Like in &-Prolog, we assume that programs are annotated to express the and-parallelism using basic Conditional Graph Expressions (CGEs) before execution commences. The &-Prolog parallelization tools [20] will be used to automatically generate such annotations from standard Prolog code. Alternatively, programs can always be annotated by the user (footnote 2).

2 The Stack Copying Approach to And-Or Parallelism

In ACE, the multiple environments that are needed to implement or-parallelism are supported through explicit stack copying. We first summarize the stack copying approach (as used by the MUSE system). In a stack-copying or-parallel system several processors explore different alternatives in the search tree independently (modulo side-effect synchronization). The execution of each processor is identical to a sequential Prolog execution. Whenever a processor P1 exhausts its branch and wants to share work with another processor P2, it selects an untried alternative from a choice point cp in P2's stack. It then copies the entire stack of P2, backtracks up to the choice point cp in order to undo all the conditional bindings made below cp, and starts executing one of the untried alternatives. In this approach, provided there is a mechanism for copying stacks, the only cells that need to be shared during execution are those corresponding to the choice points. Shared choice points are thus copied from P2's private memory to shared memory, where they can be accessed from both P1's and P2's private memory via pointers (footnote 3); these choice points are said to have been made public, following MUSE's terminology.

If we consider the presence of and-parallelism in addition to or-parallelism, then, depending on the actual types of parallelism appearing in the program and the nesting relation between them, a number of cases can be distinguished. The simplest two cases are of course those where the execution is purely or-parallel or purely and-parallel. Trivially, in these situations standard or-parallel and independent and-parallel execution respectively apply, modulo the memory management issues, which will be dealt with later. We next discuss the interesting cases where both forms of parallelism are present in the computation.
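The stack-copying step summarized above can be illustrated with a minimal Python sketch. It is not the MUSE or ACE implementation: the `Worker`, `ChoicePoint`, and `steal` names are hypothetical, the heap is modelled as a list whose indices stand in for addresses, and a single trail records conditional bindings. Under these assumptions the thief copies the victim's stacks verbatim (so stored indices stay valid without relocation), undoes the bindings made below the chosen choice point, and takes one untried alternative from the shared choice point record.

```python
from dataclasses import dataclass, field


@dataclass
class ChoicePoint:
    trail_top: int       # trail length when the choice point was created
    heap_top: int        # heap length when the choice point was created
    alternatives: list   # untried alternatives; this record is shared, not copied


@dataclass
class Worker:
    heap: list = field(default_factory=list)    # cell is None (unbound) or a value
    trail: list = field(default_factory=list)   # heap indices of trailed bindings
    choice_points: list = field(default_factory=list)

    def new_var(self):
        self.heap.append(None)
        return len(self.heap) - 1

    def bind(self, index, value):
        self.heap[index] = value
        self.trail.append(index)   # every binding is trailed, for simplicity

    def push_choice_point(self, alternatives):
        cp = ChoicePoint(len(self.trail), len(self.heap), list(alternatives))
        self.choice_points.append(cp)
        return cp


def steal(thief, victim, cp_pos):
    """Copy the victim's stacks, backtrack to choice point cp_pos, take an alternative."""
    cp = victim.choice_points[cp_pos]
    # Verbatim copy: indices play the role of addresses, so no relocation is needed.
    thief.heap = list(victim.heap)
    thief.trail = list(victim.trail)
    # Choice point records are shared ("public") between thief and victim.
    thief.choice_points = victim.choice_points[:cp_pos + 1]
    # Simulated failure: reset every binding trailed after cp was created,
    # then drop the trail entries and heap cells created below cp.
    for index in thief.trail[cp.trail_top:]:
        thief.heap[index] = None
    del thief.trail[cp.trail_top:]
    del thief.heap[cp.heap_top:]
    return cp.alternatives.pop(0)


# Usage: P2 binds y below a choice point; P1 steals the next alternative.
p2 = Worker()
x = p2.new_var()
p2.bind(x, "root")
w = p2.new_var()
p2.bind(w, ("ref", x))            # a "pointer" to cell x, stored as an index
y = p2.new_var()
cp = p2.push_choice_point(["alt2", "alt3"])
p2.bind(y, "alt1")                # conditional binding made below cp
p1 = Worker()
taken = steal(p1, p2, 0)          # taken == "alt2"; p1.heap[y] is unbound again
```

The stored reference `("ref", 0)` remains meaningful in the copy only because both workers use the same indices for the same cells, which is the role the identical logical address space plays in the text.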
Footnote 2: In &-Prolog unrestricted dependency graphs can be expressed (i.e. more general than those possible with CGEs), by combining the "&" operator and synchronization builtins. However, since such graphs can be handled in a similar way to that given in the description that follows, the discussion will be limited, for simplicity and without loss of generality, to CGEs.

Footnote 3: To be precise, shared choice points are not copied, but a record representing the choice point is created in the shared area.

2.1 And "Under" Or

And "under" or refers to cases where or-parallelism present inside and-parallel goals is not exploited [25]. Thus, only alternatives in those choice points that are not nested inside any CGE, i.e. not created during processing of and-parallel goals, are made available for or-parallel processing. The two cases are: first, the case in which the goal that gives rise to or-parallelism is not preceded by any CGE; and second, the case in which this goal is in the continuation (but not inside) of some CGE.

The first case is illustrated in Figure 1. Consider the tree, shown in Figure 1.(i), that is generated as a result of executing a query a which during its execution calls a clause containing a CGE (true => b(x) & c(y)). In Figure 1.(i) processor P1 has started execution of goal a, left an untried alternative ("embryonic branch") a2, and then entered the CGE. And-parallel execution can remain identical to the standard subgoal recomputation approach (like in &-Prolog), hence a processor P2 can simply pick up the execution of goal c. Or-parallel execution can also remain identical to pure stack copying. If processor P3 wants to pick up the a2 branch left behind by P1, it can simply copy the portion of the tree from the root to the embryonic node, and continue with the untried alternative (Figure 1.(ii)). This resembles a standard stack-copying execution (as in MUSE).
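As a rough illustration of how a CGE such as (true => b(x) & c(y)) behaves, the following Python sketch runs the goals in parallel only when the condition holds. It is not &-Prolog code: `run_cge` is a hypothetical name, each goal is modelled as a zero-argument function that returns its first solution, threads stand in for processors, and the sequential fallback when the condition fails is stated here as an assumption about the intended semantics.

```python
from concurrent.futures import ThreadPoolExecutor


def run_cge(condition, goals):
    """Run the goals of a CGE in parallel if condition() holds, else left to right."""
    if condition():
        # Each goal may be picked up by a different worker, as P2 picks up c above.
        with ThreadPoolExecutor(max_workers=max(1, len(goals))) as pool:
            futures = [pool.submit(goal) for goal in goals]
            # Results are collected in goal order, regardless of completion order.
            return [future.result() for future in futures]
    # Condition not satisfied: fall back to ordinary sequential execution.
    return [goal() for goal in goals]


# Usage: the CGE (true => b(x) & c(y)) with goals that each return one solution.
first_solutions = run_cge(lambda: True, [lambda: "b1", lambda: "c1"])
```

Because the goals are independent, the result is the same in both branches; only the degree of parallelism differs.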
Figure 1.(iii) and Figure 1.(iv) present the second case, when a processor selects an untried alternative from a choice point created during execution of a goal g_j in the body of a goal which occurs after a CGE. In other words, there has been and-parallelism above the selected alternative, but all the and-tasks are finished. A processor selecting such an alternative will have to copy all the stack portions in the branch from the root to the CGE, the portions of stacks corresponding to all the and-tasks inside the CGE, and those of the goals between the end of the CGE and g_j. All these portions have in principle to be copied because the untried alternative may need access to variables in all of them. In Figure 1.(iii), processor P1 started execution of the goal creating a CGE (b & c), and fully executed b. Processor P2 executed the goal c in and-parallel. Both have finished execution of the CGE (leaving no choice points behind), and then processor P1 has taken the continuation and left an untried alternative d2. This alternative can be picked up by another processor P3. The processor P3 has therefore to copy the portion of the tree from the root to the CGE, the portions inside the CGE, and the portion of the continuation up to the embryonic node. The processor P3 can then start execution of the d2 alternative (Figure 1.(iv)).

[Figure 1: And Under Or. Panels (i) to (iv). Legend: branch executed locally; copied branch; embryonic branch (untried alternative); end of and-parallel goal, beginning of execution of continuation of CGE.]

2.2 Or "Under" And

In "Or Under And" the untried alternatives of choice points created within and-parallel goals in CGEs are also made available for or-parallel processing. One could simplify, and disallow or-parallel processing of such alternatives, trying them sequentially via backtracking instead, but there is experimental evidence that a considerable amount of or-parallelism may be lost [25]. Therefore, ACE does support or-under-and parallelism. When an alternative created within and-parallel goals in a CGE is selected, one needs to carefully decide which portions of the stacks to copy. Our guiding principle is the following: copy all branches that would be copied in an equivalent or-parallel (MUSE in this case) execution, and recompute all those branches that would be recomputed in an equivalent pure and-parallel computation. As far as the and-parallel execution is concerned, we want to be as close as possible to the recomputation approach, hence implementing the PWAM "point backtracking" strategy [19] used in &-Prolog. As we will see, our strategy results in copying only parts that &-Prolog reuses during backtracking and recomputing those that &-Prolog (and also MUSE and Prolog) recomputes.

Consider a CGE (true => g_1 & ... & g_i & ... & g_n) that is encountered during execution, and whose goal g_i has an untried alternative in one of the choice points in its search tree. Assume a processor picks up this untried alternative for or-parallel processing. Then this processor will have to copy all the stack portions in the branch from the root to the CGE, including the CGE descriptor (footnote 4) (called C-node in [12] and parcall frame in &-Prolog [18]). It will also have to copy the stack portions corresponding to the goals g_1 ... g_{i-1} (i.e. goals to the left of g_i). The stack portions up to the CGE need to be copied because each different alternative within g_i might produce a different binding for a variable, X, defined in an ancestor goal of the CGE. The stack portions corresponding to goals g_1 through g_{i-1} have to be copied because execution of the goals following the CGE may need to access some of the bindings generated by the goals g_1 ... g_{i-1}. The stack portions corresponding to the goals g_{i+1} ... g_n need not be copied, because these goals would be recomputed.

Footnote 4: The CGE descriptor records the control information for the CGE and its independent and-parallel goals for exploiting and-parallelism.

The issue is further illustrated with a simple example. Figure 2 shows the and-or tree for the query q containing a CGE (true => a(x) & b(y)), each of whose goals leads to two solutions. For the sake of simplicity, we have only shown the path from the root of the tree to the CGE. Execution in ACE begins with processor P1 executing the top level query q. When P1 encounters the CGE, it picks the subgoal a for execution, leaving b for some other processor. Let us assume that processor P2 picks up goal b for execution (Figure 3.(i)). As execution continues P1 finds solution a1 for a, generating a choice point along the way. Likewise, P2 finds solution b1 for b.

[Figure 2: Execution tree with alternatives inside the CGE.]

[Figure 3: Or Under And. Panels (i) to (iv), showing processors P1 to P6 working on copies of the CGE (a & b). Legend: branch executed locally; embryonic branch (untried alternative); copied branch; choice point (branch point).]

Since we allow for full or-parallelism within and-parallel goals, a processor can steal the untried alternative in the choice point created during execution of a by P1. Let us assume that processor P3 steals this alternative, and sets itself up for executing it (before P3 can steal the alternative, P1 has to move the choice point into the shared area). To begin execution of this untried alternative P3 copies the stack of processor P1 (Figure 3.(ii) shows this process; see the index at the bottom of Figure 3 for an explanation of the symbols). P3 then simulates failure to remove conditional bindings made below the choice point, and restarts the goals to its right (i.e. the goal b). Processor P4 picks up the restarted goal b and finds a solution, b1, for it. In the meantime, P3 finds the solution a2 for a (see Figure 3.(ii)). Note that before P3 can commence with the execution of the untried alternative and P4 can execute the restarted goal b, they have to make sure that any conditional bindings made by P1 after the selected choice point have been cleared, as in MUSE, and that any bindings made by P2 while executing b have also been cleared. The latter can be implemented by either (i) P4 copying b from P2 and completely backtracking over it (footnote 5); or (ii) P3 (or P4) getting a copy of the trail stack of P2 and resetting all the variables that appear in it (see later). At this point, two copies of b are being executed in or-parallel, one for each solution of a.
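The effect of recomputation on the example (a & b, two solutions each) can be sketched sequentially in Python. This is only a model of which solutions are produced and how often each goal is started, not of the parallel machinery: `solve_cge` is a hypothetical helper, and each goal is assumed to be a zero-argument generator function that yields its solutions afresh every time it is called.

```python
def solve_cge(goals):
    """Yield every combination of solutions of independent goals by recomputation.

    Each goal is a zero-argument callable returning a fresh iterator of solutions.
    """
    def go(i, partial):
        if i == len(goals):
            yield tuple(partial)
            return
        # Goal i is restarted once for every combination of solutions to its left.
        for solution in goals[i]():
            yield from go(i + 1, partial + [solution])

    yield from go(0, [])


# Usage: count how many times each goal is (re)started.
starts = {"a": 0, "b": 0}


def goal_a():
    starts["a"] += 1
    yield from ("a1", "a2")


def goal_b():
    starts["b"] += 1
    yield from ("b1", "b2")


solutions = list(solve_cge([goal_a, goal_b]))
# solutions == [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
# starts == {"a": 1, "b": 2}: b is recomputed once per solution of a
```

The sketch enumerates the four solutions in sequential backtracking order. In ACE the two copies of b (one per solution of a) would instead run in or-parallel on different processors, but the set of solutions, and the fact that none is repeated, is the same.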
Note that the process of finding the solution b1 for b leaves a choice point behind. The untried alternative in this choice point can be picked up for execution by another processor. This is indeed what is done by processors P5 and P6 for each copy of b that is executing. These processors copy the stack of P2 and P4 respectively, up to the choice point. The stack portions corresponding to goal a are also copied (Figures 3.(iii), 3.(iv)) from processors P1 and P3, respectively. The processors P5 and P6 then proceed to find the solution b2 for b. Note that if there were no processors available to steal the alternative (corresponding to solution b2) from b, then this solution would have been found by processors P2 and P4 (in their respective copies of b) through backtracking as in &-Prolog. The same would apply if no processors were available to steal the alternative from a corresponding to solution a2.

Footnote 5: An optimization could be that P4 chooses not to backtrack over b or recompute it again; rather, P4 simply copies b and reuses it. This optimization is only valid if b has not yet generated a solution (or at least, execution of the continuation of the CGE, which may bind variables conditionally in b, should not have begun). Some problems may also arise with extra-logical predicates in b, and in general only the part before such an extra-logical predicate can be copied into P4.

In the above example, all other operations that are performed during and-parallel execution remain the same as in &-Prolog. Thus, execution of the continuation of the CGE can begin only after at least one solution has been found for all goals in the CGE. Also, backtracking in the CGE takes place just as in &-Prolog, i.e. goals to the right should be completely explored before a processor can backtrack inside a goal to the left.

We place a restriction (called the sharing requirement) on choice points inside a CGE that can be made available for or-parallel processing: given a goal g_i in a CGE, choice points arising in it can be made available for or-parallel processing only if the goals to the left of g_i in that CGE have reached a solution. If the CGE containing g_i is nested inside another CGE, then all goals to the left of the goal leading to the inner CGE should also have found a solution, and so on. Thus, in the example above (Figure 3.(i)), the alternative b2 of b cannot be picked up by any team (footnote 6) until the solution a1 has been found. The sharing requirement serves two purposes: (i) as far as or-parallel scheduling is concerned, it keeps us very close to the scheduling strategy employed by MUSE; (ii) it avoids (a form of) speculative or-parallelism, because if the goal to the left (for which a solution had not been found yet) failed, the work would have gone to waste. We could go one step further, and stipulate that choice points inside CGEs will be made available only if all goals in the CGE have found at least one solution. Although this will keep us closer to &-Prolog and enable us to do a limited form of intelligent backtracking (the kind that is also present in &-Prolog), this will overly restrict the amount of or-parallelism. So this restriction is not adopted, although its lack might result in extra work in some situations. For instance, in the example above (Figure 3), if b were a failing goal, i.e.
a goal without any solutions, then trying multiple alternatives of a in or-parallel would result in wasted work: b's failure would be discovered multiple times since b is recomputed for every alternative of a.

Footnote 6: The idea of viewing a set of workers as a team will be analyzed in detail later on.

3 Memory Management in ACE

One of the main features of stack-copying based or-parallel systems which greatly facilitates stack copying is that each processor has an identical logical memory address space. This enables one processor to copy (part of) the stack of another without relocating any pointer. In the presence of and-parallelism this feature may be hard to ensure, as each goal in a CGE may be executed in and-parallel by a different processor. In other words, as far as and-parallel execution is concerned, all the participating processors should work on separate segments of a common address space, whereas for or-parallel execution each processor should have an identical but independent logical address space (so that stack portions can be copied without any pointer relocation). Thus, the requirements for or- and and-parallelism seem to be antithetical to each other.

The problem can be resolved by dividing all the available processors into teams such that and-parallel work can only be shared between processors of the same team, and or-parallel work can only be shared between teams. All processors in a team thus share the same logical address space, but each team has its own independent logical address space (which must be identical to the address space of all other teams to allow copying without any pointer relocation). To implement and-parallelism, the address space of each team is divided up into k memory segments (as happens in &-Prolog), where k is the maximum number of processors allowed in any given team. The memory segments are numbered from 1 to k. Each processor of the team allocates its stack set (heap, local stacks, trail etc.) in one of the segments.
The sizes of the k different memory segments in the address space of a team are not required to be the same. However, once one team's address space has been divided into segments using some scheme for division, the address spaces of all other teams should be divided into segments in an identical way, so that during copying of stacks no pointer relocation is needed (footnote 7) (Figure 4). Processors belonging to other teams are allowed to join a different team as long as there is a memory segment available for them to allocate their stacks in the new team's address space.

Footnote 7: This constraint may be relaxed quite a bit, since identical division of address space needs to be done only for those teams that will share computation, and then only for the parts that are shared.

[Figure 4: Address Space in Muse. The addressing spaces of Team 1 through Team t are each divided into Segment 1 through Segment m, with processors Proc 1 through Proc p assigned to segments; choice points and parcall frames are marked.]

Consider the simple scenario where a choice point, belonging to a team T1 and outside the scope of any CGE, is picked by a team T2. Let i be the memory segment number in T1 in which this choice point lies. For simplicity, we assume that the root of the Prolog execution tree also lies in memory segment i. T2 will thus copy the stack from the i-th memory segment of T1 into its own i-th memory segment. Since the logical address space of each team is identical and is divided into identical segments, no pointer relocation is needed. Failure is then simulated and the execution of the untried alternative of the stolen choice point begun.

Now consider the more interesting scenario where a choice point, created by a team T1 and which lies within the scope of a CGE, is picked up by a processor in a team T2. Let this CGE be (true => g_1 & ... & g_n) and let g_i be the goal in the CGE whose sub-tree contains the stolen choice point. T2 needs to copy the stack segments corresponding to the computation from the root up to the CGE and the stack segments corresponding to the goals g_1 through g_i. Let us assume these stack segments lie in the memory segments of team T1 numbered i_1, ..., i_k. They will be copied, at the same position, into the memory segments numbered i_1, ..., i_k of team T2 (Section 7 describes a strategy for incremental copying). Failure would then be simulated on g_i. We further need to remove the conditional bindings made during the execution of the goals g_{i+1} ... g_n by team T1. Let i_{k+1}, ... be the stack segments where g_{i+1} ... g_n are executing in team T1. As before, we can either copy the trail stacks of these segments, reinitialize (i.e. mark unbound) all variables that appear in them, and then discard the copied trails, or we can copy the stack segments corresponding to goals g_{i+1} ... g_n themselves into the appropriate memory segments of T2 and then backtrack over them. Once removal of conditional bindings is done, the execution of the untried alternative of the stolen choice point is begun. The execution of the goals g_{i+1} ... g_n is reinitiated (since we are following a recomputation approach) and these can be executed by other processors which are members of the team (some of this recomputation can be avoided, as mentioned earlier).
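The bookkeeping in this scenario can be summarized with a small Python sketch. It is a simplification under stated assumptions, not the ACE algorithm: `StealPlan` and `plan_steal` are hypothetical names, each goal is assumed to occupy exactly one memory segment of T1, and the trail-copying option (rather than copying and backtracking over g_{i+1} ... g_n) is assumed for removing conditional bindings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StealPlan:
    copy_segments: tuple      # T1 segments copied into the same-numbered segments of T2
    untrail_segments: tuple   # T1 segments whose trails are read to reset bindings
    restart_goals: tuple      # 1-based indices of goals recomputed in any free segment


def plan_steal(root_segment, goal_segments, i):
    """Plan the theft of a choice point lying inside goal g_i (1-based) of a CGE.

    root_segment holds the computation from the root up to the CGE;
    goal_segments[j] is the T1 memory segment holding goal g_(j+1).
    """
    n = len(goal_segments)
    if not 1 <= i <= n:
        raise ValueError("i must index a goal of the CGE")
    # Root-to-CGE portion plus g_1..g_i are copied at the same segment numbers.
    copied = [root_segment] + list(goal_segments[:i])
    copy_segments = tuple(dict.fromkeys(copied))              # drop repeats, keep order
    # Bindings made by g_(i+1)..g_n are undone using their trails.
    untrail_segments = tuple(dict.fromkeys(goal_segments[i:]))
    # Those goals are then recomputed by members of the stealing team.
    restart_goals = tuple(range(i + 1, n + 1))
    return StealPlan(copy_segments, untrail_segments, restart_goals)


# Usage: CGE (a & b), with the root and a in segment 1 and b in segment 2 of T1.
steal_from_a = plan_steal(1, [1, 2], 1)   # copy segment 1, untrail segment 2, restart g_2
steal_from_b = plan_steal(1, [1, 2], 2)   # copy segments 1 and 2, nothing to restart
```

Keeping the segment numbers unchanged in `copy_segments` reflects the requirement that copied stacks land at the same positions in T2, whereas the goals listed in `restart_goals` are free to run in any available segment.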
Note that whereas copied stack segments occupy the same memory segments as the original stack segments, restarted goals can be executed in any of the available memory segments (clearly, if T2 decides to copy the computations done by team T1 for goals g_{i+1} through g_n to save recomputation or for untrailing, as mentioned earlier, then the corresponding stack segments will have to be copied into the same memory segments, i.e. i_{k+1} onwards, of T2).

[Figure 5: Illustration of Stack Copying.]

Returning to the earlier example (Figure 3), for execution to proceed as shown there, each pair of processors (P1, P2) and (P3, P4) would have to be in the same team (respectively teams T1 and T2). Each of processors P5 and P6 will also have to be in a separate team (respectively teams T3 and T4). Assuming that P1 starts the execution of query q in memory segment number 1, and P2 starts the execution of b in memory segment number 2 (in the address space of team T1), then P3 would be forced to copy the stack segment corresponding to a into memory segment number 1 of its address space. Assuming that only the trail stack of b is copied (to reset conditional bindings), P4 is free to execute b in any memory segment of T2 (which will be a segment other than memory segment 1, because only one processor in a team can use a memory segment at a time). Suppose P4 has its stacks located in segment 4 of the address space of team T2; then, it will execute b in memory segment number 4. When P5 and P6 steal the alternative corresponding to solution b2, each of them will copy the stack segments corresponding to a to memory segment 1 of their respective address spaces, and the stack segment corresponding to b to memory segments 2 and 4 of their respective address spaces.

The copying of stacks by team T2 from team T1 corresponding to Figures 3.(i) and 3.(ii) is further illustrated in Figure 5. To keep the figure simple only the local stacks are shown. In reality, the heap and the trail will be copied too. Also note that copied choice points are transferred to a shared area to which the choice points in the local address space now point, as in MUSE. The shared memory area is not shown in Figure 5. Note that because goals in a CGE are recomputed, parcall frames and any other structure used to support and-parallelism (such as the various markers used by the PWAM [18]) are copied rather than shared (see Figure 5).

Note also that each memory segment in a team's address space has a complete set of stacks for a processor to work on, corresponding to the "stack set" of &-Prolog [16]. Thus, the segmented memory management proposed can also be viewed as each team having a number of stack sets on which different processors ("agents") can work. This view allows the immediate application of standard memory management techniques developed for independent and-parallelism [17] within each team. This leads to a layering of the parallelism exploitation in ACE: at the lower layer, within each team, the computation is purely and-parallel, as in a group of "stack sets" in &-Prolog to which a number of "agents" are attached; at the higher layer, among the teams, the computation is purely or-parallel (as in MUSE). Thus, it is easy to see that in the presence of only and-parallelism our system would be as efficient as &-Prolog, while in the presence of only or-parallelism it will be as efficient as the MUSE system. Also notice that the amount of stack copying that will be done, in the presence of both and- and or-parallelism, would be identical to that done in the MUSE system provided, of course, that the scheduling strategy is the same. However, the set up time for executing the untried alternative choice points that fall within the scope of a CGE may be different than in MUSE, due to the spreading of the computation across different stacks.
On the other hand, the actual copying operation may turn out to be even faster than in MUSE, since ACE can take advantage of having multiple processors transferring parts of the computation tree in parallel. Note that these differences only appear when both and- and or-parallelism are being exploited simultaneously.

An interesting property of ACE, also related to memory management, is that it adapts quite naturally to a hybrid multiprocessor in which parts of the address space are shared among subsets of processors, as, for example, in a system containing multiple shared-memory multiprocessors connected by a message-passing or broadcast local network [4]. In this kind of system each shared-memory multiprocessor would naturally be a candidate for constituting a team with its own memory address space, and the various teams would then be spread over the different multiprocessors and communicate via the message-passing or broadcast local network. And-parallelism would be exploited within the shared-memory system while or-parallelism would be exploited over the distributed network of these shared-memory systems. The argument above has been based on locality of addressing space issues, but a perhaps even more important factor involved is that of the access time depending on location. It also makes sense from this point of view to keep processors in a team, which communicate more often, within the fast communication area and put different teams, which communicate less often, at a larger distance from the point of view of communication. Similar principles apply, and a similar approach can be taken, for implementing and-or parallel systems on general NUMA (Non-Uniform Memory Access) machines even if they have a global addressing space.

4 Scheduling

The need to schedule work arises at two independent levels: (i) and-parallel work at the level of processors within a team and (ii) or-parallel work at the level of teams.
Thus a processor can steal and-parallel work only from members of its own team, and an idle team can steal an untried alternative from a choice point. This suggests that separate schedulers can be used for managing the and-parallel and or-parallel work respectively. Schedulers developed for &-Prolog and for MUSE can be used for this purpose. For example, and-parallel work can be managed exactly as in the &-Prolog scheduler: idle processors steal available goals from goal-stacks of other processors in their team. Or-parallel work can be managed as in MUSE: idle teams request work from other teams (as in MUSE, it will be convenient to share as much work as possible). A distinction can be made between the public part and the private part of the execution tree: the choice points in the public part have been made sharable, while those in the private part have not been made sharable yet. Execution in a team continues normally as in an and-parallel system (as described above), until the team is interrupted by another that is looking for work. At this point all choice points in the private part that satisfy the sharing requirement are made sharable, or public. The requesting team picks an alternative from one of these choice points for or-parallel processing.

Finally, one has the new problem of balancing the number of teams and the number of processors in each team, in order to fully exploit all the and- and or-parallelism available in a given Prolog program. In order to solve this problem dynamically, processors can migrate from one team to another or start new teams. An idle processor first looks for and-parallel work in its own team. If no and-parallel work is found, it can decide to migrate to another team where there is work, provided (a) it is not the last remaining processor in that team, and (b) there is a free memory segment

in the memory space of the team it joins. If no such team exists it can start a new team of its own, perhaps with idle processors of other teams, provided there is a free address space available for the new team. The new team can now steal or-parallel work from other teams. Some of the 'flexible scheduling' techniques [8] that are being developed for deciding when a processor should switch teams etc. in the Andorra-I system can also be used in ACE.

5 Implementation of ACE

The discussion so far has aimed at providing a general, high-level description of the ACE execution model. In this section we present a number of practical issues which arise in the implementation of the model and propose a number of solutions for efficiently resolving them. The two main issues we study are memory management, in particular how copying only parts of the stack sets between teams can be implemented efficiently, and how full Prolog should be supported in ACE. (Further details on the implementation of ACE, which are not included here for lack of space, can be found in [13].)

5.1 Goal Execution Order in CGEs

Memory management is a complex problem in the implementation of parallel logic programming systems, one that is closely related to scheduling [17]. Memory management is simplified in MUSE because each processor manipulates a separate Prolog stack set. In contrast, in ACE a team manipulates multiple stack sets that may have to be copied when teams fetch work from other teams. Furthermore, depending on and-scheduling, only parts of such stack sets may be needed: the order in which stack frames are pushed on the processor's stack may not obey the order in which they would have been pushed in a sequential Prolog implementation, and thus a stack segment may contain "trapped" stack frames (actually, whole "stack sections") that are not part of the computation surrounding it [17]. As a result of this, when copying stack segments we may copy sections that are unrelated to the branch we need.
We can completely avoid copying these non-relevant parts, but then many small fragments of the stack will have to be copied, making the copying operation somewhat inefficient [13], and, in any case, the hole created by the trapped goal would remain in the copying stacks because copying is address-to-address to avoid pointer relocation. Incremental copying is also made difficult by this potential lack of order. We explain these practical issues in more detail next.

Ideally, we would prefer that a parallel stack-based system implementing Prolog semantics obey the seniority constraint: given two stack frames, f1 and f2, corresponding to two nodes in the Prolog search tree, f1 should be allowed to appear above f2 (there might be other intervening nodes between f1 and f2) if and only if f1 will appear above f2 in the stack in standard sequential execution of Prolog. Thus frames of descendant nodes in the execution tree must appear on top of frames of their ancestors. The reason why this is helpful is that, while backtracking, if we reach a frame f then we know that f is on top of the stack and that the frames corresponding to all its descendants have been backtracked over and reclaimed, thus considerably simplifying memory management. If the seniority constraint is not obeyed, then holes may appear in the stack both in and- and or-parallel systems, and "trapped goals" may appear in and-parallel systems [17].

In fact, the seniority constraint may impose severe constraints on parallel systems. Enforcing this constraint for an independent and-parallel system such as &-Prolog (or ACE) may severely constrain the way and-parallel goals can be scheduled. Given a CGE (true => a & b & c), if a processor picks the goal c for and-parallel processing, then following this constraint will effectively shut it out from picking any goals (after it has finished processing c) to the left of c, or goals from CGEs created during execution of a or b, or goals from ancestor CGEs that are to the left.
The seniority constraint is obviously too severe in this case, and indeed systems such as &-Prolog [16] and Aurora [9] do not obey it. Rather, they let holes (and trapped goals, in the case of &-Prolog) be created in the stack, which will be reclaimed when everything above the hole gets reclaimed (see Figure 6). This creates many problems in ACE, because now when we copy a stack set, we may copy many trapped goals that are not part of the current alternative stolen. These trapped goals may need to be identified before execution can begin in the copying team. This problem and our solutions are further discussed in the next section in the context of techniques for incremental copying of stacks.

(Footnote: Note that goal-recomputation, as used in ACE, actually helps in maintaining seniority constraints, because every time we recompute goals, we execute them on top of existing solved goals that are to the left, thus righting the order somewhat.)

5.2 Incremental Copying in ACE

An optimization that significantly improves the performance of stack-copying or-parallel systems, like MUSE, is incremental copying, i.e., when a processor copies a stack of another, only those parts are copied in which the two processors differ. This is illustrated in Figure 7 (only local stacks are shown). Suppose processor P1 is working on branch 1, and P2 on branch 2. At this point both P1 and P2 have a common stack up to the branch node a (modulo conditional bindings). Suppose now that after exploring branch 2, P2 decides to pick an alternative from P1

(along branch marked 3) in node b. To do so it backtracks up to node a and steals the second alternative from b in P1. Therefore, before P2 can proceed, it needs to create on its stacks the state that existed in P1 at the time the choicepoint corresponding to b was created. To do so it copies P1's stacks.

Figure 6: Trapped Goals in And-parallel Execution. Processors P1, P2 and P3 are processing the and-or tree shown on the left. Processor P3 picks goal d from P1 for and-parallel processing, finds solution d1, and starts helping P2 to solve the goals in its CGE by picking up goal h. The goal d is now trapped under goal h, because it would be backtracked over first. Similarly, k is picked by P3 after finishing h1, which in turn traps h. Note that b&c&d etc. in the stacks denote the Parcall frame for that CGE.

Figure 7: Incremental Copying in MUSE

The copying and restoring of state can be done in three ways [3] (Figure 7):

(i) Total: copy the entire stacks of P1 (everything from the root to the bottom-most node along branch 1), then backtrack until choicepoint b is reached. Thus, the hatched, gray, and black shaded segments of P1's stack in Figure 7 will be copied;

(ii) Incremental: copy only frames below choice point a (those above a are already on P2's stack), then backtrack until choice point b. Thus, only the gray and black segments in P1's stack in Figure 7 will be copied;

(iii) Optimized incremental: copy only the stack segments between choice points b and a, because those above a are already on P2's stack, and those below b are not needed for execution. Thus, only the gray shaded portion of P1's stack in Figure 7 is copied. The exception is that the entire trail stack below a is copied, so that the parts of the trail stack below choice point b can be used for removing conditional bindings.
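As a rough illustration of the three options, the following Python sketch models P1's local stack as a list of frame names, oldest first, and returns the frames that P2 would copy. The function name, the list representation, and the frame labels are assumptions made for this example only; the trail exception noted under option (iii) is not modeled.

```python
def frames_to_copy(stack, a, b, strategy):
    """Return the frames of P1's local stack that P2 would copy.

    stack    -- frame names of P1's stack, oldest (root) first (hypothetical model)
    a        -- choice point up to which P1 and P2 already share the stack
    b        -- choice point from which P2 steals an alternative
    strategy -- "total", "incremental" or "optimized"
    """
    ia, ib = stack.index(a), stack.index(b)
    if ia > ib:
        raise ValueError("b must not be older than a")
    if strategy == "total":
        # Everything from the root to the bottom-most node.
        return list(stack)
    if strategy == "incremental":
        # Only what lies below a; frames up to a are already on P2's stack.
        return stack[ia + 1:]
    if strategy == "optimized":
        # Only the frames below a down to (and including) b.
        return stack[ia + 1:ib + 1]
    raise ValueError(f"unknown strategy: {strategy}")


p1_stack = ["root", "x", "a", "y", "b", "z1", "z2"]
for s in ("total", "incremental", "optimized"):
    print(s, frames_to_copy(p1_stack, "a", "b", s))
```

With this stack, the optimized option copies only y and b, whereas the other two options also copy z1 and z2, which would be backtracked over immediately.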
Clearly, option (i) involves unnecessary copying (experiments on the Sequent Symmetry have shown that for memory chunks larger than 4K the copying time is proportional to the size of the memory chunk being copied [13]), because there are copied parts that are immediately

backtracked over and reclaimed. Option (ii) also does unnecessary copying, unless the black shaded part of the stack in Figure 7 is very small in size. In MUSE the difference between Incremental and Optimized is almost irrelevant, since in most cases there will be hardly anything on the stack below the choice point (assuming the stack is growing downward as in Figure 7) from which the new alternative is taken. This is a consequence of the scheduling policy adopted by MUSE, in which alternatives are always taken from the bottommost choice point (known as dispatching-on-bottommost [2]).

In ACE, however, things are different due to the presence of and-parallelism. Referring again to the and-or tree shown in Figure 6, suppose that a team, T2, was working on alternative g2 of goal g in the inner CGE (which it stole from a team, T1, earlier). It finds a solution, looks for more work, and decides to pick an alternative h2 from node h (corresponding to solution g1 from T1). T2 and T1 have a common stack up to the CGE labeled (g & h & i). The stack frames leading up to choicepoint g are also present in both. Applying the idea of incremental copying, T2 will have to copy the difference between T1 and itself. As before, there are two ways of copying incrementally: (i) blindly copying the difference between corresponding stacks (of different processors of the two teams) onto T2's stacks (Incremental Copying); (ii) copying only those parts which will be useful for T2, i.e. leaving out the parts that will be immediately backtracked over (e.g. the frames corresponding to h1, i, c1 and d1) and copying only the trail for such parts. (Incremental Copying is illustrated in Figure 8.) While in MUSE Incremental copying (rather than Optimized incremental copying) results in very little space being copied that gets immediately backtracked over, in ACE this may not be the case, as ACE supports and-parallelism and follows the sharing requirement to make nodes public.
Consider the following scenario: suppose as before that T2 tries to steal alternative h2 from choicepoint h, and that T1 had not yet found a solution for goal i. In this case, T1 will not make the choice points of all the branches to the right of i public (that is, choice points created from the goal d in the example). This is for two main reasons. Firstly, as mentioned when presenting the sharing requirement, work available from these choice points will be very speculative, as i may yet fail (possibly after computing for a long time) and all the work in copying these branches and picking work from them may therefore be wasted. Secondly, making these choice points public will lead to mixing of public and private parts of the logical search tree (note that they are already mixed in the physical stacks). For instance, if execution of i (or the CGE's continuation) in T1 were to lead to further choice points, they would initially be private and hence not be visible to other teams, although choice points of goals to the right of i (such as d) will be public. Thus, during normal backtracking through a CGE private choice points might be encountered after a processor has backtracked into the public area (or worse yet, if another team steals an alternative from d and later backtracks, it will not see parts of i at all since they were never copied because of being private to T1). This mixing of public and private areas of the logical search tree will thus result in complications in scheduling. Hence, choice points in goals to the right of an incomplete goal in a CGE are never made public. As a consequence of this, Incremental copying will end up copying all the private goals, which may form a large part of the tree, and immediately discarding them (by backtracking over them), which is clearly a waste. Therefore, it makes sense to use Optimized copying for ACE, although it is more complicated to implement.
It is not very clear, on the other hand, which incremental copying approach (Incremental or Optimized) will reduce the synchronization time between teams T1 and T2. However, it is obvious that Incremental copying is the simpler of the two: in the case of Incremental copying T1 has to synchronize for the duration of the copying of the difference, while in the Optimized case we first have to figure out the limits of copying for each processor stack in T1 (which may require a complete traversal of the and-parallel tree computed by T1) and then do the copying. Optimized incremental copying for ACE for the and-or tree shown in Figure 8 is illustrated in Figure 9.

In ACE we propose to use both Incremental and Optimized incremental copying depending on the situation. The following heuristic tries to balance the excessive unnecessary copying (Incremental copying) and the excessive synchronization time (Optimized copying) by dynamically detecting which of the two options may give the best results. If the choicepoint from which the alternative is being stolen is outside the scope of any CGE, or it is in the scope of some CGEs and all these CGEs have found a solution (i.e. each subgoal of each CGE has already found a solution) in the team from which the alternative is being stolen, then Incremental copying will be adopted. Otherwise, Optimized incremental copying will be used. As mentioned, Optimized copying requires traversal of all CGEs in which the choicepoint is nested and obtaining the addresses of input-markers and end-markers [18, 16, 13] for each goal in these CGEs. From these addresses and the information in Parcall frames we determine the useful part of the various stacks to be copied. Finally, note that all processors in a team can cooperate to speed up detecting the areas to be copied and copying of stack segments from one team to another.
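The heuristic just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `CGE` record, its `subgoal_solved` flags, and the function name `choose_copying` are hypothetical stand-ins for the information that would in practice be read from the parcall frames of the team being stolen from.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CGE:
    # One flag per subgoal of the CGE: has the victim team already found
    # a solution for it? (hypothetical stand-in for parcall frame data)
    subgoal_solved: List[bool]


def choose_copying(enclosing_cges: List[CGE]) -> str:
    """Pick the copying scheme for a steal.

    enclosing_cges lists the CGEs in whose scope the choice point lies;
    an empty list means the choice point is outside the scope of any CGE.
    """
    if all(all(cge.subgoal_solved) for cge in enclosing_cges):
        return "incremental"
    return "optimized incremental"


print(choose_copying([]))
print(choose_copying([CGE([True, True, True])]))
print(choose_copying([CGE([True, True]), CGE([True, False, False])]))
```

The first two calls select Incremental copying; the third selects Optimized incremental copying because one enclosing CGE still has unsolved subgoals.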

Figure 8: Incremental Copying in ACE. Team T1 = (P1, P2, P3); Team T2 = (P4, P5, P6). T2, initially positioned at choice point b, steals alternative g2 from g, does blind incremental copying, backtracks over d, c, i, g1, and h (creating a hole in the stack of P6, shown shaded), computes g2, and restarts all goals to the right (h, i, and d). The state of the stacks is shown after all this is done.

Figure 9: Optimized Incremental Copying in ACE. Team T1 = (P1, P2, P3); Team T2 = (P4, P5, P6). Fig. (i): T1 computes along g1. Fig. (ii): T2 steals from g and computes along g2. Fig. (iii): T2 backtracks to g. Fig. (iv): T2 uses optimized incremental copying to pick h2 from T1; only this part is copied (the frames below CGE (g&h&i) up to choicepoint h). Fig. (v): T2 finds solution h2 for h and recomputes i and d. Blind incremental copying, in addition to the copying done here, will also copy the stuff above g in P2 and above h1 in P3, and then backtrack over it. Also, note that useless space (holes) might have some useful information copied into it later (e.g., stack frame h in Fig. (iv)).


5.3 Implementing Side-effects and Cuts in the ACE Model

One advantage of an and-or parallel model that recomputes independent goals is that, since it closely mirrors traditional Prolog execution, it can quite easily support full Prolog, i.e. support the execution of order-sensitive predicates such as side-effect predicates (e.g. read, write, assert, retract, and calls to dynamic predicates) and cut (!). Essentially, a side-effect predicate (sep for brevity) should be executed only after the sep "preceding" it (preceding in the sense of left-to-right, top-to-bottom Prolog order) has finished execution. If the preceding sep has not been executed, the current sep should suspend, and resume after the execution of the preceding sep is over. However, given a sep, determining the sep that "precedes" it is akin to solving the halting problem, and therefore the knowledge that the preceding sep has finished has to be approximated. For example, consider supporting seps in purely or-parallel systems [15]. Here, the preceding sep is assumed to be finished if the or-branch containing it has finished execution. In other words, a sep is executed only when the branch containing it becomes the leftmost in the or-parallel tree. Likewise, in a purely independent and-parallel system, such as &-Prolog, a sep encountered in an independent and-parallel goal g in a CGE C is executed only after all the independent and-parallel goals to the left of g in the CGE C have finished execution. If the CGE C containing the goal g is nested inside a goal h, which is an independent and-parallel goal in another CGE D, then all the independent and-parallel goals in CGE D that are to the left of goal h should have finished, and so on.

We can combine the conditions for executing seps in a purely or-parallel system with those for a purely and-parallel system to generate the conditions for executing a sep in an and-or parallel system such as ACE. Given a CGE (cond => g_1 & ... & g_i & ... & g_n), where we assume that the parallel execution of goal g_i leads to a side-effect, the conditions under which this side-effect will be executed are given below. Note that the goal g_i is being recomputed in response to solutions s_1 ... s_{i-1} that will be found for goals g_1 ... g_{i-1} respectively. Let b_1 ... b_{i-1} be the or-branches in the respective search trees of goals g_1 ... g_{i-1} that lead to these solutions. The conditions are as follows:

(i) The or-branch that contains the sep in the search tree of goal g_i should become leftmost (with respect to the equivalent or-parallel tree).

(ii) The computation of solutions s_1 ... s_{i-1} should have finished; and the or-branches b_1 ... b_{i-1} should be leftmost in the search trees of their respective goals g_1 ... g_{i-1}.

(iii) If the CGE containing g_i is nested inside another CGE then conditions (i) and (ii) must recursively hold for the inner CGE with respect to the outer CGE. If the CGE is not nested inside other CGEs, then the or-branch in which it appears should be leftmost with respect to the root of the whole computation tree.

In the rest of this section we present a concrete technique for determining when a sep's turn for execution has come during and-or parallel execution. The technique makes use of techniques developed for &-Prolog [7, 21, 5], MUSE [3], and Aurora [9]. For simplicity, and without loss of generality, we assume that when a processor reaches a sep it repeatedly performs the above check until it succeeds (thus the processor busy-waits rather than suspends). However, suspension would be used in practice (footnote 12) so that the processor that encountered the sep, rather than busy-waiting and wasting CPU cycles, can do useful work elsewhere.

5.3.1 Side-Effects in ACE

Note that while verifying the above conditions to check if a side-effect can be executed, processors need to access shared choice points recorded in the shared memory (to do the leftmost check). This can be expensive, especially in a non-shared memory or a hybrid multiprocessor system.
One can reduce the number of accesses to shared memory by first requiring a processor that has reached a side-effect to: (a) check if all goals to the left of g_i in the current CGE, and those to the left in all the ancestor CGEs, have produced a solution (first part of condition (ii), and condition (iii)); (b) check if the side-effect is in the leftmost branch, and the solutions to preceding goals in all the CGEs are in leftmost branches (condition (i), second part of condition (ii), and condition (iii)). Note that check (a) does not require access to the shared area; it is performed wholly within the address space of the team executing the side-effect. Check (b) will be made only after check (a) succeeds, thus reducing the number of accesses to the shared area. The above decomposition also neatly separates the and-parallel and the or-parallel components of the check. Both checks (a) and (b) must be implemented efficiently, particularly check (a) since it is going to be performed more often. (Indeed, check (a) can be implemented quite efficiently since the appropriate information about the status of and-parallel goals is maintained in the CGE's descriptor, and therefore performing check (a) involves a simple look-up of the corresponding parcall frame(s).)

Footnote 12: Implementation of suspension does not present problems in &-Prolog. Techniques for implementing suspension more efficiently in MUSE, by storing the difference between the suspended branch and the one that the processor switches to, have recently also been developed by the MUSE group. These techniques can be adapted for ACE.
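The ordering of the two checks can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the ACE implementation: `left_goals_solved` stands for the status flags a team would read from its own parcall frames, and `SharedAreaProbe` is a hypothetical object that counts how often the shared area is consulted for the leftmost test.

```python
class SharedAreaProbe:
    """Hypothetical stand-in for the leftmost test on shared choice points."""

    def __init__(self, leftmost):
        self.leftmost = leftmost
        self.accesses = 0  # number of times the shared area was consulted

    def __call__(self):
        self.accesses += 1
        return self.leftmost


def may_execute_side_effect(left_goals_solved, is_leftmost_in_shared_tree):
    # Check (a): local to the team, uses only parcall frame information.
    if not all(left_goals_solved):
        return False
    # Check (b): performed only after (a) succeeds, touches the shared area.
    return is_leftmost_in_shared_tree()


probe = SharedAreaProbe(leftmost=True)
print(may_execute_side_effect([True, False], probe), probe.accesses)
print(may_execute_side_effect([True, True], probe), probe.accesses)
```

In the first call a goal to the left is still unsolved, so the shared area is never consulted; only the second call reaches check (b).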

The presence of the sharing requirements allows us to separate the side-effect checks for or- and and-parallelism in a different way. In fact the sharing requirements guarantee that all the branches to the left of a public choice point are complete (otherwise the choice point would not satisfy the requirements during the sharing operation). Because of this we need not perform check (a) in the public part of the tree. Furthermore, in the private part of the tree check (b) is unnecessary since no sharing operations have been performed (the side effect is certainly in the leftmost branch). Thanks to these observations, if P is the bottommost public node in the current branch, then we can organize the side-effect check as follows: (1) apply check (a) only to the subtree rooted at P; (2) apply check (b) only to the public part of the tree above P.

Two main algorithms have been proposed to handle side-effects in independent and-parallel systems (like &-Prolog): synchronization blocks [7, 21], and visiting each ancestor CGE and checking if goals to the left have finished [5]. Either one of these can be used for performing check (a). To check if a given node is in the leftmost branch of a given subtree, we need access to the left sibling nodes of the immediate ancestor choice point nodes (given a node, if the choice point node above it doesn't have any left siblings, the node is in the leftmost branch of the subtree rooted at that choice point). However, the sibling nodes of a choice point are not directly accessible to a team doing the check, therefore we have to use some other technique to determine this. The technique that we use parallels the technique proposed for MUSE [3]. We use the fact that part of the choice point in ACE is shared, and hence the fields in the shared part of the choice point are visible to all teams. Each shared choice point in ACE includes a teamsbitmap (from MUSE's workersbitmap). The teamsbitmap indicates which teams are exploiting alternatives of that node.
When the i-th alternative is picked by a team from a node, the i-th bit in the teamsbitmap is set. When the subtree corresponding to the i-th alternative has been completely explored and backtracked through, the i-th bit is reset. In the alt-number field in the private part of the choice point, a team also records the alternative number which it picked from this choice point. Note that the alt-number field will occupy the same memory address in the address space of each team that is working below this choice point. The algorithm for verifying leftmostness is thus as follows: the team goes up the execution tree; whenever it reaches a shared choice point it looks at the corresponding teamsbitmap; if there are other teams that are working on the alternatives of this node, the corresponding images of the choice point are checked in the address spaces of these teams to see if the current branch is leftmost. This is done by a simple arithmetic comparison of the alt-number fields in these choice points. Note that while checking for leftmostness of side-effect goals and solutions of goals to the left in a CGE etc. we are only concerned with determining leftmostness of nodes in the subtree of a goal in the CGE (local leftmostness), and not in the whole program search tree (global leftmostness).

Several optimizations are necessary to make this algorithm efficient. Firstly, if two teams share a choicepoint N, they will also share all ancestor nodes. Thus, one needs to compare two teams only once, for the youngest node they share. Secondly, as in the Aurora schedulers (and proposed as an optimization in MUSE), one can keep track of the current node up to where a worker or team is leftmost.
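The basic leftmost check (without the optimizations) can be sketched as follows. This Python model rests on several assumptions that are not taken from the paper: the bitmap is indexed by team number, the per-team alt-number fields (which in ACE live in each team's own address space) are collected in one dictionary for simplicity, and the names `SharedChoicePoint`, `pick`, `done` and `is_leftmost` are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SharedChoicePoint:
    teams_bitmap: int = 0  # bit t set: team t is working below this node (assumed indexing)
    alt_number: Dict[int, int] = field(default_factory=dict)  # team -> alternative it picked

    def pick(self, team: int, alt: int) -> None:
        self.teams_bitmap |= 1 << team
        self.alt_number[team] = alt

    def done(self, team: int) -> None:
        # The team has fully explored and backtracked through its alternative.
        self.teams_bitmap &= ~(1 << team)
        self.alt_number.pop(team, None)


def is_leftmost(team: int, branch: List[SharedChoicePoint]) -> bool:
    """Walk up the shared choice points of the team's current branch."""
    for cp in branch:
        mine = cp.alt_number[team]
        for other, alt in cp.alt_number.items():
            active = (cp.teams_bitmap >> other) & 1
            if other != team and active and alt < mine:
                return False  # another team is working on an alternative to the left
    return True


cp = SharedChoicePoint()
cp.pick(team=1, alt=1)
cp.pick(team=2, alt=2)
print(is_leftmost(1, [cp]), is_leftmost(2, [cp]))
cp.done(1)
print(is_leftmost(2, [cp]))
```

Team 2 is not leftmost while team 1 is still exploring an earlier alternative of the same choice point, and becomes leftmost once team 1 has backtracked out of it.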
Finally, we can completely avoid accessing any remote choice point by storing in each shared frame a bitmap which indicates for each alternative in the choice point whether there is at least one active team working on that alternative.

5.3.2 Implementing Cut in the ACE Model

The effect of a cut is to prune all branches to the right of the path from the place where the cut is executed to the node where the clause containing the cut was introduced (cut level). Hence, because a cut can only cut up to the current CGE, condition (iii) is always trivially satisfied. The treatment of cuts is similar to that of side-effects except that in the case of cut some action can be taken (i.e. some pruning can be done) without the cut becoming leftmost in the entire tree [15]. Basically, a cut can be immediately exercised in the subtree in which that cut is leftmost. Other branches can be pruned only after the cut becomes leftmost in the entire tree. Thus, in ACE when a cut is encountered pruning can be immediately done up to the point where conditions (i) and (ii) above succeed. To prune other parts the team has to wait until conditions (i) and (ii) are satisfied right up to the root node, i.e. the cut becomes globally leftmost. Note that pruning a choice point consists of clearing that choice point and signaling any teams exploring alternatives to the right to terminate execution. Termination of execution by a team means that all the processors in the team abandon execution and backtrack. The efficient techniques used to deal with cuts in or-parallel systems (like those of MUSE [3] and Aurora [9]) can be adapted to ACE.

6 Efficiency and Generality of the ACE Model

We believe that an implementation of the ACE model will be quite an efficient realization of an or-

and independent and-parallel system. This is primarily because, as may already be evident, in the presence of only or-parallelism ACE will be as efficient as MUSE, while in the presence of only independent and-parallelism it will be as efficient as &-Prolog. Therefore, it appears clear that having an ACE system would be at least as powerful and efficient as having both a MUSE and an &-Prolog system, in the sense that now a single system will run or-parallel-only programs and and-parallel-only programs with performance similar to that of the MUSE and &-Prolog systems respectively. ACE should also combine speedups for programs where both or- and independent and-parallelism are available, hence performing even better than the best of MUSE or &-Prolog for such applications.

Note that with respect to MUSE, the parts that are copied in and-or parallel execution in ACE for a given program are exactly those that will be copied by MUSE in an equivalent purely or-parallel execution of the same program, but, whereas MUSE will copy one large stack segment at any given time, by exploiting independent and-parallelism, ACE may spread this segment over many memory segments in the address space of the team. This may in principle add some overhead to the copying cost (since many small segments rather than one large segment may have to be copied). However, because each team has multiple processors, the copying of multiple segments can be done in parallel. With respect to &-Prolog, ACE does not introduce any new overheads. The only inefficiency present in the ACE model is with respect to memory consumption, but that cannot be avoided if we want to use stack-copying for representation of multiple environments. Given that memory is inexpensive, we hope that this will not be such a big bottleneck.
Another important point that should be noted is that the approach outlined in this paper for implementing and-or parallel systems, while presented in terms of combining the types of parallelism present in MUSE and &-Prolog, is actually quite general, and can be applied to implement other systems that exploit and- and or-parallelism, such as Andorra-I [6], Prometheus [24], and IDIOM [11]. It is quite easy to see how Andorra-I, a system that exploits or-parallelism and determinate dependent and-parallelism, can be implemented using stack-copying (the implementation of Andorra-I by Yang, Santos Costa, and Warren is based on binding arrays). In Andorra-I there is no or-parallelism within and-parallel goals since only deterministic goals can be processed in and-parallel (thus it reduces to the case described in section 2), so and-parallel execution can be performed by each team locally. Or-parallelism will be implemented using stack-copying and the memory management scheme described above. Likewise, Prometheus [24], a system that exploits or-parallelism and non-determinate dependent and-parallelism (with no coroutining) by extending CGEs, can be easily implemented using the ACE scheme. In fact, since the DAS-WAM abstract machine on which Prometheus is based is itself based on that of &-Prolog, no extra measures need to be taken apart from those needed to support dependent and-parallelism, which are for the most part orthogonal to the issues dealt with by ACE. IDIOM, which adds independent and-parallelism to Andorra-I, can also be implemented using the ACE approach. Its implementation can be thought of as a combination of the ACE and Andorra-I implementations, and, again, is straightforward to derive.

7 Conclusions

In this paper, we presented ACE, a model capable of exploiting both non-deterministic and-parallelism and or-parallelism. We have discussed both high-level and low-level implementation issues and shown how, using recomputation, the scheme can incorporate side-effects and support Prolog as the user language easily.
We have shown how ACE subsumes two of the most successful approaches for exploiting parallelism in logic programming (MUSE and &-Prolog). We have argued how the resulting system has a good potential for low sequential overhead, can be implemented in a reasonably easy way by extending existing systems, and retains the advantages of both purely or-parallel systems as well as (even non-deterministic) purely and-parallel systems. A collaborative implementation of ACE on Sequent and other multiprocessors is under way at New Mexico State University and University of Madrid (UPM).

References

[1] K.A.M. Ali. Or-parallel Execution of Prolog on the BC-Machine. In Fifth International Con-
