OASys: An AND/OR Parallel Logic Programming System


I. Vlahavas (1), P. Kefalas (2) and C. Halatsis (3)
(1) Dept. of Informatics, Aristotle Univ. of Thessaloniki, Thessaloniki, Greece
(2) Dept. of Computer Science, City College, Affiliated Institution of the University of Sheffield, Thessaloniki, Greece
(3) Dept. of Informatics, University of Athens, Athens, Greece

Abstract - OASys is a software system designed for AND/OR-parallel execution of logic programs. In order to combine these two types of parallelism, OASys considers each alternative path as a totally independent computation (leading to OR-parallelism) which consists of a conjunction of determinate subgoals (leading to AND-parallelism). This computation model is motivated by the need to eliminate communication between processing elements. OASys aims towards a distributed memory architecture in which the processing elements performing the OR-parallel computation possess their own address space, while other simple processing units are assigned the AND-parallel computation and share the same address space. OASys execution is based on distributed scheduling which allows either recomputation of paths or environment copying. We discuss the OASys execution scheme in detail and demonstrate its effectiveness by presenting results obtained by a prototype implementation running on a network of workstations. The results show that the speedup obtained by AND/OR-parallelism is greater than the speedups obtained by exploiting AND- or OR-parallelism alone. In addition, comparative performance measurements show that copying has a minor advantage over recomputation. Keywords: Prolog Implementation, AND/OR parallelism, Copying, Recomputation

1. INTRODUCTION
Exploitation of parallelism in logic programs has drawn much attention from the logic programming community. The aim is to exploit the inherent parallelism embodied in the declarative semantics of logic programs. Various methods have been implemented to enable parallel execution of Prolog programs without affecting the language semantics. The procedural semantics of sequential Prolog is described by a fixed execution sequence of the program clauses, i.e. a top-down, left-to-right traversal of the AND/OR tree built while trying to reduce a given query. The Prolog execution strategy uses chronological backtracking in order to explore all possible ways of reducing a goal. The principal idea behind parallel implementations of Prolog is to traverse the execution AND/OR tree in a parallel mode. Alternative paths are independent since they represent different ways of solving a goal and therefore can be searched simultaneously (OR-parallelism). In order to solve a conjunction of goals, the different paths starting from each goal can be independently searched (AND-parallelism) as long as there exists a reconciliation method to resolve conflicting bindings. During the last decade, research has mainly concentrated on exploiting either OR-parallelism [1, 16, 14] or various forms of AND-parallelism [11, 13, 17], reporting promising results. In practice, the implementation of a full AND/OR Prolog system is too complex to be efficient. There have been efforts to combine both AND- and OR-parallelism in such a way that each type of parallelism is exploited only when it is efficient to do so [25]. This paper describes a different approach to implementing full AND/OR-parallelism in an experimental system called OASys (Or/And SYStem). OASys efficiently implements OR-parallelism by viewing the search space as many independent and deterministic computations which rarely need to communicate.
In addition, this feature provides the ability to exploit the deterministic paths of the proof in an AND-parallel way. OASys aims towards an architecture in which the processing elements performing the OR-parallel computation possess their own address space, while other simple processing units are assigned the AND-parallel computation and share the same address space. Communication is limited only to

scheduling operations, i.e. the allocation of work to available processing elements. The latter is accomplished in two different ways: recomputing the alternative path or copying the computation state. In the next section, we briefly describe the problems of parallelising logic programs as well as the main approaches to implementing AND/OR-parallelism. Section 3 presents the OASys computation model and section 4 its proposed architecture. Section 5 is devoted to performance results. Finally, we conclude with section 6, in which we outline the features of the proposed model and the objectives of our further work.
2. IMPLEMENTING PARALLELISM IN LOGIC PROGRAMMING
The main burden in implementing OR-parallelism is the handling of multiple bindings of the same variable produced by different paths. A variety of methods have been implemented:
Processors share the same address space while a binding method records the different bindings in an appropriately defined data structure [5, 24]. In shared environment schemes, the reported overhead is small with respect to sequential Prolog execution. Despite their complexity, such schemes are adopted in many successful implementations [4, 14].
Processors are independent and have their own copy of the working environment. The main advantage of this approach is the locality of reference. However, there is an extra overhead due to environment copying between parent and child processes, although in some cases this has proved to be faster than a shared binding scheme [2].
Processors view the search space as totally independent computations. Therefore each processor computes each path from the root in a sequential Prolog-like manner without having to communicate with the others at all. This implies recomputation of certain parts of the search space but avoids the need either to share variable bindings or to copy environments. It is argued that this method can perform well in highly non-deterministic programs where the previous

two schemes require excessive amounts of communication [3, 7, 8, 12, 15]. However, using the recomputation method it is difficult to cope with the cut and side-effect predicates of Prolog.
In AND-parallelism the problem has been the handling of shared variables between subgoals. The main approaches to implementing AND-parallel systems have been the following:
Processors work in a producer-consumer fashion, communicating through the shared variables. This approach, called Stream AND-parallelism, is adopted by the Committed Choice Non-Deterministic languages [6, 18]. However, the language issues introduced for the sake of efficiency sacrifice the semantics of Prolog to a great extent.
Processors work in parallel only when the subgoals do not share variables at all (independent AND-parallelism) [13]. It is, however, difficult to establish the independence of variables in goals, and extra overhead is introduced by the static or dynamic analysis needed to identify such goals.
Processors work in parallel even if they share variables (dependent AND-parallelism) [17]. The main overhead introduced in this case is the reconciliation of bindings produced by different processes.
Full AND/OR parallelism is difficult to implement due to the overheads introduced by both types of parallelism. In practice, it is more efficient to combine OR- with independent AND-parallelism [3, 4, 9]. In Andorra-I [25], a substantial effort has been made to fully combine the two types. The main idea is to delay the execution of non-determinate goals while performing the available determinate computation in an AND-parallel way. When there is no more determinate computation, i.e. all computations are suspended, the leftmost goal is reduced in an OR-parallel way. As in Andorra-I, OASys exploits both OR-parallelism and dependent AND-parallelism. However, OASys does not require any determinacy checking, since the search space is considered as independent and determinate OR-parallel branches. This feature allows computation to proceed freely without suspending any non-determinate goals. The OASys approach permits a different and much simpler implementation aiming towards higher performance.
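The multiple-bindings problem that drives these design choices can be made concrete with a toy sketch (plain Python of our own, not OASys code, with invented names): two alternative clauses bind the same variable differently, so each OR-branch explores its own copied environment instead of fighting over a shared slot.

```python
# A toy sketch (ours, not OASys code) of the multiple-bindings problem:
# clauses d(1) and d(2) bind the same variable X differently, so each
# OR-branch gets a private copy of the environment.

def solve_or(alternatives, env):
    """Explore each alternative clause in a private environment copy."""
    branches = []
    for value in alternatives:
        branch_env = dict(env)    # environment copying
        branch_env["X"] = value   # branch-local binding of X
        branches.append(branch_env)
    return branches

# d(1). d(2).  -- two clauses for d(X); Y was bound before the call
envs = solve_or([1, 2], {"Y": 3})
assert [e["X"] for e in envs] == [1, 2]   # no binding conflict
assert all(e["Y"] == 3 for e in envs)     # earlier bindings preserved
```

The recomputation alternative avoids even this copy by re-deriving `env` from the root on each branch, at the cost of repeating work.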

Our previous work [19] proposed a formal model for the resolution of AND/OR-parallelism and suggested its implementation and abstract architecture [20]. In OASys, we improve the execution scheme and eliminate the problems due to centralised control by providing efficient distributed scheduling. In addition, OASys supports two alternative ways of distributing work, namely copying and recomputation, which are determined at compile time.
3. OASYS COMPUTATIONAL MODEL
In Prolog, the search space is normally represented by an AND/OR tree, where AND nodes represent the body goals of a clause and OR nodes represent the alternative clauses matching a body goal. In OASys, the search space can be represented by a tree whose nodes, called U-nodes, denote the unifications between the calling predicates and the heads of the corresponding procedures. A link between two U-nodes indicates a sequence of procedure calls as specified by the program. A link is called successful if the unifications performed in the two U-nodes produce consistent bindings, and unsuccessful otherwise. Figure 1a shows a Prolog program and its corresponding AND/OR tree, while figure 1b shows the OASys search tree of the same Prolog program. In Prolog, a node in the execution tree has already produced the variable bindings through unification. In OASys, unifications in the U-nodes are in progress while new U-nodes are continuously produced, i.e. the U-nodes of a path are produced and executed in an asynchronous and concurrent way. This has the effect that a path containing U-nodes may be speculative, i.e. some of the U-nodes generated may not have existed in the corresponding sequential Prolog execution.
[place of Figure 1]
OASys supports OR-parallelism by searching simultaneously different paths of the search tree. It also supports AND-parallelism by executing in parallel the U-nodes belonging to the same path. Execution of a path terminates when: all links are successful (a solution is found), or one link is unsuccessful (failure in conjunction), or

unification in a U-node fails (mismatch).
Figure 2 shows the links generated for the paths of the search tree of figure 1 and the most general unifiers found for every unification in the U-nodes. The same figure presents the potential for exploiting AND/OR parallelism in OASys by viewing the same tree as three independent computations. The gray shaded areas include U-nodes which also exist in other paths; their meaning is discussed later.
[place of Figure 2]
4. THE OASYS ARCHITECTURE
Following the above computation model, OASys can be implemented by employing a group of Processing Elements (PEs). Each PE is a shared-memory multiprocessor and consists of three main parts: a preprocessor, a scheduler and an engine (figure 3). In the following sections we present the compilation process and the execution scheme, and we discuss some implementation issues.
4.1 Compilation
The source Prolog program is compiled into two codes, namely the Compiled Prolog (CP) code and the Compiled Clauses Dependency (CCD) code [21]. The CP code represents the definitions of clauses and, during the startup phase, is distributed to all PE engines. The CP code is broadly based on the APIM [20]; it includes only one unify instruction and provides a mechanism to implement arithmetic and other built-in operations. The CCD code is produced by abstract interpretation of the Prolog program and, during the startup phase, is distributed to all PE preprocessors. It maps each Prolog clause to its actual address in the CP code together with the index table entry for every body subgoal. For each procedure in the program there exists an index table which contains the addresses of the clauses of that procedure as well as the data types of the first argument of their head predicates. The preprocessor executes the

CCD code and, in combination with the index tables, generates choices when there is more than one clause in the definition of a procedure. Each Prolog clause:
Hi :- B1, B2, ..., Bn
is represented in the CCD code notation as:
Ci: T1(ArgType1), T2(ArgType2), ..., Tn(ArgTypen).
where Ci is the address of the clause in the CCD code, used to derive the actual address of the clause in the CP code, Tj is the address of the index table of Bj, and ArgTypej is the data type of the first argument of Bj.
[place of figure 3]
The CCD code for the N queens program, together with the graph representation of the CCD code, is given in the appendix.
4.2 OASys Execution Scheme
Each PE passes through three phases of operation: preprocessing, scheduling and execution. A PE starts generating choices by executing the CCD code (preprocessing phase) and passes them to the scheduler, which decides whether available work could be shared with other PEs (scheduling phase). The PE's engine works concurrently by consuming the choices passed from the scheduler. The sequence of choices is used as directives to construct the sequence of U-nodes, which in turn are assigned to special processing units (AND Processing Units or APUs) working in parallel (execution phase). The two types of parallelism supported are handled as follows:
OR-parallelism. The different paths of the search tree may be distributed to idle PEs. The scheduler decides whether it will assign the alternative paths to other PEs (OR-parallel execution) or keep them for its own engine (sequential execution).
AND-parallelism. The engine executes the U-nodes of a single path in parallel, i.e. it performs the unifications within a U-node by assigning each node to an APU. This is in effect AND-parallelism because the U-nodes in the same path form the conjunction of predicate calls which may solve the query.
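As a concrete illustration of the CCD notation and choice generation described in section 4.1, the following toy sketch (our Python, with hypothetical names; not OASys code) models a clause entry and a first-argument index table, and shows how a single match makes a call deterministic while multiple matches create a choice point:

```python
# Hypothetical sketch of the CCD clause representation and
# first-argument indexing; all names are ours, not OASys code.

# Clause C2 (a recursive clause) in the form  Ci: T1(ArgType1), ...
ccd = {
    "C1": [],                                           # fact: true.
    "C2": [("T2", "var"), ("T3", "var"), ("T1", "var")],
}

# Index table T1: (first-argument type, clause address) pairs
index_tables = {
    "T1": [("nil", "C1"), ("list", "C2")],
}

def lookup(table, arg_type):
    """Return matching clause addresses; 'var' matches any type."""
    return [c for t, c in index_tables[table]
            if t == arg_type or "var" in (t, arg_type)]

# One match -> the call is deterministic; several -> a choice point.
assert lookup("T1", "list") == ["C2"]
assert lookup("T1", "var") == ["C1", "C2"]
```

In the real system the table entries point into the CP code; here they are plain strings to keep the dispatch logic visible.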

These two types of parallelism are depicted in figure 2. The three different paths are allocated to three distinct PEs, while each U-node of the same path is allocated to a different APU.
The Preprocessor
The preprocessor executes the CCD code, thus generating a sequence of choice points along a path, which are then passed to the engine. As shown previously, a call to a predicate is represented by an index table address together with the type of its first argument. Execution of the call is actually a search in the corresponding index table to find a clause with a matching data type. If only one matches, the call is deterministic. Otherwise, there is a choice point and one of the alternative addresses is sent to the engine, whereas the fate of the others is decided by the scheduler. Note that the preprocessor in OASys finds the sequencing of calls, thus facilitating the indexing operation at run time. In this context, preprocessing differs from the compile-time preprocessing of Andorra-I, which performs determinacy checking.
The Scheduler
At any time, each PE scheduler is aware of the availability of work. This is done by maintaining a table of PE-ids which is updated by broadcasting a single byte indicating whether a PE has unexplored paths to share or no more available work. More analytically, when a PE completes the execution of its current path, successfully or not, it backtracks to its local choice points, if any. When there is no more available work in a PE, it is declared idle and tries to find work by consulting its table. It chooses an entry with available work and makes a request. The PE which is asked to give work locks itself against other requests, simulates failure (dummy backtracking) up to the alternative nearest to the root, transmits the appropriate information, then unlocks and continues executing its current path. A PE's request for work might not be answered by another PE if the latter is locked. Instead of busy waiting, the scheduler tries another entry from its table.
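The availability-table protocol just described can be sketched as follows (a much-simplified single-process model of our own; class and field names are invented, and the "single byte" broadcast becomes a boolean flag):

```python
# Simplified sketch (ours, not OASys code) of distributed scheduling:
# each PE broadcasts its availability, and an idle PE polls its table
# instead of busy-waiting on a locked peer.

class PEScheduler:
    def __init__(self, pe_id):
        self.pe_id = pe_id
        self.table = {}      # pe_id -> True if that PE has work to share
        self.locked = False  # locked while answering another request
        self.work = []       # unexplored alternatives, root-nearest first

    def broadcast(self, peers):
        for p in peers:      # the "single byte" availability update
            p.table[self.pe_id] = bool(self.work)

    def request_work(self, peers):
        for p in peers:
            if self.table.get(p.pe_id) and not p.locked:
                p.locked = True          # lock against other requests
                item = p.work.pop(0)     # alternative nearest the root
                p.locked = False         # unlock, resume own path
                return item
        return None                      # all locked or empty: retry later

pe1, pe2 = PEScheduler(1), PEScheduler(2)
pe2.work = ["alt-near-root", "alt-deeper"]
pe2.broadcast([pe1])
assert pe1.request_work([pe2]) == "alt-near-root"
```

A real PE would also simulate failure up to that alternative before transmitting; the sketch keeps only the table, lock, and root-first policy.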

The current implementation supports two possible ways of distributing work, which are determined at compile time. The PE with available work may either: send the links followed from the root to the available node and force the PE which issued the request to recompute that specific path and continue execution with one alternative, or copy the computation state of the available node and let the PE which issued the request continue execution with one alternative. The effect of both techniques is illustrated in figure 2, where recomputation includes the portions of paths in the gray shaded areas while copying is concerned only with the portions of paths outside those areas. Scheduling is a crucial issue and could be optimised in various ways. For the moment, work that is nearest to the root has priority for distribution. This minimizes the amount of information transmitted. Distribution could also be avoided by pre-compilation detection of small chunks of work or by employing user annotations to indicate parts of the program that should be kept private to a PE. In addition, we are currently experimenting with a more sophisticated scheduler which can decide at run time whether copying or recomputation is more profitable on a specific occasion, taking into consideration the various factors that affect one or the other, such as the amount of information to be transmitted, the estimated communication cost, the recomputation overhead, etc.
The Engine
The engine consists of a work pool, a control unit and a number of special processing units (APUs) which operate in parallel sharing a common memory (figure 4). The work pool accumulates choices supplied by the preprocessor. The control unit executes the machine instructions, constructs U-nodes taking directives from the work pool, and distributes them to the APUs. The APUs receive the addresses of a calling predicate and the head of the corresponding procedure, and perform the unification operation efficiently.
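The engine loop just described, in which the control unit hands the unifications of a path to APUs running in parallel, can be sketched with a thread pool standing in for the APUs (our Python, hypothetical names; the unification itself is stubbed out):

```python
# Sketch (ours, not OASys code) of the engine: unifications of one
# path are dispatched to parallel "APUs" (here, a thread pool), and
# the path is solved only if every link succeeds.

from concurrent.futures import ThreadPoolExecutor

def unify_stub(call, head):
    # placeholder for APU unification; real OASys unifies terms
    return call == head

def run_path(unifications, n_apus=4):
    with ThreadPoolExecutor(max_workers=n_apus) as apus:
        results = list(apus.map(lambda u: unify_stub(*u), unifications))
    return all(results)

assert run_path([("d(1)", "d(1)"), ("e(3)", "e(3)")]) is True
assert run_path([("c(2)", "c(3)")]) is False   # mismatch: path fails
```

The `all(...)` reduction mirrors the termination rule of section 3: one unsuccessful link fails the whole conjunction.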
Parallel execution of APUs requires synchronised access to unbound variables which may also be accessed by other APUs. This does not add considerable complexity since, according to the execution algorithm, unbound variables may be

bound in any order, either left to right as in Prolog, or right to left. The mechanism used for variable bindings is a lock-test-and-set operation. According to this, whenever an APU tries to bind a variable, it initially tests whether the variable is still unbound. If so, the variable is locked and the APU proceeds by setting a value. Otherwise, the existing binding is retrieved and a consistency check takes place.
[place of figure 4]
Heap writing operations are also synchronised, by setting a lock on the starting address of the space available for writing. An APU cannot allocate space on the heap until a heap-writing operation already started by another APU has completed. However, access operations are free up to the locked address. A similar problem has been addressed in [10], which proposes that simple processing units configured as a systolic array be used to speed up parallel logic inferences.
4.3 Implementation Issues
The previously described execution scheme adapts quite naturally to a hybrid multiprocessor in which parts of the address space are shared among subsets of processors, as for example in a distributed (loosely coupled) system containing multiple shared-memory multiprocessors. In such a system each shared-memory multiprocessor would be a natural host for a team of processors sharing a common address space. OASys can also be considered as an SPMD (Single Program Multiple Data) machine, its processors executing the same program on different parts of the data, each proceeding independently of the others inside its own program. Besides the above mentioned ideal architecture, OASys could also be implemented as:
A network of multiprocessor workstations, each of which corresponds to a PE while its processors correspond to the various components of the PE.
A message-passing scalable multiprocessor architecture that provides a shared virtual address space on essentially physically distributed hardware [23]. The address space is truly shared but it is also virtual in the sense that memory is physically distributed across different processors.
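The lock-test-and-set binding step used by the APUs (section 4.2) can be rendered as a short sketch (our Python, not OASys code): test whether the variable is still unbound, set it under a lock, and otherwise fall back to a consistency check against the existing binding.

```python
# Sketch (ours) of lock-test-and-set variable binding: test, lock
# and set if still unbound; otherwise retrieve the binding and
# check it for consistency.

import threading

UNBOUND = object()

class Variable:
    def __init__(self):
        self.value = UNBOUND
        self.lock = threading.Lock()

    def bind(self, value):
        if self.value is UNBOUND:          # initial test
            with self.lock:                # lock before setting
                if self.value is UNBOUND:  # re-test under the lock
                    self.value = value
                    return True
        return self.value == value         # consistency check

v = Variable()
assert v.bind(3)       # first APU binds the variable
assert v.bind(3)       # second APU: retrieval is consistent
assert not v.bind(4)   # conflicting binding: unification fails
```

The double test around the lock is our way of keeping the common already-bound case lock-free; the order of binding attempts does not matter, matching the observation that variables may be bound left to right or right to left.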

OASys could also be emulated on a shared-memory multiprocessor machine by grouping processors into teams such that each team corresponds to a PE and each processor of the team corresponds to one of the components of the PE. The memory should be partitioned into as many independent areas as the number of teams. All processors in a team will share the same logical address space in each area.
5. PERFORMANCE RESULTS
A prototype system for OASys has been developed in C and runs on a network of Sun workstations. We executed a number of benchmark programs, and the results are found to be in accordance with those obtained by a simulation of the system [22]. We have chosen to run the following series of Prolog programs:
nrev500: naive reverse of a list of 500 elements,
merge500: mergesort of a list of 500 elements,
zebra: the who-owns-the-zebra problem,
queens8: the 8-queens problem,
ham: the search for Hamiltonian paths in a graph.
The first two of the above programs exploit only AND-parallelism while the rest exploit both types of parallelism.
5.1 Relative Speed Performance
There are two versions of the prototype system, the sequential and the parallel OASys. In sequential OASys, several aspects concerning synchronization, communication and scheduling have been omitted. Table 1 compares the performance of sequential OASys, and of parallel OASys employing one PE, with other well known implementations of Prolog, namely sequential Andorra-I, Eclipse and Sicstus. All times shown are actual cpu times in seconds, obtained on the same machine (Sun Enterprise 3000). The runtimes of Andorra-I were obtained with preprocessing for determinacy checking.

[place of table 1]
5.2 OR-Parallel Performance
Table 2 shows the speedups measured for 1 to 10 PEs, each with a single APU, using a network with the same number of workstations as the number of PEs. The speedups concern only the OR-parallelism exploited by the benchmark programs. Since the programs are non-deterministic, the speedup is close to linear, as expected. It is worth noticing that the performance of recomputation and copying is similar with a small number of PEs, with copying taking precedence as the number of PEs increases. Although the data transfer in recomputation is 4 to 7 times less than in copying, the above results are explained by the fact that the transfer time in copying is much less than the re-execution time in recomputation. The overall amount of data to be transferred in copying produces a small overhead, since current networks are capable of transferring large amounts of data within a very short time. In addition, copying is efficiently implemented, since each PE has an identical but independent logical address space and the stack portions are copied without relocating any pointers. Bearing in mind that copying is more profitable when work is allocated lower in the tree, while recomputation is more profitable closer to the root, a hybrid scheduling scheme deciding which method is more appropriate on any given occasion is expected to result in better actual performance.
[place of Table 2]
5.3 AND-Parallel Performance
The current prototype system implements AND-parallelism by creating the same number of processes as the number of required APUs in each workstation. Having measured the actual timings, AND-parallel performance is estimated by taking into account the following factors:
the parallel overhead due to variable locking over sequential execution with 1 APU,

the overhead due to book-keeping operations of non-determinate nodes, which demand more processing than deterministic nodes,
the overhead of allocating unifications to APUs, which increases in direct proportion to the number of APUs.
Table 3 shows the estimated speedups for the same benchmark programs, referring to a single PE with 1 to 10 APUs.
[place of Table 3]
Some of the programs, like nrev500 and merge500, show good speedup since they are purely deterministic. Others, like ham and zebra, have long deterministic paths between choice points, so AND-parallelism can be exploited without much overhead. The speedup of queens8 is low and flattens as the number of APUs increases. This is because the search space has short deterministic paths between choice points and therefore more APUs do not contribute to the exploitation of AND-parallelism. On the other hand, for all programs, we do not expect better performance with more than 10 APUs, since the preprocessor's total execution time is 8-15% of the total execution time of an engine with 1 APU. As the number of APUs increases, this percentage increases, the execution times become roughly equal, and the total execution time of a PE tends to the execution time of the preprocessor, i.e. a constant.
5.4 AND/OR-Parallel Performance
Results of combining both types of parallelism are shown in Table 4 for two representative programs. The results were obtained by running queens8 and ham with different numbers of PEs, each one configured with a fixed number of APUs (1 to 6) at every run. The highest performance is reached with 10 PEs of 6 APUs each (i.e. 60 APUs in total).
[place of Table 4]
6. CONCLUSIONS AND FUTURE WORK

In this paper we have described the OASys model for implementing full AND/OR-parallelism. OASys achieves AND/OR-parallelism by assigning each path of the search tree to a different processing element (PE) while at the same time assigning each node of an individual path to a different processing unit (APU). The above scheme adopts a modular design which is easily expandable and highly distributed, thus limiting communication to scheduling operations only. Allocation of work to available processing elements can be achieved either by copying or by recomputing. The OASys software architecture aims towards a distributed system in which the processing elements rarely need to communicate, and it could be implemented in various ways. However, an ideal architecture would be a hybrid multiprocessor containing multiple shared-memory multiprocessors connected by a message-passing network. We have implemented a software prototype in order to investigate the model's performance. By running a series of example programs we showed that the two scheduling methods perform similarly, with a minor precedence for copying. In both cases, the speedup obtained is linear in the number of processing elements. However, there is an upper bound on speeding up the execution, determined by the ratio of the time required to generate the alternative paths to the time of performing the deterministic computations on these paths. We are currently working on an OASys implementation which integrates both scheduling methods during the execution of a program.
This approach aims towards a more sophisticated scheduler which will decide, on any occasion:
which of the two methods (copying or recomputation) to use, depending on the estimated overhead,
whether it is more profitable to keep an alternative path for sequential execution rather than sharing it with other processing elements, depending on the estimated amount of work in this path,
which of the available alternative paths to send, depending on the estimated relative profit of sharing one path compared to the profit of sharing any other.
We believe that performance will increase, since the communication overhead will be further reduced.

ACKNOWLEDGMENTS
We would like to thank D. Xochellis, D. Benis, C. Berberidis and E. Sakellariou for their help with the prototype and the collection of performance data. Many thanks to Rong Yang for the help with Andorra-I, and to the anonymous referees for their constructive comments.
REFERENCES
[1] K. Ali and R. Karlsson. The Muse Or-Parallel Prolog model and its performance. Proc. North American Conf. on Logic Programming, Austin, USA, Oct. 1990.
[2] K. Ali and R. Karlsson. Scheduling Speculative Work in MUSE and Performance Results. Int. J. of Parallel Programming, 21 (6), 1992.
[3] L. Araujo and J.J. Ruz. PDP: Prolog Distributed Processor for Independent AND/OR Parallel Execution of Prolog. Proc. of the 11th ICLP (P. Van Hentenryck ed.), MIT Press, 1994.
[4] U. Baron, J.C. de Kergommeaux, M. Hailperin, M. Ratcliffe, P. Robert, J. Syre and H. Westphal. The parallel ECRC Prolog system PEPSys: An overview and evaluation results. Future Generation Computer Systems '88 Conf., Tokyo, 1988.
[5] A. Ciepielewski and S. Haridi. A formal model for OR-Parallel Execution of Logic Programs. Information Processing, IFIP, 1983.
[6] K.L. Clark and S. Gregory. PARLOG: parallel programming in logic. ACM Transactions on Programming Languages and Systems, 8 (1), 1986.
[7] W.F. Clocksin. Principles of the DelPhi parallel inference machine. The Computer Journal, 30 (5), 1987.
[8] G. Gupta and M. Hermenegildo. And-Or Parallel Prolog: A Recomputation Based Approach. New Generation Computing, 11, 1993.

[9] G. Gupta, M. Hermenegildo, E. Pontelli and V.S. Costa. ACE: And/Or-Parallel Copying-Based Execution of Logic Programs. Proc. of the 11th ICLP (P. Van Hentenryck ed.), MIT Press, 1994.
[10] Y.F. Han and D.J. Evans. Parallel Inference Algorithms for the Connection Method on Systolic Arrays. Int. J. Computer Math., 53, 1994.
[11] M. Hermenegildo and K. Green. &-Prolog and Its Performance: Exploiting Independent And-Parallelism. Intern. Conference on Logic Programming ICLP 90, MIT Press, 1990.
[12] Z. Lin. Self-organizing Task Scheduling for Parallel Execution of Logic Programs. Proc. of the Intern. Conf. on Fifth Generation Computer Systems, ACM, 1992.
[13] Y. Lin and V. Kumar. AND-parallel execution of Logic Programs on a Shared-Memory Multiprocessor. The J. of Logic Programming, 10, 1991.
[14] E. Lusk, R. Butler, T. Disz, R. Olson, R. Overbeek, R. Stevens, D.H.D. Warren, A. Calderwood, P. Szeredi, S. Haridi, P. Brand, M. Carlsson, A. Ciepielewski and B. Hausman. The Aurora OR-parallel Prolog system. New Generation Computing, 7 (2,3), 1990.
[15] S. Mudambi and J. Schimpf. Parallel CLP on Heterogeneous Networks. Proc. of the 11th ICLP (P. Van Hentenryck ed.), MIT Press, 1994.
[16] T.J. Reynolds and P. Kefalas. OR-parallel Prolog and search problems in AI applications. Proc. of the 7th Intern. Conference on Logic Programming (D.H.D. Warren and P. Szeredi eds.), Jerusalem, 1990.
[17] K. Shen. Exploiting And-parallelism in Prolog: the Dynamic Dependent And-parallel Scheme (DDAS). Proc. of the Joint Intern. Conf. and Symp. on Logic Programming, 1992.
[18] K. Ueda. Introduction to Guarded Horn Clauses. ICOT Research Center, TR-209, Tokyo.
[19] I. Vlahavas and P. Kefalas. A Parallel Prolog Resolution Based on Multiple Unifications. Parallel Computing, North-Holland, 18, 1992.
[20] I. Vlahavas and P. Kefalas. The AND/OR Parallel Prolog Machine APIM: Execution Model and Abstract Design. The Journal of Programming Languages, 1, 1993.

[21] I. Vlahavas. Exploiting AND-OR Parallelism in Prolog: The OASys Computational Model and Abstract Architecture. Journal of Systems and Software (to appear).
[22] I. Vlahavas, P. Kefalas, I. Sakellariou and C. Halatsis. The Basic OASys Model: Preliminary Results. Proceedings of the 6th National Conf. on Informatics, Greek Computer Society, Athens, 1997.
[23] D.H.D. Warren and S. Haridi. Data Diffusion Machine - A Scalable Shared Virtual Memory Multiprocessor. Proceedings of the Int. Conf. on Fifth Generation Computer Systems, 1988.
[24] D.S. Warren. Efficient Prolog Memory Management for Flexible Control Strategy. Int. Symp. on Logic Programming, 1984.
[25] R. Yang, T. Beaumont, V. Santos Costa and D.H.D. Warren. Performance of the Compiler-based Andorra-I System. Proc. of the 10th Intern. Conference on Logic Programming (D.S. Warren ed.), MIT Press, 1993.

APPENDIX

Consider the following Prolog program for N queens:

queens([],R,R).                                  % C1
queens([H|T],R,P):-                              % C2
        delete(A,[H|T],L),
        safe(R,A,1),
        queens(L,[A|R],P).

delete(X,[X|Y],Y).                               % C3
delete(X,[Y|Z],[Y|W]):- delete(X,Z,W).           % C4

safe([],_,_).                                    % C5
safe([H|T],U,D):-                                % C6
        H-U =\= D,
        U-H =\= D,
        D1 is D+1,
        safe(T,U,D1).

Given the query for 4 queens:

?- queens([1,2,3,4],K,M).                        % Query

the CCD code, the index tables and the graph representation for the CCD code are given below:

CCD code:
C1: true.
C2: T2(var), T3(var), T1(var).
C3: true.
C4: T2(var).
C5: true.
C6: T3(var).
Query: T1(list).

Index tables:
T1 {queens}: nil -> C1, list -> C2
T2 {delete}: var -> C3, var -> C4
T3 {safe}:   nil -> C5, list -> C6

[Figure: graph representation of the CCD code, linking the queens, delete, safe and solution nodes through their nil, var and list arguments.]
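As a quick cross-check of the appendix program's logic, the following Python sketch (not part of OASys, and only an illustration) enumerates candidate placements the way delete/3 does on backtracking and applies the same diagonal test as safe/3:

```python
from itertools import permutations

def safe(board):
    # board[i] is the column of the queen on row i; this mirrors the
    # Prolog test H-U =\= D, U-H =\= D applied to every pair of rows.
    return all(abs(board[i] - board[j]) != j - i
               for i in range(len(board))
               for j in range(i + 1, len(board)))

def queens(n):
    # Enumerating permutations corresponds to the choice points that
    # delete/3 creates on backtracking in clause C2.
    return [list(p) for p in permutations(range(1, n + 1)) if safe(p)]

print(queens(4))  # -> [[2, 4, 1, 3], [3, 1, 4, 2]], the two 4-queens solutions
```

Each permutation here corresponds to one of the independent, determinate paths that OASys assigns to a processing element; the paths never need to exchange bindings, which is exactly the property the computation model exploits.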

Figure and Table captions

Figure 1: Representation of search trees in Prolog and OASys
Figure 2: Nodes and links in OASys search tree
Figure 3: Overview of the OASys architecture
Figure 4: The OASys Engine configuration

Table 1: Runtimes (in seconds) for OASys and other Prolog systems
Table 2: OR-parallel Speedups
Table 3: AND-parallel Speedups
Table 4: AND/OR-parallel Speedups of the Queens8 and Hamilton path problem

Figure 1: Representation of search trees in Prolog and OASys

The example program and query used in the figure:

q(X,Y):- a(X,Y).
a(X,Y):- b(X,Y), c(Y).
a(X,Y):- d(X), e(Y).
b(1,2).  c(3).  d(1).  d(2).  e(3).
?- q(A,B).

[Figure: (a) the AND/OR tree generated by Prolog, in which the branch through b(1,2) fails at c(2); (b) the OASys search tree, in which each path is a sequence of unifications such as q(A,B)=q(X,Y), a(X,Y)=a(X',Y').]

Figure 2

[Figure: nodes and links in the OASys search tree for the query of Figure 1. Each of the three paths carries the substitutions θ1={A=X, B=Y} and θ2={X=X', Y=Y'}; the first path then computes θ3={X'=1, Y'=2} and fails (PATH FAILED), while the other two compute θ3={X'=1} and θ3={X'=2} respectively, each followed by θ4={Y'=3}, and succeed (PATH SOLVED).]

Figure 3

[Figure: overview of the OASys architecture. Processing elements (PEs), each consisting of a preprocessor, a scheduler and an engine, are connected through the system communication network.]

Figure 4

[Figure: the OASys Engine configuration. A control unit holding the scheduler work pool drives several abstract processing units (APUs), which access memory through a memory interconnection network (MIN).]

Table 1

[Table: runtimes (in seconds) for the benchmarks queens, hamilton, zebra, merge and nrev under sequential and parallel OASys, compared against sequential Eclipse, SICStus and Andorra-I; the numeric entries did not survive transcription.]

Table 2

[Table: OR-parallel speedups for ham, queens and zebra against the number of PEs, under both copying (Copy) and recomputation (Recom.); the numeric entries did not survive transcription.]

Table 3

[Table: AND-parallel speedups for ham, queens8, zebra, nrev500 and merge against the number of APUs; the numeric entries did not survive transcription.]

Table 4

[Table: AND/OR-parallel speedups of the Queens8 and Hamilton path problems against the number of PEs, for configurations of 1, 2, 4 and 6 APUs per PE; the numeric entries did not survive transcription.]


More information

A Parallel Logic Programming Language for PEPSys

A Parallel Logic Programming Language for PEPSys A Parallel Logic Programming Language for PEPSys Michael Rate iffe and Jean-Claude Syre Abstract ECRC, Arabellastr. 1 This paper describes a new parallel Logic Programming language designed to exploit

More information

Parallel Processors. Session 1 Introduction

Parallel Processors. Session 1 Introduction Parallel Processors Session 1 Introduction Applications of Parallel Processors Structural Analysis Weather Forecasting Petroleum Exploration Fusion Energy Research Medical Diagnosis Aerodynamics Simulations

More information

A framework for network modeling in Prolog

A framework for network modeling in Prolog A framework for network modeling in Prolog Zdravko I. Markov Institute of Engineering Cybernetics and Robotics Bulgarian Academy of Sciences Acad.G.Bonchev str. bl.29a f 1113 Sofia, Bulgaria Abstract A

More information

Implementation of Process Networks in Java

Implementation of Process Networks in Java Implementation of Process Networks in Java Richard S, Stevens 1, Marlene Wan, Peggy Laramie, Thomas M. Parks, Edward A. Lee DRAFT: 10 July 1997 Abstract A process network, as described by G. Kahn, is a

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

The Prolog to Mercury transition guide

The Prolog to Mercury transition guide The Prolog to Mercury transition guide Version 14.01.1 Thomas Conway Zoltan Somogyi Fergus Henderson Copyright c 1995 2014 The University of Melbourne. Permission is granted to make and distribute verbatim

More information

Object Query Standards by Andrew E. Wade, Ph.D.

Object Query Standards by Andrew E. Wade, Ph.D. Object Query Standards by Andrew E. Wade, Ph.D. ABSTRACT As object technology is adopted by software systems for analysis and design, language, GUI, and frameworks, the database community also is working

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Computing Approximate Happens-Before Order with Static and Dynamic Analysis

Computing Approximate Happens-Before Order with Static and Dynamic Analysis Department of Distributed and Dependable Systems Technical report no. D3S-TR-2013-06 May 7, 2018 Computing Approximate Happens-Before Order with Static and Dynamic Analysis Pavel Parízek, Pavel Jančík

More information

Global Dataow Analysis of Prolog. Andreas Krall and Thomas Berger. Institut fur Computersprachen. Technische Universitat Wien. Argentinierstrae 8

Global Dataow Analysis of Prolog. Andreas Krall and Thomas Berger. Institut fur Computersprachen. Technische Universitat Wien. Argentinierstrae 8 The VAM AI { an Abstract Machine for Incremental Global Dataow Analysis of Prolog Andreas Krall and Thomas Berger Institut fur Computersprachen Technische Universitat Wien Argentinierstrae 8 A-1040 Wien

More information