

Graph based θ-subsumption algorithms

Charanpal Dhanjal
January 2003
Supervised by Simon Colton and Stephen Muggleton

Table of Contents

1 Introduction
  1.1 Technical Background
    1.1.1 First Order Logic
    1.1.2 Subsumption
    1.1.3 Graph Theory
2 Literature Review
  2.1 Reduction of matching candidates using graph context
  2.2 Reduction of matching candidates using literal context
  2.3 Clique and the general subsumption problem
  2.4 Subsumption between k-local clauses
3 Experimental setup
  3.1 Subsumption code
  3.2 The mesh dataset
  3.3 Preparing the data
  3.4 Subsumption testing
4 Results and Analysis
Conclusions and Future Work
References
  Books, Journal Articles and Conference Papers
  Web References
Glossary
Appendix A: Subsumption Testing Results

Abstract

The efficiency of Inductive Logic Programming (ILP) learners is heavily dependent on the θ-subsumption problem. This problem is well known to be NP-complete for first order logic. However, efficient algorithms can be realised by mapping this problem to graph problems. In particular, subsumption can be mapped both to graph isomorphism and to finding the maximum clique of a substitution graph. Both of these approaches are suggested in [SHW96], and tests on an artificial dataset show that they perform well compared to existing θ-subsumption algorithms. The aim of this project is to rationally reconstruct some of the experiments undertaken in [SHW96]. Time constraints meant we could only implement and make empirical measurements on plain, deterministic and literal context subsumption. The literal context algorithm at depth 2 performs best on positive subsumption examples. For negative examples, deterministic subsumption is the fastest algorithm, with a mean subsumption time of just 20 milliseconds. This result is the only inconsistency with those in [SHW96], although we suspect it is caused by too few negative examples. Interesting future extensions of this work include comparing the maximum clique algorithm with those tested here and integrating the most promising algorithm into existing ILP systems.

1 Introduction

Inductive Logic Programming (ILP) involves the induction of first order Horn clauses from examples and background knowledge. The efficiency of ILP learners is dependent on θ-subsumption [P70] [Rob65] because both finding redundant rules and testing how many examples a certain rule covers involve many subsumption tests. However, θ-subsumption of two Horn clauses is NP-complete in general (see [KL94] for a proof). See Section 1.1 for a definition of θ-subsumption.

Since subsumption is critical to the performance of ILP learners, this project aims to measure the empirical performance of efficient subsumption algorithms in the average case. We aim to rationally reconstruct some of the experiments undertaken in [SHW96], which describes several strategies for improving subsumption performance. Scheffer et al. show that on an artificial dataset all of their subsumption algorithms give an increase in performance compared with plain subsumption.

The graph based algorithms in [SHW96] fall into two main categories: ones that reduce the number of matching candidates using contextual information and ones that map θ-subsumption to finding the maximum clique of a substitution graph. The contextual algorithms are based on mapping θ-subsumption to the graph isomorphism problem. The idea is to use the contextual information of a vertex in one graph to reduce the number of matching vertices in the other graph. (See Section 1.1 for definitions of graph and graph isomorphism.)

Due to time constraints, we were only able to implement and obtain results for a subset of the algorithms in [SHW96]. In particular, we compare the performance of literal context, plain and deterministic subsumption. Time did not allow for the implementation of the maximum clique and graph context subsumption algorithms.

The following subsection acquaints the reader with the background information needed to understand the subsumption algorithms explained in Section 2. Section 3 describes the experimental setup and Section 4 examines the results. The final section gives a summary of the results and directions for further research.

1.1 Technical Background

This subsection defines some important concepts including subsumption, least general generalisation, deterministic subsumption and graph isomorphism.

1.1.1 First Order Logic

Definition 1: Term [M97]
A term is a constant, a variable or a function applied to any terms. E.g. X, height, s(s(0)).

Definition 2: Literal [M97]
A literal¹ is any predicate or its negation applied to any set of terms. Examples include human(john), parent(X, bob), wheels(car, 4).

¹ Note that for literals in clauses other than variable-only datalog clauses, we will represent variables with uppercase letters and constants with lowercase letters.

Definition 3: Clause [M97]
A clause is a disjunction of literals whose variables are universally quantified. We will represent the clause ∀V (L_1 ∨ … ∨ L_n) as the set {L_1, …, L_n}, where V is the set of variables occurring in the literals L_1, …, L_n. For example, ∀X (r(b) ∨ g(a) ∨ X) is a clause and will be represented as {r(b), g(a), X}.

Definition 4: Horn Clause [ML]
A Horn clause is a clause which contains at most one positive literal. A Horn clause which has exactly one positive literal (called a definite clause) can be written in the form H ← (L_1 ∧ … ∧ L_n), where H, L_1, …, L_n are positive literals, H is called the head and L_1, …, L_n are called the body literals of the clause.

Definition 5: Datalog Clause
A datalog clause is a clause such that all literals have arguments restricted to variables or constants (i.e. a datalog clause is function-free). An example of a datalog clause is {r(a, b), f(x)}, whereas {f(x, s(x)), a} is not a datalog clause since the second argument of the literal f(x, s(x)) is a function of x.

Definition 6: Variable-only Datalog Clause
A variable-only datalog clause is a clause such that all literals have arguments restricted to variables.

Definition 7: Substitution [M97]
A substitution is any function that replaces variables by terms. Given a substitution θ and a literal L we write Lθ to denote the result of applying substitution θ to L. A substitution is written as the set {V_1/T_1, …, V_n/T_n}, where V_1, …, V_n are variables, T_1, …, T_n are terms and X/Y means that variable X should be replaced with term Y. We can apply a substitution θ to a clause C, written Cθ, by applying θ to each literal in C.

Example 1: Let C be the clause {f(a), X, g(Y, a)} and θ be the substitution {X/f(x), Y/b}, then the result of applying θ to C is Cθ = {f(a), f(x), g(b, a)}.

Definition 8: Logical Implication
A clause C implies D, written C → D, if D is true whenever C is true. The expression C → D is equivalent to the disjunction ¬C ∨ D. Reverse implication is written with the symbol ←, where D ← C and C → D are equivalent.

Definition 9: Clause Length
The length of a clause C, written |C|, is the number of literals in that clause.
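Definition 7 can be illustrated with a short Python sketch. The representation is an assumption made only for this illustration: a variable is a string starting with an uppercase letter, a constant is any other string, and a compound term or literal is a tuple whose first element is the functor or predicate name. The function names are hypothetical.

```python
def is_var(term):
    # Assumption: variables are strings that start with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def apply_subst(term, theta):
    """Return term with every variable replaced as dictated by theta."""
    if is_var(term):
        return theta.get(term, term)
    if isinstance(term, tuple):
        # Keep the functor (first element) and substitute in the arguments.
        return (term[0],) + tuple(apply_subst(arg, theta) for arg in term[1:])
    return term  # constants are left unchanged


def apply_to_clause(clause, theta):
    """Apply theta to each literal of a clause represented as a set."""
    return {apply_subst(literal, theta) for literal in clause}


# Example 1: C = {f(a), X, g(Y, a)} and theta = {X/f(x), Y/b}
C = {("f", "a"), "X", ("g", "Y", "a")}
theta = {"X": ("f", "x"), "Y": "b"}
print(apply_to_clause(C, theta))
```

The substitution is applied simultaneously: the terms that replace variables are not themselves rewritten, which matches the reading of Example 1.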

1.1.2 Subsumption

Definition 10: θ-subsumption [P70]
A clause C θ-subsumes D, written C ⊢_θ D, if and only if there exists a substitution θ such that Cθ ⊆ D. θ-subsumption is an incomplete, decidable consequence relation.

Example 2: Let C be the clause {f(X, d), g(Y)} and D be the clause {f(d, d), g(A), p(t)}, then C subsumes D under the substitution θ = {X/d, Y/A}.

Definition 11: Length-bounded θ-subsumption [SHW96]
A clause C θ-subsumes D with a length bound if and only if there exists a substitution θ such that Cθ ⊆ D and |C| ≤ |D|.

Next, we define the least general generalisation (LGG) of two clauses as the least general clause that can unify with both clauses. Least general generalisation is also commonly referred to as anti-unification.

Definition 12: Term Least General Generalisation
We say that a term t_1 is more general than a term t_2 if there exists some substitution θ such that t_1 θ = t_2. The least general generalisation of two terms t_1 and t_2 is the term t_lgg such that t_lgg θ_1 = t_1 and t_lgg θ_2 = t_2 for some substitutions θ_1 and θ_2, and there does not exist a less general term with this property.

Example 3: Let t_1 = f(b, t(a), X) and t_2 = f(b, g(a), c), then the least general generalisation t_lgg of t_1 and t_2 is f(b, Y, X). To see why, note that t_lgg {Y/t(a)} = t_1 and t_lgg {Y/g(a), X/c} = t_2, and there does not exist a term t_3 that is less general than t_lgg yet still more general than both t_1 and t_2.

Definition 13: Clause Least General Generalisation (LGG) [ML]
Let C_1 and C_2 be two clauses with least general generalisation L. Then L → C_1 and L → C_2, and for every clause G such that G → C_1 and G → C_2, there exists a substitution θ such that Gθ = L.

Example 4: Let C_1 = {f(X), g(Y), h(Z)} and C_2 = {f(a), g(a), g(c)}, then the least general generalisation of C_1 and C_2 is L = {f(X), g(Y), g(Z)}. L → C_1 can be written as ∀X ∀Y ∀Z (f(X) ∨ g(Y) ∨ g(Z)) → (f(X) ∨ g(Y) ∨ h(Z)). This expression is true if we consider the cases where f(X) is true and false. In the case that f(X) is true, the disjunction C_1 = f(X) ∨ g(Y) ∨ h(Z) must also be true and so L → C_1 is true. If f(X) is false, then L cannot be true since L claims the disjunction f(X) ∨ g(Y) ∨ g(Z) is true for all X. Hence if f(X) is false L is false, and if f(X) is true C_1 is true, and so L → C_1 is always true. To show that L → C_2, i.e. ∀X ∀Y ∀Z (f(X) ∨ g(Y) ∨ g(Z)) → (f(a) ∨ g(a) ∨ g(c)), is true, we can use similar reasoning, substituting f(a) for f(X).

A clause C deterministically subsumes D if, for some ordering of the literals in C, processing these literals sequentially leaves only one literal in D that matches the current literal in C at every stage. Later on, we will see the determinacy concept extended to include contextual information for each literal. This makes it more likely that literals in C will deterministically match in D.

Definition 14: Deterministic subsumption [SHW96]
Let C = c_0 ← {c_i} and D = d_0 ← d_1, …, d_m be Horn clauses. C deterministically θ-subsumes D by θ = θ_0 θ_1 … θ_n if and only if there exists an ordering c_1, …, c_n of the c_i such that for all i, 1 ≤ i ≤ n, there exists exactly one θ_i such that {c_1, …, c_i}θ_0 … θ_i ⊆ D.

Example 5: Let C be the clause a(X) ← b(Y) ∧ c(Z), and D be a(X) ← b(X) ∧ c(A) ∧ d(Z). Then C deterministically subsumes D since at each stage in the subsumption process there is exactly one substitution that is consistent with the previous substitutions. In this case, the substitution θ in Cθ ⊆ D is {Y/X, Z/A} = θ_1 θ_2, with θ_1 = {Y/X} and θ_2 = {Z/A}. At the first substitution, b(Y) can only match with the literal b(X) in D, and at the second substitution c(Z) can only match with the literal c(A).

1.1.3 Graph Theory

The following definitions relate to graph isomorphism and subgraph isomorphism, since these concepts are useful for understanding the algorithms in Sections 2.1 and 2.2.

Definition 15: Graph isomorphism
Let G and H be graphs with vertices V(G) = {u_1, u_2, …, u_n} and V(H) = {v_1, v_2, …, v_m} and edges (u_i, u_j) ∈ E(G) and (v_k, v_l) ∈ E(H). Then G and H are isomorphic iff there is a one-to-one mapping f from V(G) onto V(H) such that:

(x, y) ∈ E(G) ⇔ (f(x), f(y)) ∈ E(H)

Figure 1 shows an example of two isomorphic graphs.

[Figure 1: Two isomorphic graphs, one with vertices 1 to 5 and one with vertices A to E. Under the mapping 1 → B, 2 → D, 3 → A, 4 → E and 5 → C the first graph is equivalent to the second.]
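Definition 15 can be checked mechanically on small graphs. The sketch below is only an illustration: it assumes undirected graphs given as vertex lists and edge lists, tries every one-to-one mapping by brute force, and uses hypothetical function names. The two example graphs are made up for the sketch and are not the graphs of Figure 1.

```python
from itertools import permutations


def is_isomorphism(edges_g, edges_h, f):
    """Check that the bijection f maps the edge set of G exactly onto that of H."""
    mapped = {frozenset((f[x], f[y])) for x, y in edges_g}
    return mapped == {frozenset(edge) for edge in edges_h}


def find_isomorphism(vertices_g, edges_g, vertices_h, edges_h):
    """Return a mapping f satisfying Definition 15, or None if there is none."""
    if len(vertices_g) != len(vertices_h):
        return None
    # Brute force: every permutation of V(H) is a candidate one-to-one mapping.
    for image in permutations(vertices_h):
        f = dict(zip(vertices_g, image))
        if is_isomorphism(edges_g, edges_h, f):
            return f
    return None


# A path 1 - 2 - 3 and a path B - A - C (hypothetical example graphs).
print(find_isomorphism([1, 2, 3], [(1, 2), (2, 3)],
                       ["A", "B", "C"], [("B", "A"), ("A", "C")]))
```

The search tries up to n! mappings, which is why the algorithms reviewed later use contextual information to prune matching candidates instead of enumerating them.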

Definition 16: Subgraph
A graph H is a subgraph of graph G if V(H) ⊆ V(G) and E(H) ⊆ E(G).

Definition 17: Subgraph isomorphism
Let G and H be graphs and K be a subgraph of H. Then if G is isomorphic with K, G is subgraph isomorphic with H.

2 Literature Review

This project is based primarily on the paper "Efficient θ-subsumption based on graph algorithms" [SHW96] by Scheffer et al. They describe two main subsumption strategies: mapping subsumption to the graph isomorphism problem, and finding the maximum clique of a substitution graph.

The complexity of checking whether clause C subsumes D is O(|D|^|C|), but if C deterministically subsumes D, then this property can be tested in at most O(|C|^2 |D|) unification attempts. The following algorithm from [SHW96] tests for deterministic subsumption and additionally invokes plain subsumption on the non-deterministically matching literals in C:

Algorithm 1: Deterministic subsumption [SHW96]
1. For each literal l_1 ∈ C:
   a. If l_1 matches exactly one literal l_2 ∈ D with l_1 µ = l_2, substitute C with µ.
   b. If l_1 cannot be matched with any literal l_2 ∈ D, then C cannot subsume D.
   c. If l_1 matches non-deterministically with literals in D, then add l_1 to C′, the set of non-deterministically matching literals in C.
2. Start with the clause C′ substituted so far and use a backtracking algorithm to test whether C′ θ-subsumes D and hence whether C subsumes D.

Scheffer et al. note that literals in C are rarely deterministically matched in D in the data set they use, and so it is very unlikely that a clause deterministically subsumes another. However, the condition that a literal in C must match at least one literal in D reduces the complexity for cases where C does not subsume D.

As we will see in later sections, deterministic subsumption can be generalised to include contextual information for each literal. If the number of matching candidates in D can be reduced for each literal in C, then it is more likely that C deterministically subsumes D.
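The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of [SHW96] or of this project: clauses are lists of function-free literals written as tuples, strings starting with an uppercase letter in C are variables, every symbol of D is treated as fixed, and the function names are hypothetical. The sketch also repeats the deterministic pass until no further literal matches deterministically, which is one possible reading of "for some ordering" in Definition 14.

```python
def is_var(term):
    # Assumption: variables of C are strings starting with an uppercase letter.
    return term[:1].isupper()


def match(literal, target, theta):
    """Extend theta so that literal maps onto target, or return None."""
    if literal[0] != target[0] or len(literal) != len(target):
        return None
    theta = dict(theta)
    for arg, value in zip(literal[1:], target[1:]):
        if is_var(arg):
            if theta.setdefault(arg, value) != value:
                return None  # clashes with an earlier binding
        elif arg != value:
            return None
    return theta


def plain_subsumes(c, d, theta=None):
    """Backtracking test for C theta-subsuming D; returns theta or None."""
    theta = {} if theta is None else theta
    if not c:
        return theta
    for target in d:
        extended = match(c[0], target, theta)
        if extended is not None:
            result = plain_subsumes(c[1:], d, extended)
            if result is not None:
                return result
    return None


def det_subsumes(c, d):
    """Algorithm 1: deterministic matches first, backtracking on the rest."""
    theta, pending, progress = {}, list(c), True
    while progress:
        progress = False
        for literal in list(pending):
            candidates = [m for m in (match(literal, t, theta) for t in d)
                          if m is not None]
            if not candidates:
                return None              # step 1b: no match, so no subsumption
            if len(candidates) == 1:     # step 1a: exactly one match
                theta = candidates[0]
                pending.remove(literal)
                progress = True
    return plain_subsumes(pending, d, theta)  # step 2 on the remaining literals


# Example 5: C = a(X) <- b(Y), c(Z) and D = a(X) <- b(X), c(A), d(Z)
C = [("a", "X"), ("b", "Y"), ("c", "Z")]
D = [("a", "X"), ("b", "X"), ("c", "A"), ("d", "Z")]
print(det_subsumes(C, D))
```

Carrying the substitution in a dictionary instead of rewriting C avoids any confusion between variables of C and symbols of D that happen to share a name.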
2.1 Reduction of matching candidates using graph context

Wysotzki [WSK81, UW81] proposes a method for solving the graph isomorphism problem by reducing the number of matching candidate vertices using contextual information. Scheffer et al. translate the subsumption problem to graph isomorphism by defining the occurrence graph of a clause.

Definition 18: Occurrence graph [SHW96]
The occurrence graph, G, of a datalog² clause C has edges (l_i, l_j, π_i → π_j) ∈ E(G) iff there is a variable x that occurs in literal l_i at argument position π_i and in l_j at argument position π_j. Note that we will assume that an edge does not occur if l_i = l_j and π_i = π_j, since Scheffer et al.'s examples correspond with this assumption.

Example 6: Let C be the datalog clause {p(a, b), p(b, c), q(x, y, c)}. Then the occurrence graph of C has an edge (p(a, b), p(b, c), 2 → 1) since literals p(a, b) and p(b, c) have variable b at positions 2 and 1 respectively. Similarly, C also has an edge (p(b, c), q(x, y, c), 2 → 3) since c occurs at position 2 in p(b, c) and position 3 in q(x, y, c). See Figure 2 for a diagrammatic representation of this graph.

[Figure 2: The occurrence graph of the datalog clause {p(a, b), p(b, c), q(x, y, c)}. See Example 6.]

The graph context of a literal at depth d is the set of all paths of length d from that literal in the occurrence graph.

Definition 19: Graph context [SHW96]
The context of depth d of a literal l_1 from a datalog clause C, con_gra(l_1, d, C), is the set of paths p_1.π_1 → π_2.p_2 … π_(d-1) → π_d.p_d, iff there exists a set of edges {(l_1, l_2, π_1 → π_2), …, (l_(d-1), l_d, π_(d-1) → π_d)} in the occurrence graph, such that p_i is the predicate symbol of l_i.

Example 7: Let datalog clause C = {p(a, b), p(b, c), q(x, y, c)}. The occurrence graph of this clause has edges {(p(a, b), p(b, c), 2 → 1), (p(b, c), q(x, y, c), 2 → 3)} as shown in Example 6. The context of literal p(a, b) at depth 1 is con_gra(p(a, b), 1, C) = {p 2→1 p} and at depth 2 it is con_gra(p(a, b), 2, C) = {p 2→1 p 2→3 q, p 2→1 p 1→2 p}. The first path of con_gra(p(a, b), 2, C) corresponds to the sequence of literals <p(a, b), p(b, c), q(x, y, c)>, and the second to the sequence <p(a, b), p(b, c), p(a, b)>.

² The definitions given in Sections 2.1 and 2.2 refer to datalog clauses. If a clause C is not a datalog clause then we can compute a datalog clause C′ using the following process. For each literal p(t_1, …, t_n) ∈ C, C′ has a literal p′(x_1, …, x_m) such that the x_i are the variables occurring in t_1, …, t_n, in order of their appearance.
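Definitions 18 and 19 can be made concrete with a short Python sketch. It assumes variable-only datalog clauses written as lists of tuples (predicate symbol followed by variable names), stores an edge in both directions with swapped positions as the paths of Example 7 require, and encodes a path as a flat tuple such as ("p", 2, 1, "p"). The function names and this encoding are hypothetical.

```python
def occurrence_edges(clause):
    """Edges (l_i, l_j, pi_i, pi_j): a variable at pi_i in l_i and pi_j in l_j."""
    edges = set()
    for li in clause:
        for lj in clause:
            for pi, x in enumerate(li[1:], start=1):
                for pj, y in enumerate(lj[1:], start=1):
                    # No edge from an argument position to itself.
                    if x == y and not (li == lj and pi == pj):
                        edges.add((li, lj, pi, pj))
    return edges


def graph_context(literal, depth, clause):
    """con_gra(literal, depth, clause): labelled paths with `depth` edges."""
    edges = occurrence_edges(clause)
    paths = {((literal[0],), literal)}  # (path labels so far, current literal)
    for _ in range(depth):
        paths = {(labels + (pi, pj, lj[0]), lj)
                 for labels, current in paths
                 for (li, lj, pi, pj) in edges if li == current}
    return {labels for labels, _ in paths}


# Examples 6 and 7: C = {p(a, b), p(b, c), q(x, y, c)}
C = [("p", "a", "b"), ("p", "b", "c"), ("q", "x", "y", "c")]
print(graph_context(("p", "a", "b"), 1, C))
print(graph_context(("p", "a", "b"), 2, C))
```

Comparing two contexts then reduces to a set inclusion test, for example `graph_context(l1, d, C) <= graph_context(l2, d, D)`.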

Proposition 1 [SHW96]
Let C and D be clauses and l_1 ∈ C, l_2 ∈ D be literals. Let the depth d be any natural number and let l_1 µ = l_2, where µ is a matching substitution. If con_gra(l_1, d, C) ⊈ con_gra(l_2, d, D), then there is no θ such that Cµθ ⊆ D. See [SHW96] for a proof.

By Proposition 1, a literal cannot be matched against another literal if its context cannot be embedded inside the other literal's context. In other words, a literal can only match another literal if its context is a subset of the other literal's context. This gives rise to Algorithm 2, which is an extension of deterministic subsumption in Algorithm 1:

Algorithm 2: Graph context subsumption
1. For each literal l_1 ∈ C:
   a. If con_gra(l_1, d, C) is a subset of the graph context of exactly one literal l_2 ∈ D with l_1 µ = l_2, substitute C with µ.
   b. If l_1's graph context cannot be embedded in the graph context of any literal l_2 ∈ D, then C cannot subsume D.
   c. If l_1 and its graph context match non-deterministically with literals in D, then add l_1 to C′, the set of non-deterministically matching literals in C.
2. Start with the clause C′ substituted so far and use a backtracking algorithm to test whether C′ θ-subsumes D. Again, use the graph context to reduce the number of matching candidate literals in D for each literal in C.

Example 8: Let datalog clause C = {p(a, b), p(b, c), q(x, y, c)} and D = {p(r, s), p(s, t), q(p, q, t)}. Then from Example 6 we know that the occurrence graph of C has edges {(p(a, b), p(b, c), 2 → 1), (p(b, c), q(x, y, c), 2 → 3)}. The occurrence graph of D has edges {(p(r, s), p(s, t), 2 → 1), (p(s, t), q(p, q, t), 2 → 3)}. The context at depth 1 of p(a, b) is con_gra(p(a, b), 1, C) = {p 2→1 p}, and this can be embedded in the context of p(r, s), which is con_gra(p(r, s), 1, D) = {p 2→1 p}. In a similar fashion con_gra(p(b, c), 1, C) = {p 1→2 p, p 2→3 q} is a subset of con_gra(p(s, t), 1, D) = {p 1→2 p, p 2→3 q} and con_gra(q(x, y, c), 1, C) = {q 3→2 p} is a subset of con_gra(q(p, q, t), 1, D) = {q 3→2 p}. Each literal in C matches a literal in D, and the context of that literal in C is a subset of the context of the corresponding literal in D, and so C subsumes D.

2.2 Reduction of matching candidates using literal context

Another algorithm based on graph isomorphism (see Algorithm 3 below) relies on the principle that identical variables in one clause must match with identical variables in the other clause. This algorithm uses a literal graph as opposed to the occurrence graph in the graph context algorithm of Section 2.1.

Definition 20: Literal graph [SHW96]
Let l_i and l_j be literals in datalog clause C. The literal graph, G, of clause C has edges (l_i, l_j) ∈ E(G) such that l_i and l_j have a common variable x.
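Definition 20 translates almost directly into code. The sketch below assumes a variable-only datalog clause given as a list of tuples (predicate symbol followed by variable names) and records each undirected edge once, as a pair in list order; a literal with at least one variable shares that variable with itself and therefore gets a self-loop. The function name is hypothetical.

```python
def literal_graph(clause):
    """Return the edges (l_i, l_j) between literals that share a variable."""
    literals = list(clause)
    edges = set()
    for i, li in enumerate(literals):
        # Starting at position i includes the pair (l_i, l_i) and avoids
        # recording the same undirected edge twice.
        for lj in literals[i:]:
            if set(li[1:]) & set(lj[1:]):
                edges.add((li, lj))
    return edges


# The clause {p(x, y), p(y, z), p(a)}: p(x, y) and p(y, z) share variable y.
C = [("p", "x", "y"), ("p", "y", "z"), ("p", "a")]
for edge in sorted(literal_graph(C)):
    print(edge)
```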

Example 9: Let C be the datalog clause {p(x, y), p(y, z), p(a)}. Then the literal graph of C has edges E(G) = {(p(x, y), p(x, y)), (p(y, z), p(y, z)), (p(a), p(a)), (p(x, y), p(y, z))}, since each literal has a variable in common with itself, and p(x, y) and p(y, z) have variable y in common. See Figure 3 for a diagrammatic representation of this graph.

[Figure 3: The literal graph of the datalog clause {p(x, y), p(y, z), p(a)} from Example 9.]

Definition 21: Literal context [SHW96]
The context at depth d of a literal l ∈ C is the set con_lit(l, C, d) containing those literals that can be reached via a path of depth at most d in the literal graph of C. In addition, we can write con_lit(l, C, d, k) to limit the context to a random subset of size k of con_lit(l, C, d). This limitation on the size of the context prevents it from growing exponentially with d.

Example 10: For the clause C = {p(x, y), p(y, z), p(a)} in Example 9, the context at depth 1 of p(x, y) is the clause {p(x, y), p(y, z)}. Note that at depth 1, the context of a literal includes itself.

Proposition 2 [SHW96]
Let C and D be datalog clauses and l_1 ∈ C and l_2 ∈ D be literals, such that l_1 µ = l_2. If there is no θ such that con_lit(l_1, C, d)µθ ⊆ con_lit(l_2, D, d), then there is no θ such that Cθ ⊆ D. A proof is given in [SHW96].

There is a simple counterexample for this proposition, which we will give here. Let clause C = {f(X)} and D = {f(a)}, then f(X) matches f(a), by f(X)µ = f(a), with µ = {X/a}. The context of f(X) at depth 1 is con_lit(f(X), C, 1) = {f(X)} and the context of f(a) at depth 1 is con_lit(f(a), D, 1) = { }. There does not exist a θ such that con_lit(f(X), C, 1)µθ ⊆ con_lit(f(a), D, 1), and since f(a) is the only literal in D, Proposition 2 says C cannot subsume D. But Cθ ⊆ D is true with θ = {X/a}.

We can refine Scheffer et al.'s definitions in both this section and Section 2.1 by assuming that when they refer to datalog clauses, they mean variable-only datalog clauses. This would correlate with their examples and proofs. A datalog clause that contains constants can be transformed to a variable-only datalog clause by using a process called flattening.

Example 11: Let datalog clause C = f(a, b) ← f(a), f(b). Flattening replaces each constant in this clause by a new variable as follows: C′ = f(A, B) ← f(A), f(B), a(A), b(B). This clause is augmented with the background knowledge clauses a(a) and b(b). This ensures that the clause C′ is equivalent to C.

The determinacy concept can be generalised such that a literal l_1 ∈ C can only match a literal l_2 ∈ D if con_lit(l_1, C, d) θ-subsumes con_lit(l_2, D, d). If the context of some literal in C does not subsume the context of any literal in D, then C cannot subsume D.

Definition 22: Generalized determinacy [SHW96]
Let C and D be Horn clauses and let k be the maximum number of literals in any literal's context of lookahead depth d. Then C con(d, k)-deterministically subsumes D by θ = µ_0 … µ_n, written C ⊢_θdkdet D, iff there exists an ordering l_1, …, l_n of literals in C, such that for all i, 1 ≤ i ≤ |C|, there exists exactly one µ such that there is an l ∈ D with l_i µ = l and con_lit(l_i, C, d, k)µθ_i ⊆ con_lit(l, D, d).

Scheffer et al. point out that at depth 0, the context of a literal is the empty set and hence Definition 22 becomes identical to deterministic subsumption. For depth d > 0, the context constraint reduces the number of matching candidate literals in D for each literal in C. Algorithm 3 uses a literal's context to generalise the deterministic subsumption algorithm (Algorithm 1).

Algorithm 3: Literal context subsumption
1. For each literal l_1 ∈ C:
   a. If con_lit(l_1, C, d) subsumes the literal context of exactly one literal l_2 ∈ D with l_1 µ = l_2, substitute C with µ.
   b. If l_1's context cannot subsume the literal context of any literal l_2 ∈ D, then C cannot subsume D.
   c. If l_1 and its context match non-deterministically with literals in D, then add l_1 to C′, the set of non-deterministically matching literals in C.
2. Start with the clause C′ substituted so far and use a backtracking algorithm to test whether C′ θ-subsumes D. Again, use the literal context to reduce the number of matching candidate literals in D for each literal in C.

The complexity of testing for a con(d, k)-deterministically subsuming pair of clauses is O(|D|^k |C|^2 |D|). To see why, note that a deterministic match can be found in O(|C|^2 |D|) and context inclusion can be tested in O(|D|^k), since the size of the context is limited to k. In the worst case, each match additionally requires a context inclusion test, making the complexity O(|D|^k |C|^2 |D|).

Example 12: Let C = {p(x, y), p(y, z)} and D = {p(u, u), p(v)}. At depth 1, the context of p(x, y) is {p(x, y), p(y, z)}, and the context of p(y, z) is {p(y, z), p(x, y)}. The context of each literal in D is the literal itself, since they have no common variables, i.e. con_lit(p(u, u), D, 1) = {p(u, u)} and con_lit(p(v), D, 1) = {p(v)}. The literals in C can only match p(u, u) in D since they do not have the same arity as p(v). Additionally, the contexts of both literals in C subsume the context of p(u, u) and hence C subsumes D.

2.3 Clique and the general subsumption problem

Another strategy in [SHW96] maps subsumption to the problem of finding the maximum clique of a substitution graph. Each vertex in the substitution graph used to test Cθ ⊆ D is a substitution which matches some literal in C to one in D (see Definition 26).

Definition 23: Clique [SHW96]
A set of nodes C ⊆ V is a clique of a graph G(V, E) iff C × C ⊆ E. In words, a clique is a set of mutually adjacent nodes in a graph.

Definition 24: Strong compatibility [SHW96]
Two substitutions θ_1 and θ_2 are strongly compatible iff θ_1θ_2 = θ_2θ_1. This implies that no variable is reassigned in θ_1 or θ_2.

Example 13: The substitutions θ_1 = {X/a} and θ_2 = {X/b} are not strongly compatible since the order of the substitutions affects what X is replaced with. Similarly, θ_1 = {X/A} and θ_2 = {A/b} are not strongly compatible since Xθ_1θ_2 = b whereas Xθ_2θ_1 = A.

Definition 25 [SHW96]
uni(C, l_i, D) = {µ | l_i ∈ C, l_i µ ∈ D} is the set of all matching substitutions from a literal l_i in clause C to some literal in D.

Example 14: Let C = {p(X, Y)} and D = {p(a, b), p(c, d)}, then uni(C, p(X, Y), D) = {{X/a, Y/b}, {X/c, Y/d}}.

If we form the Cartesian product of uni over all literals in C and Cθ ⊆ D holds, then θ must correspond to an element of this Cartesian product.

Proposition 3 [Eis81]
A clause C subsumes a clause D by Cθ ⊆ D iff there is an n-tuple (θ_1, …, θ_n) ∈ ×_{i=1}^{n} uni(C, l_i, D), where n = |C|, such that all θ_i are pairwise strongly compatible.

If we are to find a valid θ such that Cθ ⊆ D, then in the worst case we will need to enumerate |D|^|C| substitutions. To see why, note that the size of uni(C, l_i, D) for some literal l_i ∈ C is at most |D|. By Proposition 3, θ must be an element of the set ×_{i=1}^{n} uni(C, l_i, D), which has a size of at most |D|^|C|, since n = |C|.
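Definitions 24 and 25 and Proposition 3 can be combined into a brute-force subsumption test, sketched below in Python. The assumptions are the same simplifications used for illustration only: literals are function-free tuples, strings starting with an uppercase letter are variables, all other symbols are fixed, and the function names are hypothetical. The loop over the Cartesian product is exactly the |D|^|C| enumeration discussed above, so this sketch is meant to clarify the definitions rather than to be efficient.

```python
from itertools import combinations, product


def is_var(term):
    # Assumption: variables are strings starting with an uppercase letter.
    return term[:1].isupper()


def match(literal, target):
    """Return the substitution mu with literal mu = target, or None."""
    if literal[0] != target[0] or len(literal) != len(target):
        return None
    mu = {}
    for arg, value in zip(literal[1:], target[1:]):
        if is_var(arg):
            if mu.setdefault(arg, value) != value:
                return None
        elif arg != value:
            return None
    return mu


def uni(literal, d):
    """uni(C, l_i, D): all matching substitutions from literal to literals of D."""
    return [mu for mu in (match(literal, target) for target in d) if mu is not None]


def compose_at(var, first, second):
    """Value of var under the composition `first` followed by `second`."""
    term = first.get(var, var)
    return second.get(term, term) if is_var(term) else term


def strongly_compatible(t1, t2):
    """Definition 24: t1 t2 = t2 t1 on every variable that either one binds."""
    return all(compose_at(v, t1, t2) == compose_at(v, t2, t1)
               for v in set(t1) | set(t2))


def subsumes(c, d):
    """Proposition 3: look for a pairwise strongly compatible n-tuple."""
    for combo in product(*(uni(literal, d) for literal in c)):
        if all(strongly_compatible(a, b) for a, b in combinations(combo, 2)):
            theta = {}
            for mu in combo:
                theta.update(mu)
            return theta
    return None


print(uni(("p", "X", "Y"), [("p", "a", "b"), ("p", "c", "d")]))  # Example 14
print(strongly_compatible({"X": "A"}, {"A": "b"}))               # Example 13
```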

Scheffer et al. map subsumption to the clique problem by defining a substitution graph. The vertices of this graph are the set of substitutions from any literal in C to some literal in D. An edge exists in this graph if two substitutions are strongly compatible.

Definition 26: Substitution graph [SHW96]
Let C and D be clauses and n = |C|. Then G is the substitution graph of C and D iff V(G) = ⋃_{i=1}^{n} (uni(C, l_i, D), i) and ((θ_1, i), (θ_2, j)) ∈ E(G) iff θ_1 and θ_2 are strongly compatible and i ≠ j.

Proposition 4 [SHW96]
Let C and D be clauses. Then Cθ ⊆ D with θ = θ_1 … θ_n iff there is a clique {θ_1, …, θ_n} of size |C| in the substitution graph of C and D. See [SHW96] for a proof.

Example 15: Let C be the clause {p(A, B), p(Q, B)} and D be the clause {p(x, y), p(y, z)}. The set of vertices in the substitution graph is {{A/x, B/y}, {A/y, B/z}, {Q/x, B/y}, {Q/y, B/z}}, of which the pairs {({A/x, B/y}, {Q/x, B/y}), ({A/y, B/z}, {Q/y, B/z})} are strongly compatible and so form edges in the substitution graph (see Figure 4). Clearly the maximum clique of this graph is of size 2, which is equal to |C|, and hence C subsumes D by Proposition 4.

[Figure 4: The substitution graph of C = {p(A, B), p(Q, B)} and D = {p(x, y), p(y, z)} from Example 15, with vertices {A/x, B/y}, {A/y, B/z}, {Q/x, B/y} and {Q/y, B/z}.]

Scheffer et al. also outline Carraghan and Pardalos' algorithm for finding the maximum clique of a graph [CP90]. They give a number of specialisations for this algorithm applied to subsumption (see [SHW96] for details). Algorithm 4 below is a simple procedure to find whether clause C subsumes D using the maximum clique method.

Algorithm 4: Maximum clique subsumption
1. Find the substitution graph, S, of clauses C and D.
2. Compute the maximum clique, M, of S.
   a. If |M| is equal to |C| then C subsumes D.
   b. If |M| is less than |C| then C does not subsume D.

The maximum clique of the substitution graph, S, can never be larger than |C| since substitutions for the same literal in C are not strongly compatible and so do not form an edge in the substitution graph. Hence edges only exist between substitutions applied to different literals in C, and so a clique can be no larger than |C|.

2.4 Subsumption between k-local clauses

Kietz and Lübbe perform similar experiments to Scheffer et al. in [KL94]. They compare plain, deterministic and k-local subsumption on an artificial dataset. We will briefly describe their k-local algorithm here.

A local of a clause is a subset of its literals such that the variables in the local are disjoint from those in the rest of the clause. A local L is k-local iff k ≥ min(|vars(L)|, |L|) for some constant k, where vars(L) is the set of variables in L. A clause is k-local iff every non-determinate³ local is a k-local. To test whether some clause C subsumes D, each local of C has to be able to subsume D. If any local of C fails to subsume D, then C cannot subsume D.

3 Experimental setup

Our experimental setup is consistent with Scheffer et al. [SHW96]. The aim was to see whether their results were consistent with ours. We ran plain, literal context and deterministic subsumption on the mesh dataset (described in Section 3.2).

3.1 Subsumption Code

The subsumption algorithms were implemented in SICStus Prolog. The top level subsumption predicates were called plain_subsume/2, det_subsume/2 and literal_subsume/3, corresponding to plain, deterministic and literal context subsumption respectively. Each of these predicates succeeds only if the first input clause subsumes the second. A clause was represented as a list of terms in Prolog.

The plain_subsume predicate uses a simple backtracking algorithm, which is easy to write in Prolog:

    plain_subsume(C, D) :-
        numvars(D, 0, _), !,   % ground D so that only variables of C get bound
        subset(C, D), !.       % unify each literal of C with some literal of D

If we wish to test Cθ ⊆ D, then the literals in D are first made ground using the numvars/3 predicate. We then attempt to unify each literal in C with a literal in D, using backtracking as necessary. This is achieved with the subset predicate, which tests whether clause C is a subset of D.
It is important that D is ground since, when unifying a literal in C with one in D, we only want to apply a substitution to the literal in C.

The top level deterministic subsumption predicate, det_subsume, is very similar to plain_subsume. The only difference is that we first find the deterministically matching literals in C, and then check whether the remaining non-deterministically matching literals in C subsume D:

    det_subsume(C, D) :-
        numvars(D, 0, _),                     % ground D first
        det_subsume1(C, D, _Det, NDet), !,    % split C into Det and NDet literals
        subset(NDet, D), !.                   % backtrack over NDet only

³ In "C subsumes D", a non-determinate local of C is the local of the subset of literals in C which non-deterministically match in D.

The det_subsume1 predicate separates the literals in C into those which match deterministically and those which match non-deterministically in D. Then subset/2 checks whether the non-deterministically matching literals in C, NDet, subsume D.

Our literal_subsume(C, D, N) predicate checks whether a clause C subsumes D using literal context subsumption at depth N. It first generates literal graphs for both clauses C and D, and then separates the literals in C into those that do and do not match deterministically in D according to the literal context criterion. Finally, it tests whether the non-deterministically matching literals in C subsume D, using literal context to reduce the number of matching literals in D for any literal in C.

3.2 The Mesh Dataset

The stresses on physical objects can be modelled by representing them with finite elements. Choosing an appropriate resolution for the object typically requires expert knowledge. The mesh dataset [BM92] was generated for the application of Inductive Logic Programming to choosing the resolution of finite elements for an object. The complete mesh dataset can be found on the MLnet server [ML].

The mesh dataset consists of 10 example mesh models. Each model is described by a number of facts of the form mesh(E, N), where E is the name of an edge and N is the number of finite elements on that edge. Each edge has certain properties associated with it, such as its load and type. An edge also has facts of the form neighbour(N, M), denoting that edge N is adjacent to edge M. Figure 5 shows some example mesh data.

3.3 Preparing the data

The data used for our subsumption testing was generated by following the description in [SHW96] as closely as possible. Mesh clauses of depth n are generated using a mesh/2 literal as the head and the set of neighbour/2 literals at depth n from the head as the body. Figure 5 demonstrates how an example mesh clause is generated.

    mesh(a1, 17).
    mesh(a2, 1).
    mesh(a3, 8).
    mesh(a4, 1).
    mesh(a5, 1).
    mesh(a6, 2).
    neighbour(a1, a2).
    neighbour(a1, a4).
    neighbour(a1, a6).
    neighbour(a2, a1).
    neighbour(a2, a4).
    neighbour(a2, a5).
    neighbour(a2, a3).

Figure 5: Some example data. A mesh clause with mesh(a1, 17) at the head and a depth of 1 has the body {neighbour(a1, a2), neighbour(a1, a4), neighbour(a1, a6)}. At depth 2, the neighbours of a2 can be reached and the new body becomes {neighbour(a1, a2), neighbour(a1, a4), neighbour(a1, a6), neighbour(a2, a4), neighbour(a2, a5), neighbour(a2, a3)}.

The following process was used to generate the clauses C in Cθ ⊆ D. An initial set of clauses was generated by finding all mesh clauses of depth d, 0 < d ≤ 3, giving clauses in total. The LGGs of 500 pairs of clauses selected at random from this initial set were computed to form 500 intermediate clauses. Since the size of the LGG of two clauses X and Y is O(|X||Y|), the body literals of each intermediate clause were randomly sampled so that no body exceeds the maximum body size of D (262 literals). The resulting clauses formed C.
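The construction of a mesh clause body illustrated in Figure 5 can be sketched in Python as a breadth-first walk over the neighbour/2 facts. This is only a sketch under stated assumptions: the function name mesh_clause_body is hypothetical, and the rule that a literal is skipped when its reverse is already in the body is inferred from the Figure 5 example (which omits neighbour(a2, a1) at depth 2) rather than taken from [SHW96].

```python
from collections import defaultdict

# neighbour/2 facts from Figure 5, as (N, M) pairs.
NEIGHBOUR = [("a1", "a2"), ("a1", "a4"), ("a1", "a6"),
             ("a2", "a1"), ("a2", "a4"), ("a2", "a5"), ("a2", "a3")]


def mesh_clause_body(head_edge, depth, neighbour_facts):
    """Collect the neighbour/2 literals reachable within `depth` steps of head_edge."""
    adjacency = defaultdict(list)
    for n, m in neighbour_facts:
        adjacency[n].append(m)

    body, seen = [], set()
    frontier, visited = [head_edge], {head_edge}
    for _ in range(depth):
        next_frontier = []
        for edge in frontier:
            for other in adjacency[edge]:
                literal = ("neighbour", edge, other)
                reverse = ("neighbour", other, edge)
                # Assumption: skip a literal whose reverse is already in the body.
                if literal not in seen and reverse not in seen:
                    seen.add(literal)
                    body.append(literal)
                if other not in visited:
                    visited.add(other)
                    next_frontier.append(other)
        frontier = next_frontier
    return body


for d in (1, 2):
    print(d, mesh_clause_body("a1", d, NEIGHBOUR))
```

With the Figure 5 facts, depth 1 yields the three neighbour literals of a1 and depth 2 adds neighbour(a2, a4), neighbour(a2, a5) and neighbour(a2, a3), which agrees with the bodies listed in the caption.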

The clauses D were generated by finding all mesh clauses of depth 6, giving 625 clauses in total. Each D clause contains approximately 130 literals, and every literal is ground (i.e. contains no variables).

3.4 Subsumption testing

We tested Cθ ⊆ D on randomly selected pairs of the possible pairs of clauses C and D. On average this resulted in about positive tests and 5000 negative tests. Scheffer et al. run their tests on 5000 to positive and negative examples, but time constraints meant we were unable to generate the required number of negative examples.

The algorithms we ran were plain, deterministic and literal context subsumption at depths 1 and 2. For each algorithm we varied the size of clause C from 5 to 50 literals in steps of 5. To obtain a clause C of size n, only the first n literals were selected, with the first literal always being the head. The execution time of each subsumption test was recorded to the nearest hundredth of a second.

To speed up the testing process for each configuration⁴, the set of subsumption tests was split into 50 jobs of 1000 tests each and run on different machines. The Condor High-Throughput Computing system [CON] was used to distribute these jobs across Intel/Linux workstations. Condor executes a job on a particular workstation only when no one is logged onto that machine.

4 Results and Analysis

The testing ran 20 different configurations and took approximately 87 days of computation time in total. Each Condor job ran on a different machine, and these machines often had different CPU speeds. Results were scaled to approximate the performance of a SPARCstation 20 machine, as used by Scheffer et al. Note, however, that they wrote their code in C, which in general runs faster than the equivalent Prolog code. For configurations with more than 5 literals, it was rare for all subsumption tests to complete.
Each configuration was run for at least 24 hours before it was aborted; otherwise testing would have taken far longer than the time available.

    Subsumption algorithm     Rank   Rank in [SHW96]
    Plain                     4      4
    Deterministic             3      3
    Literal context (d=1)     2      1
    Literal context (d=2)     1      2

Table 1: Approximate rank of subsumption algorithms in the positive case for both our results and those in [SHW96].

In contrast to Scheffer et al., our results are less accurate in the negative case, as our data gave fewer negative examples than positive ones. As the number of literals increased, we were able to run fewer subsumption tests in a 24 hour period and so our results became less accurate. As noted by Scheffer et al., the variance on the mean subsumption time was in general high, and long-running subsumption tests significantly affected the mean. Most individual tests would take under a second to complete, but rare cases would require hours. Because each configuration was stopped at the end of its 24 hour test period, long-running subsumption tests were the ones most likely to be cancelled.

⁴ By configuration, we mean a particular subsumption algorithm with a particular size of clause C, for Cθ ⊆ D.

For positive examples, the mean subsumption times are shown in Figure 6, and the rankings of the subsumption algorithms in Table 1. As expected, plain subsumption performs the worst, and its mean time per subsumption test rises rapidly with the number of literals. Deterministic subsumption fares better because it matches deterministic literals first, although both algorithms become too slow for accurate results with more than 20 literals. Literal context gives a further improvement over deterministic subsumption, probably because more literals can be matched deterministically. Literal context at depth 2 is faster than at depth 1 because each literal in C has a larger context with which to reduce the number of matching literals in D. This makes it more likely that a particular literal in C matches only one literal in D. Although not apparent in Figure 6, literal context is actually slower than plain subsumption on 5 literals, due to the initial cost of calculating the literal graph of each clause.

[Figure 6 plot: mean time (s) against number of literals, with curves for plain, deterministic, literal context (d=1) and literal context (d=2) subsumption.]

Figure 6: Mean time for a positive subsumption test. Error bars are shown on the plain subsumption curve denoting the 95% confidence interval on the mean.

The results from Scheffer et al. are shown in Figure 7. If we ignore our plain subsumption result at 20 literals, which is shown to have a high estimation error, then our results are of the order of 10 times slower than theirs. We can put this difference down to Prolog code being slower than the equivalent C code, and to inaccuracies in estimating the relative speeds of SPARCstation 20 and Intel workstations.
With fewer than 20 literals, both sets of curves grow relatively slowly, but at 20 literals there is a large increase in the mean time for plain and deterministic subsumption. For literal context subsumption, the mean time grows slowly up to about 25 literals. In contrast to Scheffer et al., we did not notice a large increase in subsumption time with literal context subsumption at 30 literals. Scheffer et al. also find that literal context subsumption performs better at depth 1 than at depth 2, whereas we find the opposite.
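The error bars in Figure 6 denote a 95% confidence interval on the mean. The exact procedure used to obtain them is not restated here, so the following Python sketch shows one common way such an interval might be computed: a normal approximation, mean ± 1.96·s/√n, with s the sample standard deviation. The function name mean_with_ci and the sample timings are illustrative assumptions, not measured data.

```python
import math
import statistics


def mean_with_ci(times, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Assumption: a normal approximation is adequate; for heavily skewed
    timings with few samples the interval is only a rough guide.
    """
    mean = statistics.mean(times)
    half_width = z * statistics.stdev(times) / math.sqrt(len(times))
    return mean, half_width


# Made-up timings (seconds): mostly fast tests plus one long-running test.
sample = [0.02, 0.03, 0.05, 0.04, 120.0]
mean, half_width = mean_with_ci(sample)
print(f"mean = {mean:.2f} s, 95% CI = +/- {half_width:.2f} s")
```

A single long-running test dominates both the mean and the width of the interval, which is the effect described above for configurations where few tests completed.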

Figure 7: Mean time for a positive subsumption test. These results are those obtained in [SHW96].

As mentioned earlier, the negative results (Figure 8) are less accurate than the positive ones. Looking at the ranking of algorithms in Table 2, we can see that for negative examples deterministic subsumption performs the best, with a mean time under 20 milliseconds. We can explain our results by assuming that our negative examples are identified because, in testing Cθ ⊆ D, a literal in C cannot match any literal in D. This would explain why deterministic subsumption appears so fast. It also explains why literal context at depth 1 is slower than deterministic subsumption: the literal context criterion means that it takes longer to identify whether literals in C match deterministically in D. Literal context subsumption at depth 2 is slower than at depth 1 because each time we check whether a literal in C matches one in D, we need to check whether the context of the literal in C subsumes the context of the literal in D. At depth 2, this context is likely to be larger than at depth 1 and hence the algorithm becomes slower.

The mean time for a negative test using plain subsumption shows odd behaviour in Figure 8, as it decreases at 20 literals. This result is inaccurate, since only 41 negative tests were able to run and long tests were likely to be cancelled before completion.

    Subsumption algorithm     Rank   Rank in [SHW96]
    Plain                     4      4
    Deterministic             1      3
    Literal context (d=1)     2      Unknown
    Literal context (d=2)     3      Unknown

Table 2: Approximate rank of subsumption algorithms in the negative case for both our results and those in [SHW96].

Scheffer et al.'s results in Figure 9 show that literal context subsumption has a mean negative subsumption time of less than 3 seconds. Our literal context algorithm has a subsumption time of less than 10 seconds at depth 1 and less than 60 seconds at depth 2. Both sets of literal context curves show very little growth, probably because negative examples are identified by the context of a literal in C not subsuming the context of any literal in D, in testing Cθ ⊆ D. We suspect that the discrepancy between the deterministic subsumption curves is due to Scheffer et al.'s data containing more examples where there is a need to check whether the non-deterministically matching literals in C subsume D. This would require an invocation of plain subsumption.

[Figure 8 plot: mean time (s) against number of literals, with curves for plain, deterministic, literal context (d=1) and literal context (d=2) subsumption.]

Figure 8: Mean time for a negative subsumption test. Error bars are shown on the literal context (d=2) curve denoting the 95% confidence interval on the mean.

Figure 9: Results of [SHW96] of the mean time for a negative subsumption test.

In general, if we consider the inaccuracies in estimating the speed of the workstations used in [SHW96] and the relative speeds of Prolog and C, our results are consistent with those of Scheffer et al. The only exception is deterministic subsumption in the negative case, which would need to be investigated with more examples.

5 Conclusions and Future Work

We have presented the θ-subsumption problem, which is critical to the efficiency of ILP learners. Although the problem is NP-complete, [SHW96] has demonstrated several subsumption algorithms that are efficient in the average case. We have seen that these algorithms are based on mapping subsumption to graph problems. In particular, Scheffer et al. map subsumption to the graph isomorphism and maximum clique problems. Kietz and Lübbe [KL94] describe an additional algorithm based on finding sets of literals with disjoint variable sets, called locals, and subsuming each local independently.

We have implemented and tested plain, deterministic and literal context subsumption on the mesh dataset. All of the algorithms in [SHW96] gave an improvement on average over plain subsumption. For positive subsumption tests, literal context at depth 2 is the fastest algorithm, followed by literal context at depth 1 and deterministic subsumption. In the negative case, deterministic subsumption outperforms all other algorithms, although this is likely to be caused by the small number of negative examples we use. Scheffer et al. observe considerably worse performance in their tests with deterministic subsumption. All algorithms suffer from large variances in the time for each subsumption test. Our results are consistent with those in [SHW96], except for deterministic subsumption in the negative case.

The experimental setup could have been improved by running the tests in a more controlled environment. Since Condor distributes jobs to ordinary workstations, jobs were sometimes lost when users restarted machines. A dedicated compute cluster would have given less experimental error and less wasted CPU time. It may also have been worthwhile to use a smaller literal step size to compensate for the mean subsumption time increasing exponentially.

To improve results further, a clause reduction algorithm could be run on each clause before every subsumption test. A clause C is reduced iff there does not exist a clause D such that D ⊂ C and Cθ ⊆ D. For example, let C be the clause {f(a, X), f(a, Y), f(b, c)}. This clause is not reduced, since the clause D = {f(a, X), f(b, c)} is a proper subset of C and Cθ ⊆ D with θ = {Y/X}. Note that clause reduction itself requires subsumption tests.

Interesting future work includes comparing the performance of the maximum clique and graph context algorithms with those tested here. The results in [SHW96] show that maximum clique combined with the graph context algorithm is significantly faster than other subsumption algorithms. Scheffer et al. report that the mean time in the positive case with this algorithm is only 1 second. Further, they suggest that this algorithm can in turn be combined with the k-local algorithm. After a suitable evaluation of these algorithms, it is hoped that the fastest can be integrated into existing ILP systems.
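The clause reduction step suggested above can be sketched in Python as follows. This is a minimal sketch and not part of the implemented system: literals are assumed to be tuples, strings beginning with an upper-case letter are assumed to be variables, and the variables of the target clause are treated as constants during matching (the role played by numvars in our Prolog code). The names subsumes and reduce_clause are hypothetical. The sketch relies on the standard observation that a clause is reduced exactly when no single literal L can be dropped such that C still subsumes C \ {L}.

```python
def is_var(term):
    """Assumption: variables are strings starting with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()


def match(literal, target, theta):
    """Extend theta so that literal under theta equals target (target terms are frozen)."""
    if literal[0] != target[0] or len(literal) != len(target):
        return None
    extended = dict(theta)
    for term, value in zip(literal[1:], target[1:]):
        if is_var(term):
            if extended.setdefault(term, value) != value:
                return None  # variable already bound to a different term
        elif term != value:
            return None  # a constant can only match itself
    return extended


def subsumes(c, d, theta=None):
    """Return a substitution mapping every literal of c into d, or None."""
    theta = {} if theta is None else theta
    if not c:
        return theta
    for target in d:
        extended = match(c[0], target, theta)
        if extended is not None:
            result = subsumes(c[1:], d, extended)
            if result is not None:
                return result
    return None


def reduce_clause(clause):
    """Drop each literal whose removal leaves a clause still subsumed by the current one."""
    reduced = list(clause)
    for literal in list(clause):
        candidate = [l for l in reduced if l != literal]
        if subsumes(reduced, candidate) is not None:
            reduced = candidate
    return reduced


C = [("f", "a", "X"), ("f", "a", "Y"), ("f", "b", "c")]
D = [("f", "a", "X"), ("f", "b", "c")]

print(subsumes(C, D))    # {'X': 'X', 'Y': 'X'}, i.e. theta = {Y/X}
print(reduce_clause(C))  # [('f', 'a', 'Y'), ('f', 'b', 'c')]
```

The reduced clause printed here keeps f(a, Y) rather than f(a, X); it is a variant of the clause D in the example, differing only in the name of the variable.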

6 References

6.1 Books, Journal articles and Conference papers

[BM92] B. Dolsak and S. Muggleton, The Application of Inductive Logic Programming to Finite Element Mesh Design, Inductive Logic Programming, Academic Press, pp , 1992

[CP90] R. Carraghan and P. Pardalos, An exact algorithm for the maximum clique problem, Operations Research Letters, vol. 9, no. 6, pp , 1990

[Eis81] N. Eisinger, Subsumption and connection graphs, In Proc. IJCAI, 1981

[H96] J. L. Hein, Theory of Computation: An Introduction, Jones and Bartlett Publishers, 1996

[KL94] J. U. Kietz and M. Lübbe, An efficient subsumption algorithm for inductive logic programming, In Proc. International Conference on Machine Learning, 1994

[M97] T. Mitchell, Machine Learning, McGraw Hill, New York, USA, 1997

[P70] G. D. Plotkin, A Note on Inductive Generalisation, In B. Meltzer and D. Michie, editors, Machine Intelligence, volume 5, pp , 1970

[SHW96] T. Scheffer, R. Herbrich and F. Wysotzki, Efficient θ-subsumption Based on Graph Algorithms, In Proc. Int. Workshop on Inductive Logic Programming, 1996

[Rob65] J. A. Robinson, A machine-oriented logic based on the resolution principle, J. ACM, vol. 12, num. 1, pp , 1965

[UW81] S. Unger and F. Wysotzki, Lernfähige Klassifizierungssysteme, Akademie Verlag Berlin, 1981

[WSK81] F. Wysotzki, J. Selbig and W. Kolbe, Concept learning by structured examples - an algebraic approach, In Proc. of the 7th International Joint Conference on Artificial Intelligence, 1981

6.2 Web References

[ML] MLnet, Available: [23 December 2002]

[CON] Condor, Computer Sciences Department, University of Wisconsin-Madison, Available: [23 December 2002]

7 Glossary

θ-subsumption
A clause C θ-subsumes D, written C θ D, if and only if there exists a substitution θ such that Cθ ⊆ D and C D. θ-subsumption is an incomplete, decidable consequence relation.

Clause
A clause is a disjunction of literals whose variables are universally quantified. For example A(r(b) g(a) x) is a clause.

Completeness
A logical system is complete if and only if every true statement expressible by the system can be proved as a theorem.

Computational Complexity
Computational complexity is the study of the computational resources required by an algorithm. The worst case complexity of an algorithm is expressed as O(f(n)), where f(n) is the rate of growth of the resources required by the algorithm for large n.

Consequence relation
A relation R defined between C and D, given by C R D, is a consequence relation if C R D implies C ⊨ D.

Conjunction
The conjunction of A and B, written A ∧ B, is true if both A and B are true.

Datalog clause
A datalog clause is a specialisation of a clause such that all terms in the clause, which we will call datalog terms, are restricted to variables or functions of any datalog term.

Decidability
A logical system is decidable if and only if there exists a procedure that will yield an answer in a finite amount of time to the question "Is this argument valid or invalid?" for every argument expressible in that logical system.

Disjunction
The disjunction of A and B, written A ∨ B, is true if either or both of A and B are true.

Entailment
A clause α entails β, written α ⊨ β, iff α → β is a tautology. In words, in all cases where α is true, β is also true.

Horn Clause
A Horn clause is a clause which contains at most one positive literal. A Horn clause which has exactly one positive literal (called a definite clause) can be written in the form H ← (L1 ∧ … ∧ Ln), where H, L1, …, Ln are positive literals, H is called the head and L1, …, Ln are called the body literals of the clause.

Inductive Logic Programming
Inductive Logic Programming is concerned with the induction of Horn clauses from examples and background knowledge.

Intractability
An intractable problem is one that is not in the class P. This means that it cannot be solved in polynomial time by any Turing complete machine.

Literal
A literal is any predicate or its negation applied to any set of terms. Examples include human(john), parent(ian, bob), wheels(car, 4).

NP-Complete
A decision problem is in NP if a solution can be found non-deterministically in polynomial time. A problem in NP is NP-Complete if every other decision problem in NP can be polynomially transformed to it.

Term
A term is a constant, variable or function applied to any term. Examples include x, height(house), s(s(0)).

8 Appendix A Subsumption testing results

Columns of Tables 3 to 7: Literals; Tests; Mean time for each test (s); Variance (s²); Total time (s); 95% confidence interval of mean.

Table 3: Results on positive examples using plain subsumption.

Table 4: Results on negative examples using plain subsumption.

Table 5: Results on positive examples using deterministic subsumption.

Table 6: Results on negative examples using deterministic subsumption.

Table 7: Results on positive examples using literal context subsumption at a depth of 1.
