Learning Conjunctions of Horn Clauses

(extended abstract)

Dana Angluin*
Computer Science, Yale University, New Haven, CT 06520
angluin@cs.yale.edu

Michael Frazier
Computer Science, University of Illinois, Urbana, Illinois 61801
mfrazier@cs.uiuc.edu

Leonard Pitt†
Computer Science, University of Illinois, Urbana, Illinois 61801
pitt@cs.uiuc.edu

*Supported by NSF Grant IRI-8718975.
†Supported by NSF Grant IRI-8809570 and by the Department of Computer Science, University of Illinois at Urbana-Champaign.

Abstract

An algorithm is presented for learning the class of Boolean formulas that are expressible as conjunctions of Horn clauses. (A Horn clause is a disjunction of literals, all but at most one of which is a negated variable.) The algorithm uses equivalence queries and membership queries to produce a formula that is logically equivalent to the unknown formula to be learned. The amount of time used by the algorithm is polynomial in the number of variables and the number of clauses in the unknown formula.

1 The Problem

A central question in the theory of concept learning is to determine those types of Boolean formulas that can be learned in polynomial time. To date, relatively few algorithms for learning restricted classes of Boolean formulas are known, although a number of learning algorithms have been given for other types of concepts (see, for example, many of the papers in [9, 13]). Algorithms that have been given for learning restricted classes of Boolean formulas include algorithms for learning monomials (pure conjunctive concepts) [11, 14], internal disjunctive concepts [6], read-once formulas [4], monotone DNF formulas [2, 14], and k-CNF and k-DNF formulas for constant k [5, 11, 15]. These algorithms use a variety of different natural learning protocols. In this paper we extend this collection of positive results by giving a polynomial-time algorithm that learns the class of Horn sentences using equivalence and membership queries.

Let V = v_1, ..., v_n be a set of Boolean variables. A literal is either a variable v_i or its negation ¬v_i. A clause over variable set V is a disjunction of literals. A Horn clause is a clause in which at most one literal is unnegated. A Horn sentence is a conjunction of Horn clauses. The class of Horn sentences over variable set V is a proper subclass of the class of Boolean formulas over V.

Let H∗ denote the target Horn sentence to be learned. The following protocol is used: The learning algorithm may propose as a hypothesis any Horn sentence H by making an equivalence query to an oracle. If H is logically equivalent to H∗, then the answer to the query is "yes," and the learning algorithm has succeeded in the inference task and halts. Otherwise, the answer to the equivalence query is "no," and the learning algorithm receives a counterexample: an assignment x : V → {true, false} that satisfies H∗ but does not satisfy H (a positive counterexample), or vice versa (a negative counterexample). A membership query is any assignment x to the variables, and the answer to the membership query is "yes" if x satisfies the target formula H∗, and "no" otherwise.

The algorithm runs in time O(m³n⁴), making O(m²n²) equivalence queries and O(m²n) membership queries, where m is the number of clauses and n is the number of variables of H∗. In the full paper, by employing a carefully chosen representation for Horn sentences, these bounds are improved to Õ(m²n²), O(mn), and O(m²n), respectively. (The Õ, or soft-O, notation is similar to the usual O notation except that Õ ignores logarithmic factors.)

It is interesting to note that both types of queries are necessary for learning Horn sentences. Angluin [2] shows that membership queries alone are insufficient for polynomial-time learning, and, implicitly in [1], she proves that equivalence queries alone are also insufficient.
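This protocol is easy to prototype. The sketch below is ours, not the paper's; names such as SimulatedOracles, satisfies, and member are illustrative. An assignment is represented by the frozenset of variables it sets to true, and a Horn clause by a pair (antecedent, consequent), with None playing the role of the constant F introduced in Section 2. The oracles are simulated by brute force against a known target, so they are usable only for small n:

```python
from itertools import combinations

def satisfies_clause(x, clause):
    """True iff assignment x satisfies the Horn clause (A => z)."""
    antecedent, consequent = clause
    if not antecedent <= x:            # x does not cover the clause
        return True
    return consequent is not None and consequent in x

def satisfies(x, sentence):
    """True iff x satisfies every clause of the Horn sentence."""
    return all(satisfies_clause(x, c) for c in sentence)

class SimulatedOracles:
    """Brute-force equivalence and membership oracles for a known target.

    The equivalence query enumerates all 2^n assignments looking for a
    counterexample, so this stand-in is feasible only for small n."""

    def __init__(self, variables, target):
        self.variables = list(variables)
        self.target = target

    def member(self, x):
        """Membership query: does x satisfy the target?"""
        return satisfies(x, self.target)

    def equivalent(self, hypothesis):
        """Equivalence query: None means "yes"; otherwise a counterexample."""
        for r in range(len(self.variables) + 1):
            for subset in combinations(self.variables, r):
                x = frozenset(subset)
                if satisfies(x, hypothesis) != satisfies(x, self.target):
                    return x
        return None
```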

A similar result for learning monotone formulas in disjunctive normal form (DNF) has been given [2, 14]. The dual of the class of Horn sentences is the class of almost monotone DNF formulas: a disjunction of terms, where each term is a conjunction of literals, at most one of which is negated. Since our algorithm is easily modified to handle the dual class, it extends the results in [2, 14] by allowing a small amount of nonmonotonicity. (In the conclusion we indicate why allowing more nonmonotonicity would yield a difficult problem.) Horn sentences are an interesting nontrivial subclass of CNF (dual: DNF) formulas, the learnability of which remains a central open problem for the distribution-independent ("PAC-learning") model of Valiant [14].

The research presented here also improves the results in [3], where the class of Horn sentences is shown to be learnable by an algorithm that uses equivalence queries that return Horn clauses as counterexamples and derivation queries, a type of query that is significantly more powerful than a membership query. The relationships between our results and previous work are explained in more detail in the full paper.

By modifying the algorithm presented here in a relatively straightforward way [2], we could obtain an algorithm that learns the class of Horn sentences using randomly generated examples as in the PAC-learning model, provided that the algorithm is additionally allowed to make membership queries. Similarly, the algorithm presented here could be used in an on-line setting in which the learning algorithm is to classify each of a succession of examples, and the algorithm is told whether its classification is correct or incorrect before receiving each next example. The resulting on-line algorithm makes membership queries (excluding the examples to be classified) but not equivalence queries, and is guaranteed to make at most a polynomial number of errors of classification regardless of the sequence of examples [2].

Note that because the problem of determining whether two Horn sentences are equivalent (and producing a counterexample if they are not) is solvable in polynomial time, the oracle in our learning protocol could be replaced by a teacher with polynomially bounded computational resources.

2 Preliminaries

It is often easier to discuss the satisfaction or falsification of a Horn clause when that clause is represented as an implication. To expedite the discussion we will implicitly assume that all Horn clauses are represented as implications. This necessitates the introduction of two logical constants.

Definition 1 The logical constant true is represented by T and the logical constant false is represented by F.

Next, we introduce notation that will enable us to dissect Horn clauses and discuss the relationships between them and examples. First, recall the identity (¬v_{i_1} ∨ ⋯ ∨ ¬v_{i_k} ∨ z) ≡ ((v_{i_1} ∧ ⋯ ∧ v_{i_k}) ⇒ z), where ⇒ is the logical connective for implication and ≡ is a metasymbol indicating logical equivalence. For instance, the Horn clause ¬v_1 ∨ ¬v_2 ∨ v_3 is written v_1 ∧ v_2 ⇒ v_3. Now, taking ⋀_{v ∈ ∅} v = T (the empty conjunction evaluates to true), and adopting the convention that we write (v_{i_1} ∧ ⋯ ∧ v_{i_k}) ⇒ F when there are no unnegated variables in the Horn clause, we have the following definitions.

Definition 2 Let H be any Horn sentence over V. An example is any assignment x : V → {true, false}. A positive (respectively, negative) example for H is an assignment x such that H evaluates to true (respectively, false) when each variable v in H is replaced by x(v).
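Under the pair representation used in the earlier sketch, the translation from a Horn clause's literals to its implication form is a one-liner. The helper below is hypothetical (ours), showing both conventions in action:

```python
def as_implication(negated, unnegated):
    """Represent a Horn clause, given its set of negated variables and its at
    most one unnegated variable, as (antecedent, consequent); None encodes F."""
    assert len(unnegated) <= 1
    return (frozenset(negated), next(iter(unnegated), None))

# ¬v1 ∨ ¬v2 ∨ v3 becomes v1 ∧ v2 ⇒ v3; the headless ¬v1 ∨ ¬v2 becomes v1 ∧ v2 ⇒ F.
assert as_implication({"v1", "v2"}, {"v3"}) == (frozenset({"v1", "v2"}), "v3")
assert as_implication({"v1", "v2"}, set()) == (frozenset({"v1", "v2"}), None)
```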
Definition 3 Let x be an example; then true(x) is the set of variables assigned the value true by x, and false(x) is the set of variables assigned the value false by x. By convention, T ∈ true(x) and F ∈ false(x).

Definition 4 Let C be a Horn clause. Then antecedent(C) is the set of variables that occur negated in C. If C contains an unnegated variable z, then consequent(C) is just z. Otherwise, C contains only negated variables and consequent(C) is F.

We now describe the relationships that may exist between an example and a Horn clause.

Definition 5 An example x is said to cover a Horn clause C if antecedent(C) ⊆ true(x). We say that x does not cover C if antecedent(C) ⊄ true(x). The example x is said to violate the Horn clause C if x covers C and consequent(C) ∈ false(x).

Notice that if x violates C then x must cover C, but the converse does not necessarily hold.

It will be more convenient throughout the rest of the paper to consider a Horn sentence as a set of Horn clauses, representing the conjunction of the clauses. Our first observation is trivial, but it is helpful to state it formally.

Proposition 1 If x is a negative example for the Horn sentence H, then x violates some clause of H.
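Definitions 3 through 5 translate directly into code. A sketch continuing the illustrative representation above, where the frozenset x is true(x) with T and F left implicit:

```python
def covers(x, clause):
    """x covers C iff antecedent(C) ⊆ true(x)."""
    antecedent, _ = clause
    return antecedent <= x

def violates(x, clause):
    """x violates C iff x covers C and consequent(C) ∈ false(x).

    A consequent of None stands for the constant F, which lies in false(x)
    for every example by the paper's convention."""
    antecedent, consequent = clause
    return antecedent <= x and (consequent is None or consequent not in x)
```

Note that violates(x, C) implies covers(x, C), matching the remark after Definition 5, and that satisfies_clause from the earlier sketch is exactly the negation of violates.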

We next define the ∩ operation for examples.

Definition 6 Let x and s be two examples; then x∩s is defined to be the example z such that true(z) = true(x) ∩ true(s). Note that this implies that false(x∩s) = false(x) ∪ false(s).

Lemma 1 Let x and s be examples. If x violates C and s covers C, then x∩s violates C.

Proof: If s covers C, then antecedent(C) ⊆ true(s). Also, if x violates C, then antecedent(C) ⊆ true(x) and consequent(C) ∈ false(x). Thus antecedent(C) ⊆ true(s) ∩ true(x) = true(s∩x) and consequent(C) ∈ false(x) ⊆ false(s) ∪ false(x) = false(s∩x). Thus, s∩x violates C. □

Corollary 1 Let x and s be examples. If x violates C and s violates C, then x∩s violates C.

Proof: Apply Lemma 1 after noting that if s violates C then it also covers C. □

Lemma 2 If x does not cover C, then for any example s, x∩s does not violate C.

Proof: If antecedent(C) ⊄ true(x), then antecedent(C) ⊄ true(x) ∩ true(s) = true(x∩s). Thus x∩s does not violate C. □

Lemma 3 If x∩s violates C, then at least one of x and s violates C.

 1  Set S to be the empty sequence  /* s_i denotes the i-th element of S */
 2  Set H to be the empty hypothesis
 3  UNTIL equivalent(H) returns "yes" DO
 4  BEGIN  /* main loop */
 5      Let x be the counterexample returned by the equivalence query
 6      IF x violates at least one clause of H
 7      THEN  /* x is a positive example */
 8          remove from H every clause that x violates
 9      ELSE  /* x is a negative example */
10      BEGIN
11          FOR each s_i in S such that true(s_i ∩ x) is properly contained in true(s_i)
12          BEGIN
13              query member(s_i ∩ x)
14          END
15          IF any of these queries is answered "no"
16          THEN let i be the least number such that member(s_i ∩ x) was answered "no";
17               refine s_i by replacing s_i with s_i ∩ x
18          ELSE add x as the last element in the sequence S
19          ENDIF
20          Set H to be ⋃_{s ∈ S} clauses(s), where clauses(s) = {(⋀_{v ∈ true(s)} v) ⇒ z : z ∈ false(s)}
21      END
22      ENDIF
23  END  /* main loop */
24  Return H

Figure 1: Algorithm for Learning Horn Sentences

3 The Algorithm

Let H∗ be the target Horn sentence with respect to which equivalence and membership queries are answered. The algorithm is based on the following ideas. Every negative example x violates some clause C of H∗. From x we would like to add the clause C to our current hypothesis, but we cannot exactly determine C from x alone. We know, however, that antecedent(C) ⊆ true(x) and consequent(C) ∈ false(x). Thus one approach would be to add to our current hypothesis H all elements of the set clauses(x) = {(⋀_{v ∈ true(x)} v) ⇒ z : z ∈ false(x)}² whenever a new negative counterexample x is obtained.

²The clause (⋀_{v ∈ true(x)} v) ⇒ F is in the set clauses(x), since by convention F ∈ false(x). This particular clause is meant to cover the case where the clause C that is violated by x contains no unnegated variables.

There are two problems with this approach. One of the elements of clauses(x) that is added is (⋀_{v ∈ true(x)} v) ⇒ consequent(C), where C is some clause of H∗ that x violates. This clause may be less restrictive than C because its antecedent may be more restrictive. Thus, the negative examples that fail to satisfy this clause may be only a small fraction of those that fail to satisfy C, and the clause added to H is only an approximation of C. Very many such approximations to the target clause C may be generated by the examples.
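Figure 1 admits a compact executable rendering. The sketch below is our reading of the pseudocode, not the authors' code; it reuses violates, satisfies, and SimulatedOracles from the earlier sketches, and builds clauses(s) exactly as in line 20, with None again standing for F:

```python
def clauses(s, variables):
    """clauses(s) = {(AND of true(s)) ⇒ z : z ∈ false(s)}; None stands for F."""
    return {(frozenset(s), z)
            for z in [v for v in variables if v not in s] + [None]}

def learn_horn(variables, oracles):
    """Exact learning of a Horn sentence, following Figure 1."""
    S = []        # sequence of negative examples (frozensets of true variables)
    H = set()     # current hypothesis: a set of (antecedent, consequent) clauses
    while True:
        x = oracles.equivalent(H)          # lines 3 and 5
        if x is None:                      # equivalence query answered "yes"
            return H
        violated = {c for c in H if violates(x, c)}
        if violated:                       # x is a positive counterexample
            H -= violated                  # line 8
        else:                              # x is a negative counterexample
            for i, s in enumerate(S):      # lines 11-17: querying left to right
                # and stopping at the first "no" refines the same (least) index
                # as Figure 1's two-phase formulation
                if (s & x) < s and not oracles.member(s & x):
                    S[i] = s & x           # line 17
                    break
            else:
                S.append(x)                # line 18
            H = set().union(*(clauses(s, variables) for s in S))  # line 20
```

On a small illustrative target the learner recovers an equivalent sentence:

```python
# Illustrative target: (v1 ∧ v2 ⇒ v3) ∧ (v3 ⇒ F) over variables v1, v2, v3.
vs = ["v1", "v2", "v3"]
target = {(frozenset({"v1", "v2"}), "v3"), (frozenset({"v3"}), None)}
H = learn_horn(vs, SimulatedOracles(vs, target))
assert all(satisfies(x, H) == satisfies(x, target)
           for x in map(frozenset, [(), ("v1",), ("v1", "v2"), ("v1", "v2", "v3")]))
```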

This latter problem (the proliferation of approximations) is dealt with by searching for a set of negative examples with as few true variables as possible; these correspond to better approximations to clauses of H∗. A new negative example is used to attempt to refine previously obtained negative examples by intersection: each such intersection, if it actually contains fewer true variables than the previously obtained negative example, is then tested to see whether it is negative (using a membership query). If so, it is a candidate to refine the previously obtained negative example.

The algorithm maintains a sequence S of negative examples. Each new negative counterexample is used either to refine one element of S, or is added to the end of S. In order to learn all of the clauses of H∗, we would like the clauses induced by the (negative) examples in S to approximate distinct clauses of H∗. This will happen if the examples in S violate distinct clauses of H∗. Overzealous refinement may result in several examples in S violating the same clause of H∗. To avoid this, whenever a new negative counterexample could be used to refine several examples in the sequence S, only the leftmost among these is refined.

4 Correctness and Running Time

First observe that the algorithm terminates only if the hypothesis and the target Horn sentence H∗ are equivalent. Therefore, if the algorithm terminates, it is correct. To show termination in polynomial time we first prove a couple of technical lemmas.

Lemma 4 For each execution of the main loop of line 3, the following holds. Suppose that in step 5 of the algorithm a negative example x is obtained such that for some clause C of H∗ and for some s_i ∈ S, x violates C and s_i covers C. Then there is some j ≤ i such that in step 17 the algorithm will refine s_j by replacing s_j with s_j∩x.

Proof: The proof is by induction on the number of iterations k of the main loop of line 3. If k = 1, then the lemma is vacuously true, since the sequence S is empty upon execution of step 5. Assume inductively that the lemma holds for iterations 1, 2, ..., k − 1 of the main loop, and assume that during the k-th execution of the loop, at step 5 a negative example x is obtained such that for some clause C of H∗ and for some s_i ∈ S, x violates C and s_i covers C. Clearly, if in step 17 of the k-th iteration the algorithm refines some s_j where j < i, then we are done. Suppose that this does not happen. Now by Lemma 1, we know that s_i∩x is a negative example. It only remains to be shown that true(s_i∩x) is properly contained in true(s_i), for then s_i will be refined in step 17.

Observe that each time the sequence S is modified, step 20 of the algorithm discards the old hypothesis and constructs a new hypothesis H from the elements currently in S. Further observe that during each execution of the main loop of line 3, either S is modified (lines 9–21), or else a clause is removed from H (line 8). Let j < k be the last execution of the main loop of line 3 during which S was modified. Then, during the j-th iteration, line 20 was executed and H was reconstructed from S. At this time a clause C′ = (⋀_{v ∈ true(s_i)} v) ⇒ consequent(C) was included in H, where C is the clause that x and s_i both violate. Now C logically implies C′, so C′ could not have been removed in line 8 during iterations j + 1, ..., k of the main loop. Since the equivalence query only returns examples in the symmetric difference between the hypothesis H and the target H∗, a negative example obtained in line 5 satisfies every clause of H. By assumption, x violates C; thus consequent(C) ∈ false(x).
But now if true(s_i) ⊆ true(x), then x would violate C′, a contradiction. Therefore, true(s_i∩x) is properly contained in true(s_i). Thus the algorithm will replace s_i by s_i∩x in line 17. □

Lemma 5 Let S be a sequence of elements constructed for the target H∗ by the algorithm. Then:

1. ∀k, ∀i < k, ∀C ∈ H∗: if s_k violates C, then s_i does not cover C.

2. ∀k, ∀i ≠ k, ∀C ∈ H∗: if s_k violates C, then s_i does not violate C.

Proof: The proof is by induction. We will show that properties 1 and 2 are preserved under any modifications the algorithm makes to the sequence S. Initially the sequence is empty, so both properties hold vacuously. Now suppose that the properties hold for some sequence, and suppose that the algorithm modifies the sequence in response to seeing the negative example x.

If the algorithm appends x to the sequence as, say, s_t, then suppose by way of contradiction that property 1 fails to hold. Inductively, the only way that property 1 could now fail to hold is if there is some i < t such that s_i covers some clause C of H∗ that s_t violates. But this means s_i∩x violates C. This together with Lemma 4 contradicts the fact that the algorithm did not replace s_j by s_j∩x for some j ≤ i. Thus property 1 is preserved. Now suppose by way of contradiction that property 2 fails to hold. Inductively, the only way property 2 could now fail to hold is if there is some i < t such that s_i and s_t both violate some clause C of H∗.

Since s_i violates C, it also covers C. Then, by Lemma 4, some s_j with j ≤ i would have been refined instead of x = s_t being added to S, a contradiction. Thus property 2 is preserved.

Now suppose that instead of appending x to the sequence, the algorithm replaces some s_k with s_k∩x. Suppose by way of contradiction that property 1 fails to hold. There are two possibilities: either there is some i < k such that s_i covers and s_k∩x violates some particular clause C of H∗, or there is some i > k such that s_k∩x covers and s_i violates some particular clause C of H∗. If the former case holds, then by Lemma 3 either x violates C or s_k violates C. If x violates C, then (since s_i covers C) by Lemma 4 there must be some j ≤ i < k such that s_j was refined instead of s_k, a contradiction. On the other hand, if s_k violates C, then the fact that s_i and s_k both violate C contradicts the inductive assumption that property 2 held before the modification. Now consider the latter possibility, namely that there is some i > k such that s_k∩x covers and s_i violates some clause C of H∗. Then by (the contrapositive of) Lemma 2, s_k covers C. Since s_i violates C and i > k, this contradicts the inductive assumption that property 1 held before the modification. Thus, property 1 is preserved.

Finally, suppose that the algorithm replaces some s_k with s_k∩x and suppose by way of contradiction that property 2 no longer holds. If this is the case, then there is some i ≠ k such that s_i and s_k∩x both violate some particular clause C of H∗. By Lemma 2, s_k covers C. Further, by Lemma 3, at least one of s_k and x must violate C. If s_k violates C, the inductive assumption that property 2 held before the modification is contradicted by the fact that s_i also violates C. On the other hand, suppose x violates C. If i > k, then the fact that s_k covers C contradicts the inductive assumption that property 1 held before the modification. If i < k, then Lemma 4 and the fact that s_i violates (and hence covers) C contradict the fact that the algorithm did not replace s_j by s_j∩x for some j ≤ i. Thus, property 2 is preserved. □

Corollary 2 At no time do two distinct elements in S violate the same clause of H∗.

Proof: This is property 2 of Lemma 5. □

Lemma 6 Every element of S violates at least one clause of H∗.

Proof: Each of the elements in S is a negative example; thus by Proposition 1, each of the elements violates some clause of H∗. □

Lemma 7 If H∗ has m clauses, then at no time are there more than m elements in the sequence S.

Proof: This follows immediately from the fact that each of the elements in S violates some clause of H∗ but no two elements violate the same clause of H∗. □

Finally, we have our theorem.

Theorem 1 A Horn sentence consisting of m clauses over n variables can be learned exactly in time O(m³n⁴) using O(m²n²) equivalence queries and O(m²n) membership queries.³

³These bounds are improved in the full paper to Õ(m²n²) time, O(mn) equivalence queries, and O(m²n) membership queries.

Proof: The only changes to the sequence S during any run of the algorithm involve either appending a new element to S or refining an existing element. Thus |S| cannot decrease during any execution of the main loop of the algorithm. But Lemma 7 shows that there are at most m elements of S at any time. Thus line 18 is executed at most m times. Now observe that whenever any element s_i of the sequence S is refined (line 17), the resulting new i-th element is s_i∩x, which, by line 11, must contain strictly fewer variables assigned the value true than s_i. This can happen at most n times for each element of S. Thus line 17 is executed at most nm times.
Whenever the ELSE clause at line 9 is executed, either line 17 or line 18 must be executed. It follows that lines 9–21 are executed at most nm + m = (n + 1)m times. Note that this bounds the total number of membership queries made by (n + 1)m². Next observe that for any element s of S, the cardinality of false(s) is at most n + 1 (recalling that F ∈ false(s)). Thus the cardinality of clauses(s) is at most n + 1. Therefore, the number of clauses in any hypothesis H constructed in line 20 is at most (n + 1)m. Now, since each positive counterexample obtained in line 5 necessarily causes at least one clause to be removed from H by line 8, the equivalence query can produce at most (n + 1)m positive counterexamples between modifications to S. Therefore, line 8 is executed at most (n + 1)²m² times. Since each execution of line 3 that does not result in termination causes execution of line 8 or lines 9–21, the total number of executions of line 3 (and hence the total number of equivalence queries made) is at most (n + 1)²m² + (n + 1)m + 1.

To complete the proof we need only show that the time needed for each execution of the main loop is O(n²m). Using the facts (above) that at any time during the execution of the algorithm |S| ≤ m and |H| ≤ (n + 1)m, and that each clause of H consists of at most n + 1 variables (antecedent plus consequent), it is easily verified that the time needed to execute either of steps 8 and 20 is O(n²m), and that these steps dominate the time to execute one iteration of the main loop. □

5 Conclusions

A polynomial-time algorithm for learning Horn sentences using equivalence and membership queries was presented. By the results of Angluin [1, 2], neither type of query alone is sufficient to allow exact learning in polynomial time. The algorithm may be used to obtain an algorithm for PAC-learning or polynomial prediction [8, 12] of Horn sentences from randomly generated examples, provided that membership queries are also available to the algorithm. If membership queries are not available, it is an open problem whether Horn sentences are PAC-learnable or polynomial-time predictable (from random examples alone). By the reductions of Kearns, Li, Pitt, and Valiant [10], PAC-learnability of Horn sentences would imply PAC-learnability of general CNF and DNF sentences, and similarly for polynomial predictability. It is also an open problem whether general CNF or DNF formulas are PAC-learnable or polynomially predictable on randomly generated examples when membership queries are available.

For any k, let k-quasi-Horn be the class of CNF formulas where each clause contains at most k unnegated literals. Thus 1-quasi-Horn is just the class of Horn sentences, and is learnable using equivalence and membership queries. We have shown using prediction-preserving reductions [12] that if our algorithm could be extended to learn the class of 2-quasi-Horn formulas using equivalence and membership queries, then the general class of CNF formulas (and DNF formulas) would be polynomially predictable by an algorithm that uses membership queries and randomly generated examples.

Finally, we are currently investigating the possibility of extending the algorithm here to handle restricted types of universally quantified Horn sentences (see the papers of Valiant [15] and Haussler [7] for related classes of formulas). This class is of significant interest due to its similarity to the language Prolog, and its use in logic programming and expert system design.

References

[1] D. Angluin. Negative results for equivalence queries. Technical Report YALE/DCS/RR-648, Department of Computer Science, Yale University, September 1988. To appear, Machine Learning. A preliminary version appears in the Proceedings of the 1989 Workshop on Computational Learning Theory.

[2] D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.

[3] D. Angluin. Requests for hints that return no hints. Technical Report YALE/DCS/RR-647, Department of Computer Science, Yale University, 1988.

[4] D. Angluin, L. Hellerstein, and M. Karpinski. Learning read-once formulas with queries. Technical Report 89/528, University of California at Berkeley, 1989. (Also International Computer Science Institute Technical Report TR-89-050.)

[5] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik–Chervonenkis dimension. J. ACM, 36(4):929–965, October 1989.

[6] D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177–221, 1988.

[7] D. Haussler. Learning conjunctive concepts in structural domains. Machine Learning, 4(1):7–40, October 1989.
[8] D. Haussler, N. Littlestone, and M. K. Warmuth. Predicting {0,1} functions on randomly drawn points. In Proceedings of the 29th Annual IEEE Symposium on Foundations of Computer Science, pages 100–109, Washington, D.C., October 1988. IEEE Computer Society Press.

[9] D. Haussler and L. Pitt, editors. Proceedings of the 1988 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA, 1988.

[10] M. Kearns, M. Li, L. Pitt, and L. G. Valiant. On the learnability of Boolean formulae. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, New York, May 1987. ACM.

[11] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1987.

[12] L. Pitt and M. K. Warmuth. Prediction preserving reducibility. Technical Report UCSC-CRL-88-26, University of California, Santa Cruz, November 1988. Preliminary version appeared in Proceedings of the 3rd Annual IEEE Conference on Structure in Complexity Theory, pages 60–69, June 1988. To appear in J. Comput. Syst. Sci.

[13] R. L. Rivest, D. Haussler, and M. K. Warmuth, editors. Proceedings of the 1989 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA, 1989.

[14] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[15] L. G. Valiant. Learning disjunctions of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, Vol. I, pages 560–566, Los Angeles, California, August 1985.