
Introduction to Rule-Based Systems

Using a set of assertions, which collectively form the working memory, and a set of rules that specify how to act on the assertions, a rule-based system can be created. Rule-based systems are fairly simple, consisting of little more than a set of if-then statements, but they provide the basis for so-called expert systems, which are widely used in many fields. The concept of an expert system is this: the knowledge of an expert is encoded into the rule set. When exposed to the same data, the expert system will perform in a manner similar to the expert.

Rule-based systems are a relatively simple model that can be adapted to any number of problems. As with any AI technique, a rule-based system has strengths as well as limitations that must be considered before deciding whether it is the right technique for a given problem. Overall, rule-based systems are really only feasible for problems in which all knowledge of the problem area can be written in the form of if-then rules, and in which the problem area is not large. If there are too many rules, the system becomes difficult to maintain and can suffer a performance hit.

To create a rule-based system for a given problem, you must have (or create) the following:

1. A set of facts to represent the initial working memory. This should be anything relevant to the beginning state of the system.
2. A set of rules. This should encompass any and all actions that should be taken within the scope of the problem, but nothing irrelevant. The number of rules in the system can affect its performance, so you don't want any that aren't needed.
3. A condition that determines that a solution has been found or that none exists. This is necessary to terminate rule-based systems that would otherwise find themselves in infinite loops.

Theory of Rule-Based Systems

The rule-based system itself uses a simple technique. It starts with a rule-base, which contains all of the appropriate knowledge encoded into if-then rules, and a working memory, which may or may not initially contain any data, assertions, or initially known information. The system examines all the rule conditions (IF parts) and determines a subset, the conflict set, of rules whose conditions are satisfied based on the working memory. From this conflict set, one rule is triggered (fired).
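As a minimal sketch of this match-resolve-act loop (Python, with illustrative names; the original describes no particular implementation):

```python
# Minimal sketch of the recognize-act cycle described above.
# Names (Rule, run, ...) are illustrative, not from the original text.
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    conditions: frozenset   # assertions that must all be in working memory
    actions: tuple          # assertions to add when the rule fires

def run(rules, working_memory, resolve):
    """Fire rules until no rule applies. `resolve` picks one rule from
    the conflict set (the conflict resolution strategy)."""
    while True:
        # Match: rules whose conditions are satisfied. Skipping rules whose
        # actions would not change memory is a simple guard against the
        # infinite-loop problem discussed below.
        conflict_set = [r for r in rules
                        if r.conditions <= working_memory
                        and not set(r.actions) <= working_memory]
        if not conflict_set:
            return working_memory
        rule = resolve(conflict_set)           # conflict resolution
        working_memory |= set(rule.actions)    # act: modify working memory
```

The `resolve` argument is the conflict resolution strategy; concrete choices are sketched in the next section.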

Which rule is chosen is based on a conflict resolution strategy. When a rule fires, any actions specified in its THEN clause are carried out. These actions can modify the working memory, the rule-base itself, or do just about anything else the system programmer decides to include. This loop of firing rules and performing actions continues until one of two conditions is met: there are no more rules whose conditions are satisfied, or a rule fires whose action specifies that the program should terminate.

Which rule fires is a function of the conflict resolution strategy. Which strategy is chosen may be determined by the problem, or it may be a matter of preference. In any case, it is vital, as it controls which of the applicable rules fire and thus how the entire system behaves. There are several different strategies; here are a few of the most common:

First Applicable: If the rules are in a specified order, firing the first applicable one allows control over the order in which rules fire. This is the simplest strategy, but it has the potential for a large problem: an infinite loop on the same rule. If the working memory remains the same, as does the rule-base, then the conditions of the first rule have not changed and it will fire again and again. To solve this, it is common practice to suspend a fired rule and prevent it from re-firing until the data in working memory that satisfied the rule's conditions has changed.

Random: Though it doesn't provide the predictability or control of the first-applicable strategy, it does have its advantages. For one thing, its unpredictability is an advantage in some circumstances (such as games). A random strategy simply chooses a single random rule to fire from the conflict set. Another possibility is a fuzzy rule-based system, in which each rule has a probability, so that some rules are more likely to fire than others.

Most Specific: This strategy is based on the number of conditions of the rules. From the conflict set, the rule with the most conditions is chosen, on the assumption that if it has the most conditions, it has the most relevance to the existing data.

Least Recently Used: Each rule is accompanied by a time or step stamp that marks the last time it was used. This maximizes the number of individual rules that fire at least once. If all rules are needed for the solution of a given problem, this is a perfect strategy.

"Best" Rule: For this to work, each rule is given a weight, which specifies how much it should be considered over the alternatives. The rule with the most preferable outcome is chosen based on this weight.
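Under the same assumptions, these strategies can be written as interchangeable `resolve` functions for the loop sketched earlier (helper names are illustrative):

```python
# Hedged sketches of the conflict resolution strategies described above,
# each usable as the `resolve` argument of run().
import random

def first_applicable(conflict_set):
    # Rules are assumed to be kept in their specified order.
    return conflict_set[0]

def random_choice(conflict_set):
    return random.choice(conflict_set)

def most_specific(conflict_set):
    # Prefer the rule with the most conditions.
    return max(conflict_set, key=lambda r: len(r.conditions))

def make_least_recently_used():
    last_fired = {}   # rule name -> step stamp of its last firing
    step = 0
    def resolve(conflict_set):
        nonlocal step
        step += 1
        rule = min(conflict_set, key=lambda r: last_fired.get(r.name, -1))
        last_fired[rule.name] = step
        return rule
    return resolve

def best_rule(weights):
    # `weights`: rule name -> preference weight, assumed set by the designer.
    def resolve(conflict_set):
        return max(conflict_set, key=lambda r: weights.get(r.name, 0))
    return resolve
```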

3 "Best" rule: For this to work, each rule is given a weight, which specifies how much it should be considered over the alternatives. The rule with the most preferable outcomes is chosen based on this weight. Methods of Rule-Based Systems Forward-Chaining Rule-based systems, as defined above, are adaptable to a variety of problems. In some problems, information is provided with the rules and the AI follows them to see where they lead. An example of this is a medical diagnosis in which the problem is to diagnose the underlying disease based on a set of symptoms (the working memory). A problem of this nature is solved using a forward-chaining, data-driven, system that compares data in the working memory against the conditions (IF parts) of the rules and determines which rules to fire. For an example of forward-chaining, see Appendix A. Backward-Chaining In other problems, a goal is specified and the AI must find a way to achieve that specified goal. For example, if there is an epidemic of a certain disease, this AI could presume a given individual had the disease and attempt to determine if its diagnosis is correct based on available information. A backward-chaining, goal-driven, system accomplishes this. To do this, the system looks for the action in the THEN clause of the

Which method to use?

Of the two methods available, forward- or backward-chaining, the one to use is determined by the problem itself. A comparison of conditions to actions in the rule-base can help determine which chaining method is preferred. If the average rule has more conditions than conclusions, that is, if the typical hypothesis or goal (the conclusions) can lead to many more questions (the conditions), forward-chaining is favored.

If the opposite holds true, and the average rule has more conclusions than conditions, so that each fact may fan out into a large number of new facts or actions, backward-chaining is ideal. If neither is dominant, the number of facts in the working memory may help the decision. If all (relevant) facts are already known, and the purpose of the system is to find where that information leads, forward-chaining should be selected. If, on the other hand, few or no facts are known and the goal is to find whether one of many possible conclusions is true, use backward-chaining.

Improving Efficiency of Forward-Chaining

Forward-chaining systems, as powerful as they can be when well designed, can become cumbersome if the problem is too large. As the rule-base and working memory grow, the brute-force method of checking every rule condition against every assertion in the working memory becomes quite computationally expensive. Specifically, the computational complexity is of the order of R * A^C, where R is the number of rules, C is the approximate number of conditions per rule, and A is the number of assertions in working memory. With this exponential complexity, for a rule-base of any realistic size the system will perform quite slowly.

There are ways to reduce this complexity, making a system of this nature far more feasible for real problems. The most effective such solution is the Rete algorithm. The Rete algorithm reduces the complexity by reducing the number of comparisons between rule conditions and assertions in the working memory. To accomplish this, the algorithm stores a list of rules matched or partially matched by the current working memory. It thus avoids the unnecessary computation of re-checking already matched rules (they are already activated) or unmatched rules (their conditions cannot be satisfied under the existing assertions in working memory). Only when the working memory changes does it re-check the rules, and then only against the assertions added to or removed from working memory. All told, this method drops the complexity to O(R*A*C), linear rather than exponential.

The Rete algorithm, however, requires additional memory to store the state of the system from cycle to cycle. The additional memory can be considerable, but may be justified by the increased speed. For large problems in which speed is a factor, the Rete method is justified. For small problems, or those in which speed is not an issue but memory is, the Rete method may not be the best option. Another unfortunate shortcoming of the Rete method is that it only works with forward-chaining.
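The full Rete network is beyond the scope of this note, but the core caching idea, re-examining only the rules that mention an added or removed assertion, can be sketched as follows (a toy illustration under the assumptions of the earlier Rule structure, not the actual algorithm):

```python
# Toy illustration of the caching idea behind Rete: each rule keeps a
# count of currently satisfied conditions, updated only for the
# assertions that change, so unchanged rules are never re-examined.
from collections import defaultdict

class IncrementalMatcher:
    def __init__(self, rules):
        self.rules = rules
        self.watchers = defaultdict(list)   # assertion -> rules testing it
        self.satisfied = {r.name: 0 for r in rules}
        for r in rules:
            for cond in r.conditions:
                self.watchers[cond].append(r)

    def add(self, assertion, memory):
        if assertion in memory:
            return
        memory.add(assertion)
        for r in self.watchers[assertion]:
            self.satisfied[r.name] += 1

    def remove(self, assertion, memory):
        if assertion not in memory:
            return
        memory.discard(assertion)
        for r in self.watchers[assertion]:
            self.satisfied[r.name] -= 1

    def conflict_set(self):
        # Fully matched rules, read off the cached counts.
        return [r for r in self.rules
                if self.satisfied[r.name] == len(r.conditions)]
```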

Building Rule-Based Systems with Identification Trees

Semantic Network

A semantic network is the most basic structure upon which an identification tree (hereafter referred to as an ID tree) is based. Simply put, a semantic network consists of nodes representing objects and links representing the relations between those objects. In the sample semantic network (diagram omitted), Zoom is a feline; a feline is a mammal; a mammal is an animal. Zoom chases a mouse; a mouse is a mammal; a mammal is an animal. Zoom eats fish; a fish is an animal. The relations are written on the lines: "is a", "is an", "eats", "chases". The nodes (circles) are the objects.

Semantic Tree

At the next level of complexity is a semantic tree, which is simply a semantic network with a few additional conditions and terms. Each node has a parent to which it is linked (with the exception of the root node, which is its own parent and needs no link). Each link connects a parent node with its children. A single parent node may have multiple children, but no child may have multiple parents. Nodes with no children are the leaf nodes. The difference between a tree and a network is this: a network can have loops, a tree cannot.

In the example tree (diagram omitted), the root node is marked as such. It is parent to itself, A, and B. A is child to the root and parent to C. B is child to the root and parent to D and E. C is a child of A and has no children of its own, making it a leaf node. D is parent to F, which is parent to leaf nodes I and J. E is parent to leaf nodes G and H.

Decision Tree

Above semantic trees comes the decision tree. Each node of a decision tree is linked to a set of possible solutions. Each parent node (that is, each node that is not a leaf and thus has children) is associated with a test that splits the set of possible answers into subsets representing every possible outcome of the test.
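As a minimal sketch of these constraints (illustrative names, not from the text), here is a node type in which the root is its own parent and every other node registers with exactly one parent:

```python
# Hedged sketch of the semantic-tree rules above: one parent per node,
# the root is its own parent, and a node with no children is a leaf.

class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent if parent is not None else self  # root owns itself
        self.children = []
        if parent is not None:
            parent.children.append(self)  # single-parent link; no loops arise

    def is_leaf(self):
        return not self.children

# The example tree from the text:
root = Node("root")
a, b = Node("A", root), Node("B", root)
c = Node("C", a)
d, e = Node("D", b), Node("E", b)
f = Node("F", d)
i, j = Node("I", f), Node("J", f)
g, h = Node("G", e), Node("H", e)
```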

Each non-leaf node serves as a test leading toward one of the leaf outcomes.

Identification Trees

Last, but not least, an ID tree is a decision tree in which all possible divisions are created by training the tree against a list of known data. The purpose of an ID tree is to take a set of sample data, classify the data, and construct a series of tests to classify an unknown object based on like properties.

Training ID Trees

First, the tree must be created and trained. It must be provided with sufficient labeled samples, which are used to create the tree itself. It does this by dividing the samples into subsets based on features. The sets of samples at the leaves of the tree define a classification. The tree is created according to Occam's Razor, which (adapted for ID trees) states that the simplest (smallest) tree consistent with the training samples is the best predictor. To find the smallest tree, one could generate every possible tree for the data set, examine each one, and choose the smallest; however, this is expensive and wasteful. The solution, therefore, is to greedily create one small tree:

At each node, pick a test whose branches are as close as possible to a single classification: split into the subsets with the least disorder, and choose the test that minimizes that disorder.

Then, until each leaf node contains a set that is homogeneous or nearly homogeneous: select a non-homogeneous leaf node and split its set into two or more subsets so as to minimize disorder.

Since the goal of an ID tree is to generate homogeneous subsets, we want to calculate how non-homogeneous the subsets created by each test are. The test that minimizes the disorder is the one that divides the samples into the cleanest categories. Disorder is calculated as follows:

Average disorder = Σ_b (n_b / n_t) * ( Σ_c -(n_bc / n_b) log2(n_bc / n_b) )

where:
n_b is the number of samples in branch b
n_t is the total number of samples in all branches
n_bc is the number of samples in branch b belonging to class c

For an example of training an ID tree, see Appendix C.
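As a concrete reading of this formula, here is a small sketch (names are illustrative) that computes the average disorder of a candidate split, where each branch is given as the list of class labels of the samples falling into it:

```python
# Hedged sketch of the average-disorder formula above.
from math import log2

def avg_disorder(branches):
    """branches: list of branches, each a list of class labels."""
    total = sum(len(b) for b in branches)       # n_t
    result = 0.0
    for branch in branches:                     # branch b
        nb = len(branch)                        # n_b
        disorder = 0.0
        for cls in set(branch):                 # class c
            p = branch.count(cls) / nb          # n_bc / n_b
            disorder -= p * log2(p)
        result += (nb / total) * disorder
    return result
```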

ID Trees to Rules

Once an ID tree has been constructed successfully, it can be used to generate a rule-set that performs the same classifications as the ID tree. This is done by creating a single rule for each path from the root to a leaf of the ID tree. For an example of this, see Appendix D.

Pruning Unnecessary Conditions

If a rule has conditions that are inconsequential to its outcome, discard them, thus simplifying the rule (and so improving efficiency). This is accomplished by showing that the outcome is independent of the given condition. Events A and B are independent if the probability of event B does not change given that event A occurs:

P(B|A) = P(B)

This states that the probability of event B, given that event A occurs, is equal to the probability that event B occurs by itself. If this holds true, then event A does not affect whether or not event B occurs. If A is a condition and B is a result, then A can be discarded without affecting the rule. For an example of this, see Appendix E.
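As an illustration, here is a hedged sketch of this independence test estimated from a table of samples. Note that the worked example in Appendix E computes its ratios over all eight samples, whereas this sketch uses the standard conditional form; names are illustrative:

```python
# Hedged sketch: estimate P(B|A) and P(B) from a sample table and
# compare them. Each sample is a dict of attribute -> value.

def independent(samples, cond, result):
    """`cond` and `result` are (attribute, value) pairs. Assumes at
    least one sample satisfies `cond`."""
    def holds(sample, pair):
        attr, value = pair
        return sample[attr] == value

    with_cond = [s for s in samples if holds(s, cond)]
    p_b = sum(holds(s, result) for s in samples) / len(samples)
    p_b_given_a = sum(holds(s, result) for s in with_cond) / len(with_cond)
    return p_b_given_a == p_b   # exact equality, as in the text's examples
```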

Pruning Unnecessary Rules

If two or more rules share the same end result, you may be able to replace them with a single rule that fires in the event that no other rule fires:

if (no other rule fires)
then (execute these common actions)

If there is more than one such group of rules, replace only one group. Which one is determined by some heuristic tiebreaker; two such tiebreakers follow.

Replace the larger of the two groups. If group A has six rules that share a common result and group B has only five, replacing the larger group A will eliminate more rules and simplify the rule-base the most.

Replace the group with the highest average number of rule conditions. While more rules may remain, the rules that remain will be simpler, as they have fewer conditions. For example, given the rules:

if (x) and (y) and (z) then (A)
if (m) and (n) and (o) then (A)

vs.

if (p) then (Z)
if (q) then (Z)

you would want to replace the first set with:

if (no other rule fires) then (A)

For an example of this, see Appendix F.

With enough training data, an ID tree can be created which, in turn, can be used to create a rule-base for classification. From then on, using forward-chaining, a new entity can be introduced as an assertion in the knowledge base and classified as if by the ID tree. Using backward-chaining, one could instead find evidence to support the claim that a given classification is valid.

Conclusion

I have heard a few people, including some of my classmates, say that rule-based and expert systems are obsolete, and that ID trees are a thing of the past. Granted, this is not the direction in which most research is moving, but that doesn't negate the existing accomplishments of these architectures. As it stands, expert rule-based systems are the most widely used and accepted AI in the world outside of games. The fields of medicine, finance, and many others have benefited greatly from intelligent use of such systems. With the combination of rule-based systems and ID trees, there is great potential in most fields.

Appendices

Appendix A -- Forward-Chaining Example: Medical Diagnosis

Assertions (Working Memory):
A1: runny nose
A2: temperature = 101.7
A3: headache
A4: cough

Rules (Rule-Base):
R1: if (nasal congestion) (viremia) then diagnose (influenza), exit
R2: if (runny nose) then assert (nasal congestion)
R3: if (body-aches) then assert (achiness)
R4: if (temp > 100) then assert (fever)
R5: if (headache) then assert (achiness)
R6: if (fever) (achiness) (cough) then assert (viremia)
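Assuming the Rule class, run() loop, and first_applicable strategy from the earlier sketches are in scope, this knowledge base might be encoded like so (illustrative, not from the original; the numeric temperature test is pre-evaluated into the assertion "temp > 100", and R1's exit is handled by the loop stopping once nothing new can fire):

```python
# Hedged demo: the Appendix A rules wired into the run() sketch.
rules = [
    Rule("R1", frozenset({"nasal congestion", "viremia"}), ("influenza",)),
    Rule("R2", frozenset({"runny nose"}), ("nasal congestion",)),
    Rule("R3", frozenset({"body-aches"}), ("achiness",)),
    Rule("R4", frozenset({"temp > 100"}), ("fever",)),
    Rule("R5", frozenset({"headache"}), ("achiness",)),
    Rule("R6", frozenset({"fever", "achiness", "cough"}), ("viremia",)),
]
memory = {"runny nose", "temp > 100", "headache", "cough"}
print(run(rules, memory, first_applicable))  # memory ends up containing "influenza"
```

Run this way, the rules fire in the order R2, R4, R5, R6, R1, matching the execution trace below.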

Execution:
1. R2 fires, adding (nasal congestion) to working memory.
2. R4 fires, adding (fever) to working memory.
3. R5 fires, adding (achiness) to working memory.
4. R6 fires, adding (viremia) to working memory.
5. R1 fires, diagnosing the disease as (influenza), and exits, returning the diagnosis.

Appendix B -- Backward-Chaining Example: Medical Diagnosis

Use the same rules and assertions as in Appendix A.

Hypothesis/Goal: diagnose (influenza)

Execution:
1. R1 fires, since the goal, diagnose(influenza), matches the conclusion of that rule. New goals are created, (nasal congestion) and (viremia), and backward chaining is called recursively with these new goals.
2. R2 fires, matching goal (nasal congestion). A new goal is created: (runny nose). Backward chaining is called recursively; since (runny nose) is in working memory, it returns true.
3. R6 fires, matching goal (viremia). Backward chaining recurses with the new goals (fever), (achiness), and (cough).
4. R4 fires, adding goal (temp > 100). Since (temperature = 101.7) is in working memory, it returns true.
5. R3 fires, adding goal (body-aches). On recursion, there is no information in working memory, and no rules match this goal, so it returns false and the next matching rule is chosen. That rule is R5, which fires, adding goal (headache). Since (headache) is in working memory, it returns true.
6. Goal (cough) is in working memory, so it returns true.
7. Now that all recursive calls have returned true, the system exits, returning true: the hypothesis was correct, and the subject has influenza.

Appendix C -- Identification Tree Training

The identification tree will be trained on the data in the original table (not reproduced in this transcription): eight balls, each described by Size, Color, Weight, and Rubber attributes, and labeled with whether or not it bounces.

We greedily create a small ID tree from this data. For each column (except the first and last, since the first is simply an identifier and the last is the result we're trying to predict) we create a one-test tree based solely on the divisions within that category. From each of these trees we calculate the disorder of the split. For the Size test:

Size disorder = Σ_b (n_b / n_t) * ( Σ_c -(n_bc / n_b) log2(n_bc / n_b) )
              = (4/8) * ( -(2/4) log2(2/4) - (2/4) log2(2/4) ) + (1/8) * 0 + (3/8) * 0
              = 0.5

The disorder for the Size test is 0.5. The disorders for all four tests are:

Size: 0.5
Color: 0.69
Weight: 0.94
Rubber: 0.61
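Using the avg_disorder sketch from the training section, the Size figure can be checked; the branch memberships below (the four small balls split two bouncing and two not, the single medium ball bounces, the three large balls do not) are taken from the worked example:

```python
# Checking the Size disorder with the avg_disorder() sketch from above.
size_branches = [
    ["bounce", "bounce", "no", "no"],   # small: two bounce, two do not
    ["bounce"],                          # medium: the single ball bounces
    ["no", "no", "no"],                  # large: none bounce
]
print(avg_disorder(size_branches))       # 0.5
```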

Since Size gives the lowest disorder, we take that test and further break down any non-homogeneous subsets. There is only one: the small branch. Testing each remaining attribute on the small branch, the Rubber test splits those samples into perfect subsets with 0 disorder. Therefore our final, simplest ID tree representing the data tests Size at the root, with the small branch further split by the Rubber test.

Appendix D -- Creating Rules from ID Trees

Given the final ID tree from Appendix C, follow the path from the root test down to each outcome, with each node visited becoming a condition of the rule. This gives us the following rules.

First, we follow the rightmost branch from the root: size = large. Of the three balls in this branch, none bounce:

R1: if (size = large)
    then (ball does not bounce)

Next, we examine the middle branch: size = medium. There is only one ball in this branch, and it bounces. Based on this data (this may change under a larger training set), all medium balls bounce:

R2: if (size = medium)
    then (ball does bounce)

Third, we follow the leftmost branch, size = small, which leads us to another decision node. Taking its rubber = no branch gives us this rule:

R3: if (size = small) (rubber = no)
    then (ball does not bounce)

And finally, we follow the size = small branch and, at the next test, the rubber = yes branch. The following rule is produced:

R4: if (size = small) (rubber = yes)
    then (ball does bounce)

Appendix E -- Eliminating Unnecessary Rule Conditions

Given the rules produced in Appendix D, we see whether there is any way to simplify them by eliminating unnecessary rule conditions. The last two rules have two conditions each. Consider, for example, the first of these, R3:

R3: if (size = small) (rubber = no)
    then (ball does not bounce)

Looking at the probabilities with event A = (size = small) and event B = (ball does not bounce):

P(B|A) = (3 non-rubber balls do not bounce / 8 total) = 0.375
P(B)   = (3 non-rubber balls / 8 total)               = 0.375

P(B|A) = P(B), therefore B is independent of A.

If we were to eliminate the first condition, size = small, then this rule would trigger for every ball not made of rubber. There are three balls not made of rubber, balls 2, 3, and 8, and none of them bounce. Because none bounce, size does not affect the outcome, and we can eliminate that condition:

if (rubber = no)
then (ball does not bounce)

Examining the next condition, with A = (rubber = no) and B the same:

P(B|A) = (2 small balls do not bounce / 8 total) = 0.25
P(B)   = (4 small balls / 8 total)               = 0.5

P(B|A) does not equal P(B), therefore A and B are not independent.

If we eliminate this condition, (rubber = no), the rule triggers for every small ball. Of the small balls, two bounce and two do not. Therefore rubber does affect whether they bounce and cannot be eliminated: the small balls bounce only if they are rubber.

Now, the next rule with two conditions:

R4: if (size = small) (rubber = yes)
    then (ball does bounce)

Examining the probabilities, with A = (size = small) and B = (ball does bounce):

P(B|A) = (2 small balls bounce / 8 total) = 0.25
P(B)   = (4 small balls / 8 total)        = 0.5

P(B|A) does not equal P(B), therefore A and B are not independent.

If we eliminate the first condition, the rule fires for all rubber balls. Of the five rubber balls, two are small and both bounce; of the other three, one bounces and two do not. For this rule, (size = small) is important. On to the next condition, with A = (rubber = yes) and B = (ball does bounce):

P(B|A) = (3 rubber balls bounce / 8 total) = 0.375
P(B)   = (5 rubber balls / 8 total)        = 0.625

P(B|A) does not equal P(B), therefore A and B are not independent.

Eliminating the second condition makes the rule fire for all small balls. Of the four small balls, two bounce and two do not. Again, the condition is significant and cannot be dropped, so this rule must stay as it is.

Appendix F -- Eliminating Unnecessary Rules

We have the following simplified rules from Appendix E:

R1: if (size = large)
    then (ball does not bounce)
R2: if (size = medium)
    then (ball does bounce)
R3: if (rubber = no)
    then (ball does not bounce)
R4: if (size = small) (rubber = yes)
    then (ball does bounce)

Of these, we have two groups of rules such that each group shares a common result. The first group consists of rules R1 and R3; the second consists of rules R2 and R4. We can eliminate one of these groups and replace it with the rule:

if (no other rule fires)
then (perform these common actions)

Both groups have the same number of rules, two, but the second group has more conditions than the first. So we'll eliminate the second group and replace it with:

if (no other rule fires)
then (ball does bounce)

Our final rule-base is:

R1: if (size = large)
    then (ball does not bounce)

R2: if (rubber = no)
    then (ball does not bounce)
R3: if (no other rule fires)
    then (ball does bounce)

Appendix G -- Additional Online Resources

A much more in-depth examination of the Rete method
Another source on ID tree machine learning
CLIPS: a tool for building expert systems
FuzzyCLIPS: an extension of the CLIPS expert system shell
A list of papers from CiteSeer
Companion site for a book; there's a section on rule-based systems
Some class notes (not mine) on rule-based systems
Some class notes (not mine) on expert systems
