Learning Rules. Learning Rules from Decision Trees

Learning Rules

In learning rules, we are interested in learning rules of the form:

    if A1 ∧ A2 ∧ ... then C

where A1, A2, ... are the preconditions/constraints/body/antecedents of the rule and C is the postcondition/head/consequent of the rule. A first-order rule can contain variables, as in:

    if Parent(x, z) ∧ Ancestor(z, y) then Ancestor(x, y)

A Horn clause is a rule with no negations.

Learning Rules from Decision Trees

[Figure: a decision tree for the weather data. The root tests windy; one branch splits on outlook and then humidity, the other on temp and then outlook, with leaves labeled good or bad.]

Each path in a decision tree from root to a leaf can be converted to a rule, e.g.:

    if ¬windy ∧ hot ∧ sunny then bad

Why? Rules are considered more readable/understandable, and different pruning decisions can be made for different rules.
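
A minimal sketch of this path-to-rule conversion (my own illustration, not from the lecture): the nested-dict tree format and the attribute/value names below are assumptions chosen to mirror the weather example.

    def tree_to_rules(tree, conditions=()):
        """Collect one (conditions, class) rule per root-to-leaf path."""
        if not isinstance(tree, dict):            # a leaf holds a class label
            return [(list(conditions), tree)]
        attribute, branches = next(iter(tree.items()))
        rules = []
        for value, subtree in branches.items():
            rules += tree_to_rules(subtree, conditions + ((attribute, value),))
        return rules

    # Toy tree in the spirit of the weather example (structure assumed).
    weather_tree = {"windy": {
        "true":  {"outlook": {"sunny": {"humidity": {"high": "bad", "normal": "good"}},
                              "overcast": "good",
                              "rain": "bad"}},
        "false": {"temp": {"hot":  {"outlook": {"sunny": "bad", "overcast": "good"}},
                           "mild": {"outlook": {"sunny": "bad", "rain": "good"}},
                           "cool": "good"}}}}

    for conds, label in tree_to_rules(weather_tree):
        print("if " + " ∧ ".join(f"{a}={v}" for a, v in conds) + f" then {label}")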

Rule Post-Pruning

Rule post-pruning removes conditions from a rule in order to reduce its error. The full set of rules read off the tree is:

    if windy ∧ sunny ∧ high then bad
    if windy ∧ sunny ∧ normal then good
    if windy ∧ overcast then good
    if windy ∧ rain then bad
    if ¬windy ∧ hot ∧ sunny then bad
    if ¬windy ∧ hot ∧ overcast then good
    if ¬windy ∧ mild ∧ sunny then bad
    if ¬windy ∧ mild ∧ rain then good
    if ¬windy ∧ cool then good

Rule Ordering

Order rules by accuracy and coverage. The leading number below is each rule's coverage:

    4  if overcast then good
    3  if ¬windy ∧ rain then good
    3  if sunny ∧ high then bad
    2  if sunny ∧ normal then good
    2  if ¬windy ∧ cool then good
    2  if windy ∧ rain then bad
    2  if hot ∧ sunny then bad
    1  if ¬windy ∧ mild ∧ sunny then bad
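
To make the pruning step concrete, here is a rough sketch (my own, with assumed data structures, and not necessarily the lecture's exact procedure): greedily drop a condition whenever doing so does not lower the rule's accuracy on a held-out set of (attributes, label) examples.

    def covers(conditions, example):
        return all(example[attr] == val for attr, val in conditions)

    def accuracy(conditions, label, examples):
        covered = [(x, y) for x, y in examples if covers(conditions, x)]
        if not covered:
            return 0.0
        return sum(y == label for _, y in covered) / len(covered)

    def post_prune(conditions, label, examples):
        """Drop conditions as long as accuracy does not decrease."""
        conditions = list(conditions)
        improved = True
        while improved and conditions:
            improved = False
            base = accuracy(conditions, label, examples)
            for cond in list(conditions):
                trimmed = [c for c in conditions if c != cond]
                if accuracy(trimmed, label, examples) >= base:
                    conditions = trimmed       # dropping cond does not hurt
                    improved = True
                    break
        return conditions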

Sequential Covering Algorithms

Basic algorithmic idea (a sketch of this loop appears after the Considerations list below):

1. Learn one good rule.
2. Remove the examples covered by the rule.
3. Repeat until no examples are left.

The book focuses on rules that cover the positive examples, because negative/false is the default answer in logic programming languages like Prolog. However, the same algorithms can be applied to negative examples, or to multi-class datasets.

Considerations

A good rule has low error (entropy) and high coverage.
A conjunction of two or more attribute tests might be required to obtain a good rule.
A depth-first greedy search (start from no tests, then repeatedly add the best-looking test) can lead down a bad path.
In an example-driven search, choose one example and only add tests that cover that example.
A beam search maintains a list of the k best candidates, i.e., it searches k paths at a time instead of just one.
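
A compact sketch of the covering loop (my own pseudocode-style Python; learn_one_rule is an assumed callable, for instance the beam search sketched after the next slide with its extra arguments bound, and a rule is assumed to be a collection of attribute tests):

    def sequential_covering(positives, negatives, learn_one_rule):
        """Learn rules until no positive examples remain uncovered."""
        rules, remaining = [], list(positives)
        while remaining:
            rule = learn_one_rule(remaining, negatives)
            if not rule:                          # no useful rule could be found
                break
            rules.append(rule)
            # drop the positives the new rule covers (a rule is a set of tests)
            remaining = [x for x in remaining
                         if not all(x[a] == v for a, v in rule)]
        return rules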

Learn-One-Rule Beam-Search Algorithm

    Learn-One-Rule:
        best ← the empty hypothesis (no tests); frontier ← {best}
        while frontier is not empty:
            candidates ← all hypotheses generated by adding a test to a hypothesis in frontier
            best ← the best_1 hypothesis in {best} ∪ candidates
            frontier ← the best_2 k hypotheses in candidates
        return best

best_1 might be error rate. best_2 might be entropy (different only for multi-class). Both might require covering a minimum number of examples.

Learning First-Order Rules

Suppose we are interested in learning concepts involving relationships between objects:

    When one person is an ancestor of another person.
    When one number is less than another number.
    When one node can reach another node in a graph.
    When an element is in a set.

The concept involves intermediate objects and relations, so examples can't be written as a fixed number of attributes.
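
One possible rendering of this beam search in Python (my own sketch; for simplicity it uses error rate for both best_1 and best_2, and represents a hypothesis as a frozenset of (attribute, value) tests):

    def learn_one_rule(positives, negatives, tests, k=5):
        """Beam search over conjunctions of (attribute, value) tests."""
        def covered(rule, examples):
            return [x for x in examples if all(x[a] == v for a, v in rule)]

        def error_rate(rule):                     # stands in for best_1 and best_2
            p, n = len(covered(rule, positives)), len(covered(rule, negatives))
            return n / (p + n) if p + n else 1.0

        best = frozenset()                        # the empty hypothesis
        frontier = [best]
        while frontier:
            candidates = {rule | {test} for rule in frontier for test in tests
                          if test not in rule}
            candidates = {r for r in candidates if covered(r, positives)}
            if not candidates:
                break
            best = min(candidates | {best}, key=error_rate)
            frontier = sorted(candidates, key=error_rate)[:k]   # keep k best
        return best

Bound to a fixed set of tests (e.g. via functools.partial), this function can serve as the learn_one_rule argument of the covering loop sketched earlier.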

Terminology: Examples of Atoms

To say that an object has a property, e.g., -3 is negative, we write Negative(-3), where Negative is a predicate and -3 is a constant.

To say that multiple objects have a relationship, e.g., Liz is the mother of Charley, we write Mother(Liz, Charley), where Mother is a predicate and Liz and Charley are constants.

To say that attributes of objects have a relationship, e.g., Liz's computer is faster than Charley's, we write Faster(computer(Liz), computer(Charley)), where Faster is a predicate, computer is a function, and Liz and Charley are constants.

Terminology: Examples of Rules

To say that negative numbers are less than positive numbers, we write:

    Less(x, y) ← Negative(x) ∧ Positive(y)

where x and y are variables. Do not write it this way:

    Less(Negative(x), Positive(y))

The rule for grandmother can be written as:

    Grandmother(x, y) ← Mother(x, z) ∧ Parent(z, y)

Note that an additional variable z is needed to define the relationship between x and y.

Rules can be recursive:

    (x + y = z) ← (u + v = z) ∧ (x + 1 = u) ∧ (v + 1 = y)

where x + y = z is short for Equal(plus(x, y), z).
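
As a tiny illustration of how such a rule grounds out over facts (my own example; the fact base, including the name George, is hypothetical), the grandmother rule is just a join on the shared variable z:

    mother = {("Liz", "Charley"), ("Ann", "Liz")}        # Mother(x, z) facts
    parent = mother | {("George", "Charley")}            # Parent(z, y) facts

    # Grandmother(x, y) ← Mother(x, z) ∧ Parent(z, y):
    # bind z by joining the two relations on the shared variable.
    grandmother = {(x, y) for (x, z1) in mother for (z2, y) in parent if z1 == z2}

    print(grandmother)   # {('Ann', 'Charley')}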

Terminology

A term is any constant or variable, or any function applied to terms.
An atom is any predicate applied to terms.
A literal is an atom (positive) or the negation of an atom (negative).
A clause is a disjunction (or) of literals.
A Horn clause has at most one positive literal.
A Horn clause H ∨ ¬L1 ∨ ¬L2 ∨ ... can be written as H ← (L1 ∧ L2 ∧ ...).

Example

Suppose we want to learn reachability in graphs.

[Figure: a directed graph with nodes A, B, C, D, E and edges A→B, B→C, C→D, D→B, D→E.]

In this graph, the predicate Edge is true for:

    (A, B) (B, C) (C, D) (D, B) (D, E)

and false for:

    (A, A) (A, C) (A, D) (A, E) (B, A) (B, B) (B, D) (B, E) (C, A) (C, B) (C, C) (C, E) (D, A) (D, C) (D, D) (E, A) (E, B) (E, C) (E, D) (E, E)
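
The Reach relation on the next slide is just the transitive closure of Edge; a short sketch (my own) that recomputes its 16 positive and 9 negative pairs:

    nodes = "ABCDE"
    edges = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "B"), ("D", "E")}

    reach = set(edges)
    changed = True
    while changed:                                # iterate to a fixed point
        changed = False
        for (x, z) in list(reach):
            for (z2, y) in edges:
                if z == z2 and (x, y) not in reach:
                    reach.add((x, y))
                    changed = True

    positives = sorted(reach)
    negatives = sorted({(x, y) for x in nodes for y in nodes} - reach)
    print(len(positives), len(negatives))         # 16 9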

The predicate Reach is true for:

    (A, B) (A, C) (A, D) (A, E) (B, B) (B, C) (B, D) (B, E) (C, B) (C, C) (C, D) (C, E) (D, B) (D, C) (D, D) (D, E)

and false for:

    (A, A) (B, A) (C, A) (D, A) (E, A) (E, B) (E, C) (E, D) (E, E)

We want literals that make Reach(x, y) "more true". When choosing a literal L, the FOIL program maximizes

    p_1 ( log_2 (p_1 / (p_1 + n_1)) - log_2 (p_0 / (p_0 + n_0)) )

where p_1 (n_1) is the number of positive (negative) examples still covered after adding L, and p_0 and n_0 are the corresponding counts before adding L.

First Rule for Example

Edge(x, y) is true for 5 positive examples and no negative examples of Reach(x, y). FOIL's measure is 3.20.
Edge(x, z) is true for all 16 positive examples and 4 negative examples of Reach(x, y). FOIL's measure is 5.15.

Despite FOIL's judgment, I'll select

    Reach(x, y) ← Edge(x, y)

for the first rule, leaving the following positive examples uncovered:

    (A, C) (A, D) (A, E) (B, B) (B, D) (B, E) (C, B) (C, C) (C, E) (D, C) (D, D)
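
A quick check of these two numbers with the formula above (my own sketch):

    from math import log2

    def foil_gain(p0, n0, p1, n1):
        """p1 times the information gained by adding the literal."""
        return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

    p0, n0 = 16, 9                      # Reach(x, y) positives / negatives

    # Adding Edge(x, y): covers 5 positives, 0 negatives.
    print(round(foil_gain(p0, n0, 5, 0), 2))    # 3.22 (the slide reports 3.20)

    # Adding Edge(x, z): still covers all 16 positives, plus 4 negatives.
    print(round(foil_gain(p0, n0, 16, 4), 2))   # 5.15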

Second Rule for Example

Now Reach(x, y) ← Edge(x, z) is true for all 11 remaining positive examples and the following 4 negative examples; FOIL's measure is 4.57:

    (A, A) (B, A) (C, A) (D, A)

Reach(x, y) ← Edge(w, y) is also true for all 11 positive examples, but for a different 4 negative examples:

    (E, B) (E, C) (E, D) (E, E)

Picking Edge(x, z), three further additions are:

    Reach(x, y) ← Edge(x, z) ∧ Edge(z, y)
    Reach(x, y) ← Edge(x, z) ∧ Edge(w, y)
    Reach(x, y) ← Edge(x, z) ∧ Reach(z, y)

All three rules cover 0 negative examples, but the first rule covers only 5 of the positive examples while the second and third cover all 11. As a result, the second rule might be:

    Reach(x, y) ← Edge(x, z) ∧ Edge(w, y)

This second rule is not very good: all it requires is an edge out of x and an edge into y. In fact, FOIL outputs only this one rule on this data; a graph with more kinds of paths would be needed. The recursive rule is the correct one. A special check makes sure infinite recursion is avoided.
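
A small sketch (my own, not FOIL's implementation) of how the recursive rule can be evaluated, with a visited set serving as the special check against infinite recursion:

    # Reach(x, y) ← Edge(x, y)
    # Reach(x, y) ← Edge(x, z) ∧ Reach(z, y)
    edges = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "B"), ("D", "E")}

    def reach(x, y, visited=None):
        visited = set() if visited is None else visited
        if (x, y) in edges:                       # base rule
            return True
        if x in visited:                          # cycle: stop recursing
            return False
        visited.add(x)
        return any(reach(z, y, visited)           # recursive rule
                   for (x2, z) in edges if x2 == x)

    print(reach("A", "E"))   # True  (A → B → C → D → E)
    print(reach("E", "A"))   # False (E has no outgoing edges)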