Data Mining using Ant Colony Optimization
Thanks to: Johannes Singler, Bryan Atkinson


Brian Dale Phelps
6 years ago

Presentation Outline
- Introduction to Data Mining
- Rule Induction for Classification
- AntMiner
  - Overview: Input/Output
  - Rule Construction
  - Quality Measurement
  - Pheromone: Initial/Updating
  - Experiments/Results
  - Performance/Complexity
- Swarm-based Genetic Programming
  - Introduction to GP, Symbolic Regression
  - Crossover problems
  - Ant Colony Crossover
  - Experiments and Results

Introduction
- Data Mining tries to find hidden knowledge, unexpected patterns, and new rules in large databases.
- It aims at the discovery of useful summaries of data.
- It is a key element of a much more elaborate process: Knowledge Discovery in Databases (KDD).


Goals of Rule Induction
- Stage of Data Mining: Rule Induction.
- Find rules to describe data in some way.
- Rules should be not only accurate but also comprehensible for a human user, to support decision making.

Focus in this Talk
- Rule Induction for Classification using ACO
  - Given: training set (instances/cases to classify).
  - Goal: to come up with (preferably simple) rules to classify data.
  - Algorithm by Parpinelli, Lopes and Freitas: AntMiner.
- ACO + Genetic Programming
  - Symbolic regression.

Rule Induction: Possible Outputs
- decision trees
- (ordered) decision lists [used here]:
    if <attribute1>=<value1> and <attribute2>=<value2> and ... then <class>=<class1>
    else if ...

AntMiner Input
- Training set / test set
- Attribute / value pairs
- Given classes / classification

AntMiner Output
- Ordered decision list: an ordered list of IF-THEN rules of the form
    IF <condition> THEN <class>
    <condition> = <term1> AND <term2> AND ...
    <term> = <attribute> = <value>
- Plus a default rule (majority value).
- The first matching rule fires.
- Only discrete attributes are supported so far. Continuous values must be discretized beforehand.
- This is a quite limited version of a decision list.

Prerequisites for an ACO (Review)
- Problem-dependent heuristic function (η) for measuring the quality of items that could be added to the partial solution so far.
- Pheromone updating rule (τ).
- Probabilistic transition rule based on η and τ.
- Difference from most ACO algorithms mentioned in class: AntMiner does not use a graph representation of the problem.

AntMiner Algorithm: Top-Level
Pseudo-code for finding one rule set:
    trainingset = {all training cases}
    discoveredrulelist = [ ]
    WHILE (trainingset still too big)
        Initialize pheromone (equally distributed)
        Ants try to find a good classification rule by the ACO heuristic
        Add best rule found to discoveredrulelist
        Remove correctly covered examples from trainingset

AntMiner Algorithm: Mid-Level
Pseudo-code for finding one rule:
    REPEAT
        Start new ant with empty rule (antecedent)
        Construct rule by adding one term at a time, then choose the rule consequent
        Prune rule
        Increase pheromone on the trail the ant used, according to the quality of the rule
    UNTIL (maximum number z of ants exceeded) OR (no improvement during the last k iterations)
In effect the population consists of only one ant working at a time.

AntMiner Algorithm: Bottom-Level
Repeat as long as possible:
- Add one condition to the rule.
- Use a probabilistic approach based on pheromone concentration and heuristic.
- Do not use an attribute twice.
- The resulting rule must cover at least a minimum number of cases.
After the antecedent is finished, calculate the resulting class.
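The top-level loop is a sequential covering scheme, and a minimal sketch may make the control flow concrete. Everything named here is an assumption for illustration: cases are dicts with a "class" key, and `greedy_single_term_rule` is a hypothetical stand-in for the ant search (it just picks the best one-term rule), not the ACO procedure itself.

```python
from collections import Counter

def covers(antecedent, case):
    """True if every term attribute=value of the antecedent matches the case."""
    return all(case[attr] == value for attr, value in antecedent.items())

def majority_class(cases):
    return Counter(c["class"] for c in cases).most_common(1)[0][0]

def greedy_single_term_rule(cases):
    """Hypothetical stand-in for the ants: best one-term rule by (accuracy, hits)."""
    best = None
    for attr in (a for a in cases[0] if a != "class"):
        for value in sorted({c[attr] for c in cases}):
            covered = [c for c in cases if c[attr] == value]
            label = majority_class(covered)
            correct = sum(c["class"] == label for c in covered)
            score = (correct / len(covered), correct)
            if best is None or score > best[0]:
                best = (score, {attr: value}, label)
    return best[1], best[2]

def discover_rule_list(cases, find_rule, max_uncovered_cases):
    """WHILE the training set is still too big: find a rule, drop what it covers."""
    training_set = list(cases)
    rules = []
    while len(training_set) > max_uncovered_cases:
        antecedent, label = find_rule(training_set)
        rules.append((antecedent, label))
        remaining = [c for c in training_set
                     if not (covers(antecedent, c) and c["class"] == label)]
        if len(remaining) == len(training_set):
            break  # safety guard: the rule removed nothing
        training_set = remaining
    # Default rule: majority class of whatever is left uncovered.
    default = majority_class(training_set or cases)
    return rules, default

CASES = [
    {"outlook": "overcast", "windy": "false", "class": "yes"},
    {"outlook": "overcast", "windy": "true", "class": "yes"},
    {"outlook": "sunny", "windy": "false", "class": "no"},
    {"outlook": "sunny", "windy": "true", "class": "no"},
    {"outlook": "rainy", "windy": "true", "class": "no"},
]
rules, default = discover_rule_list(CASES, greedy_single_term_rule, 1)
print(rules, default)
```

Passing the rule finder as a parameter keeps the covering loop independent of how a single rule is searched for, so the ant-based search could be swapped in without touching the loop.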

Rule Construction
Probability of adding the term <A_i>=<V_ij>:
    $P_{ij} = \eta_{ij} \cdot \tau_{ij}(t)$ [normalized]
where
- $A_i$ is the i-th attribute,
- $V_{ij}$ is the j-th possible value of the i-th attribute,
- η is the heuristic function and τ the pheromone trail.

Heuristic Function (η)
- Analogous to the proximity function in TSP or the colouring matrix in the graph colouring problem.
- Uses information theory (entropy).
- Split the instances using the rule. Quality corresponds to the entropy of the remaining buckets; the less, the better.
    $H(W \mid A_i = V_{ij}) = -\sum_{w=1}^{k} P(w \mid A_i = V_{ij}) \cdot \log_2 P(w \mid A_i = V_{ij})$
    $\eta_{ij} \propto \log_2 k - H(W \mid A_i = V_{ij})$ [normalized]
  where k is the number of classes.

Information Heuristic Example
For T (temperature): high = T > 80, mild = 70 < T ≤ 80, cold = 0 < T ≤ 70 (for later).
P(play | outlook=sunny) = 2/14 = 0.143, P(don't play | outlook=sunny) = 3/14 = 0.214
H(W, outlook=sunny) = −0.143·log2(0.143) − 0.214·log2(0.214) = 0.877
η = log2 k − H(W, outlook=sunny) = 1 − 0.877 = 0.123
(These probabilities are taken over all 14 cases; conditioning on the 5 sunny cases alone would give 2/5 and 3/5.)
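The two formulas can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: cases are dicts with a "class" key, the pheromone table is a dict keyed by (attribute, value), and the entropy uses the conditional probabilities the formula calls for (counts divided by the number of matching cases), so its numbers differ from a worked example that divides by all 14 cases.

```python
import math
import random
from collections import Counter

def term_entropy(cases, attr, value, class_key="class"):
    """H(W | attr = value): entropy of the class labels among matching cases."""
    labels = [c[class_key] for c in cases if c[attr] == value]
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def term_probabilities(cases, pheromone, used_attrs=(), class_key="class"):
    """P_ij = eta_ij * tau_ij, normalized over terms whose attribute is unused."""
    k = len({c[class_key] for c in cases})
    weights = {}
    for attr in cases[0]:
        if attr == class_key or attr in used_attrs:
            continue  # the class is not a term; attributes are used at most once
        for value in {c[attr] for c in cases}:
            eta = math.log2(k) - term_entropy(cases, attr, value, class_key)
            weights[(attr, value)] = eta * pheromone[(attr, value)]
    total = sum(weights.values())
    if total == 0:  # every term is uninformative: fall back to a uniform choice
        return {term: 1 / len(weights) for term in weights}
    return {term: w / total for term, w in weights.items()}

# Toy data (hypothetical): 5 sunny cases (2 yes, 3 no), 4 overcast cases (all yes).
CASES = ([{"outlook": "sunny", "class": "yes"}] * 2
         + [{"outlook": "sunny", "class": "no"}] * 3
         + [{"outlook": "overcast", "class": "yes"}] * 4)
UNIFORM_TAU = {("outlook", "sunny"): 0.5, ("outlook", "overcast"): 0.5}

probs = term_probabilities(CASES, UNIFORM_TAU)
terms = list(probs)
# Roulette-wheel selection of the next term to add to the rule.
chosen = random.choices(terms, weights=[probs[t] for t in terms])[0]
print(probs, chosen)
```

With equal pheromone everywhere, the pure bucket (outlook=overcast) gets almost all of the probability mass, which is the behaviour the entropy heuristic is meant to produce.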

Information Heuristic Example (continued)
For H (humidity): high = H > 85, normal = 0 < H ≤ 85 (for later).
P(play | outlook=overcast) = 4/14 = 0.286, P(don't play | outlook=overcast) = 0/14 = 0
H(W, outlook=overcast) = −0.286·log2(0.286) = 0.516
η = log2 k − H(W, outlook=overcast) = 1 − 0.516 = 0.484

Quality Function
- Measures the classification quality of a rule / several rules.
- For one rule, Q = sensitivity · specificity:
    $Q = \frac{TP}{TP + FN} \cdot \frac{TN}{FP + TN}$
  where T = true, F = false, P = positive, N = negative.
- The bigger the value of Q, the better.
- Measuring the simplicity of a rule set: number of rules and average number of terms per rule. The fewer, the simpler, thus the better.

Rule Pruning
- Iteratively remove one term at a time from the rule while this improves the classification accuracy of the rule.
- The majority class might change.
- If ambiguous, remove the term whose removal improves the accuracy the most.
- Simplicity improves anyway.
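A short sketch of the quality function and the pruning loop, run on the 14-case weather data from the sample run. It assumes that pruning compares rules by Q and stops at the first step that brings no strict improvement; the function names and the starting rule are hypothetical.

```python
from collections import Counter

ROWS = """sunny,hot,high,false,no
sunny,hot,high,true,no
overcast,hot,normal,false,yes
rainy,mild,high,false,yes
rainy,cool,normal,false,yes
rainy,cool,normal,true,no
overcast,cool,normal,true,yes
sunny,mild,high,false,no
sunny,cool,normal,false,yes
rainy,mild,normal,false,yes
sunny,mild,normal,true,yes
overcast,mild,high,true,yes
overcast,hot,normal,false,yes
rainy,mild,high,true,no"""
NAMES = ("outlook", "temperature", "humidity", "windy", "class")
CASES = [dict(zip(NAMES, row.split(","))) for row in ROWS.splitlines()]

def covers(antecedent, case):
    return all(case[a] == v for a, v in antecedent.items())

def rule_quality(cases, antecedent, label):
    """Q = sensitivity * specificity = TP/(TP+FN) * TN/(FP+TN)."""
    tp = fp = fn = tn = 0
    for case in cases:
        covered, positive = covers(antecedent, case), case["class"] == label
        if covered and positive:
            tp += 1
        elif covered:
            fp += 1
        elif positive:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (fp + tn) if fp + tn else 0.0
    return sensitivity * specificity

def majority_label(cases, antecedent):
    labels = [c["class"] for c in cases if covers(antecedent, c)]
    return Counter(labels).most_common(1)[0][0]

def prune(cases, antecedent):
    """Drop one term at a time while the best removal strictly improves Q."""
    current = dict(antecedent)
    label = majority_label(cases, current)
    best_q = rule_quality(cases, current, label)
    while len(current) > 1:
        candidates = []
        for attr in current:
            shorter = {a: v for a, v in current.items() if a != attr}
            new_label = majority_label(cases, shorter)  # class may change
            candidates.append((rule_quality(cases, shorter, new_label), shorter, new_label))
        q, shorter, new_label = max(candidates, key=lambda cand: cand[0])
        if q <= best_q:
            break
        current, label, best_q = shorter, new_label, q
    return current, label, best_q

start = {"outlook": "overcast", "humidity": "high", "windy": "true"}
print(prune(CASES, start))
```

In this run the rule shrinks from three terms to one and its consequent flips from yes to no, which illustrates the remark that the majority class might change during pruning.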


Pheromone
Initial pheromone value:
    $\tau_{ij}(t = 0) = \frac{1}{\sum_{i=1}^{a} b_i}$ [normalized]
where a is the total number of attributes and $b_i$ is the number of possible values of $A_i$.

Pheromone Updating (τ)
- (1) Values before the update.
- (2) First increase the pheromone of the used terms according to rule quality:
    $\tau_{ij}(t + 1) = \tau_{ij}(t) \cdot (1 + Q)$
- (3) Then normalize the pheromone level of all terms; this acts as pheromone evaporation.

Using the Discovered Rules
- Apply the rules in the order they were discovered.
- The first rule that covers the case is applied.
- If no rule covers the case, apply the default result (majority value).
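The pheromone bookkeeping and the use of the ordered rule list can be sketched as follows. The data layout (a dict keyed by (attribute, value), rules as (antecedent, class) pairs) is an assumption made for this illustration.

```python
def init_pheromone(values_per_attr):
    """tau_ij(0) = 1 / (b_1 + ... + b_a): every term starts with the same amount."""
    total_terms = sum(len(values) for values in values_per_attr.values())
    return {(attr, value): 1.0 / total_terms
            for attr, values in values_per_attr.items() for value in values}

def update_pheromone(pheromone, antecedent, quality):
    """Reinforce the used terms by (1 + Q), then normalize all terms.

    Normalizing lowers every unused term, which plays the role of evaporation.
    """
    updated = dict(pheromone)
    for term in antecedent.items():
        updated[term] *= 1.0 + quality
    total = sum(updated.values())
    return {term: tau / total for term, tau in updated.items()}

def classify(rule_list, default, case):
    """Apply rules in discovery order; the first covering rule decides."""
    for antecedent, label in rule_list:
        if all(case.get(attr) == value for attr, value in antecedent.items()):
            return label
    return default

VALUES = {"outlook": ["sunny", "overcast", "rainy"], "windy": ["true", "false"]}
tau0 = init_pheromone(VALUES)
tau1 = update_pheromone(tau0, {"outlook": "overcast"}, quality=0.5)

RULES = [
    ({"outlook": "overcast"}, "yes"),
    ({"outlook": "rainy", "windy": "false"}, "yes"),
    ({"outlook": "sunny", "humidity": "normal"}, "yes"),
]
print(tau1[("outlook", "overcast")], tau1[("outlook", "sunny")])
print(classify(RULES, "no", {"outlook": "sunny", "humidity": "high", "windy": "false"}))
```

Because the table is renormalized after every update, the values always sum to 1 and can be read directly as relative preferences between terms.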

Possible Discretization of Continuous Attributes
- Use C4.5-Disc. Quick overview:
  - Extract a reduced data set that contains only the attribute to discretize and the desired classification.
  - From that, build a decision tree using the C4.5 algorithm (another rule induction algorithm).
  - Result: a decision tree with binary decisions: x ≤ a: go left; x > a: go right.
  - Each path corresponds to the definition of a categorical interval.

AntMiner's Parameters
- Number of ants (3000 used in experiments). Also limits the maximum number of rules found for a classification. It is not necessarily exhausted because the algorithm might converge earlier.
- Minimum number of cases per rule (10). Each rule must cover at least this many cases. Avoids overfitting.
- Maximum number of uncovered cases in the training set (10). The algorithm stops once no more than this many cases remain uncovered.
- Number of rules to test for the convergence of the ants (10). The algorithm waits this long for an improvement.

Sample Run: Start
Deciding whether to play outside.
Attributes: outlook, temperature, humidity, windy, play
Classes: play (yes), do not play (no)

    #   outlook   temperature  humidity  windy  play
    1   sunny     hot          high      false  no
    2   sunny     hot          high      true   no
    3   overcast  hot          normal    false  yes
    4   rainy     mild         high      false  yes
    5   rainy     cool         normal    false  yes
    6   rainy     cool         normal    true   no
    7   overcast  cool         normal    true   yes
    8   sunny     mild         high      false  no
    9   sunny     cool         normal    false  yes
    10  rainy     mild         normal    false  yes
    11  sunny     mild         normal    true   yes
    12  overcast  mild         high      true   yes
    13  overcast  hot          normal    false  yes
    14  rainy     mild         high      true   no

Sample run for finding one rule set
- Start: I = {all}, R = {}
- Ant 1: Chooses probabilistically outlook=overcast (then play=yes).
- Ant 1: Chooses values for the other attributes.
- Ant 1: Finishes because all attributes are used.
- Ant 1: The last three conditions are pruned away.
- I = {1,2,4,5,6,8,9,10,11,14}, R = {outlook=overcast → yes}
- Ant 2: Chooses outlook=rainy (then play=yes). The rule is not good enough (3:2).
- Ant 2: Chooses windy=true (then play=no).
- Ant 2: Finishes because otherwise the covered set would be too small. No pruning is possible either.

Sample Run Result
Possible result (not the simplest):
    outlook=overcast → play=yes
    outlook=rainy, windy=false → play=yes
    outlook=sunny, humidity=normal → play=yes
    otherwise → play=no

Comparison to CN2 Algorithm
- CN2 uses beam search (limited breadth-first search with beam width b).
- Add all possible terms to the current partial rules, evaluate, and retain only the b best ones.
- No feedback for constructing new rules.
- Output format is the same (ordered rule list).
- Uses an entropy heuristic as well.

Experiment Setup
- Dimension roughly: ... cases, 9 to 34 attributes, 2 to 6 classes.
- Tests run using a 10-fold cross-validation procedure:
  - Divide the data into 10 partitions.
  - For each partition: treat it as the test data and use the other 90% as the training data. Measure the performance.
  - Take the average value.
- This helps to achieve statistically meaningful results.
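The cross-validation procedure can be sketched as below. The sketch assumes cases are dicts with a "class" key and that partitions are built by striding through the list without shuffling; `majority_learner` is a hypothetical placeholder for a real rule inducer.

```python
from collections import Counter

def k_fold_accuracy(cases, train_and_predict, k=10):
    """Average test accuracy over k partitions (requires len(cases) >= k)."""
    folds = [cases[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        # Train on the other k-1 partitions, test on the held-out one.
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        predict = train_and_predict(train)
        correct = sum(predict(c) == c["class"] for c in test)
        scores.append(correct / len(test))
    return sum(scores) / k

def majority_learner(train):
    """Hypothetical baseline: always predict the most common training class."""
    label = Counter(c["class"] for c in train).most_common(1)[0][0]
    return lambda case: label

# Toy data: 20 cases, every fifth one is a "no".
CASES = [{"id": i, "class": "no" if i % 5 == 0 else "yes"} for i in range(20)]
print(k_fold_accuracy(CASES, majority_learner))
```

Each case is used for testing exactly once, so the averaged score does not depend on one lucky or unlucky split.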


Data Sets
(table not preserved)

Performance Results
- No particular parameter optimizations for both algorithms.
- Same computation time.

Extensions to the Algorithm
By Galea [3]:
- Deterministic rule with q probability as in ACS-TSP.
  - Choose probabilistically (considering pheromone trail and heuristic function) with probability q.
  - Otherwise deterministically choose the term with maximum probability.
- Improves results slightly.
- An extension for fuzzy rules is also possible.
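A minimal sketch of the q rule as described on this slide, assuming the term probabilities are already available as a dict; the function name and the example values are hypothetical.

```python
import random

def choose_term(probabilities, q, rng=random):
    """With probability q sample a term by its probability, else take the best one."""
    if rng.random() < q:
        terms = list(probabilities)
        return rng.choices(terms, weights=[probabilities[t] for t in terms])[0]
    return max(probabilities, key=probabilities.get)

PROBS = {("outlook", "overcast"): 0.6,
         ("outlook", "rainy"): 0.3,
         ("windy", "true"): 0.1}
rng = random.Random(0)
print(choose_term(PROBS, q=0.0))           # always the most probable term
print(choose_term(PROBS, q=1.0, rng=rng))  # always a probabilistic draw
```

Setting q = 1 recovers the purely probabilistic choice, while q = 0 makes the construction greedy.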

Comparative Results
Side-by-side comparison and effects of rule pruning (figures not preserved).


Generated Rules / Terms per Rule
(figures not preserved)

Algorithm Complexity
Introducing a lot of variables:
- n: number of cases
- a: number of attributes
- v: number of values per attribute; considered small, O(1)
- k: number of conditions per inspected rule while evaluating and pruning
- z: number of ants
- r: number of discovered rules

Complexity Comparison
- Ant-Miner, average case: $O(r \cdot z \cdot [k \cdot a + n \cdot k^3] + a \cdot n)$
- Ant-Miner, worst case k = O(a): $O(r \cdot z \cdot a^3 \cdot n)$
- CN2: $O(a \, (n + \log a))$

Further Experiments
Further experiments by the authors of AntMiner show that ACO really helps:
- Use of pheromone trails improves the average solution.
- Use of rule pruning improves the simplicity without harming the quality.

References
[1] Data Mining with an Ant Colony Optimization Algorithm. Parpinelli, Lopes, Freitas.
[2] An Ant Colony Based System for Data Mining: Applications to Medical Data. Parpinelli, Lopes, Freitas. 2001.
[3] Applying Swarm Intelligence to Rule Induction. Michelle Galea.
[4] The CN2 Induction Algorithm. Clark, Niblett.
[5] Data Mining. Adriaans, Zantinge. Addison-Wesley.
[6] Learning Fuzzy Rules Using Ant Colony Optimization Algorithms. Casillas, Cordón, Herrera.
[7] Bryan Atkinson, Honours Project Report: n-atkinson-winter-2006.pdf

Ant-based Programming
- Genetic Programming has been successful at inducing program descriptions.
- Problems with scaling:
  - Diversity
  - Retaining useful fragments: avoiding disruption of higher-order functions
- Can ACO help? Maybe: learn useful associations, avoid disruption.

Genetic Programming
- Programs are represented in a tree structure.
- Learning through population-based, evolutionary search.
- Genetic operators: crossover, mutation.
- Requires specification of:
  - Functions (F): internal nodes
  - Terminals (T): leaf nodes
- Symbolic Regression:
  - F = {+, -, /, *, sin, cos, exp}
  - T = {integers in range (-5, 5), x}

Symbolic Regression
- Find the function that best fits a number of sample points.
- Good fit is determined by hits: candidate function within a threshold distance.
    $f(k) = h(k) - \frac{1}{\max(h(k), 1)} \sum_{i=1}^{size(d)} e(k, i)$
    $e(k, i) = \left| v(k, x(i)) - y(i) \right|$
    $h(k) = \sum_{i=1}^{size(d)} hits(k, i)$
    $hits(k, i) = 0$ if $e(k, i) > \varepsilon$, and $1$ otherwise
  where $v(k, x)$ is the value of the k-th program for x and $\varepsilon$ is the hit threshold.
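The hit-based fitness can be sketched directly from the definitions above. Two assumptions are made for illustration: a program is any Python callable of x, and an error exactly equal to the threshold still counts as a hit.

```python
import math

def fitness(program, samples, threshold):
    """f(k) = h(k) - sum_i e(k, i) / max(h(k), 1)."""
    errors = [abs(program(x) - y) for x, y in samples]   # e(k, i)
    hits = sum(1 for e in errors if e <= threshold)      # h(k)
    return hits - sum(errors) / max(hits, 1)

# Sample points d taken from the target 3x + sin(x).
SAMPLES = [(x, 3 * x + math.sin(x)) for x in range(-2, 3)]

def exact(x):
    return 3 * x + math.sin(x)

def linear_only(x):
    return 3 * x

print(fitness(exact, SAMPLES, threshold=0.01))
print(fitness(linear_only, SAMPLES, threshold=0.01))
```

A program that hits every sample point scores close to the number of points, while the accumulated error pulls the score of poorer candidates down, so programs with equal hit counts are still ranked by accuracy.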

Symbolic Regression Example
- GP: program tree (figure not preserved).
- Mathematically: 3x + sin(x)

Crossover
- (tree figures not preserved)
- Problem: crossover can easily disrupt useful couplings.

Adapting Crossover with ACO
- Use context-aware crossover.
- Basic crossover chooses a node randomly, so it is context unaware.
- Adapt crossover to remember useful function couplings.
- Not automatically defined functions (ADFs).

Function Coupling Matrix (C)
- Matrix indexed by function symbols such as +, *, sin, cos (values not preserved).
- Important couplings have high values; e.g. sin-x.

Swarm-based GP (SB-GP)
Three modifications to GP:
1. Initialization of the coupling matrix, C.
2. Crossover using the coupling matrix.
3. Pheromone update based upon program fitness.

Pheromone Initialization
- For every function and terminal coupling (i, j): initialize the pheromone $\tau_{i,j}$ to the initial value $\tau_0$.
- $\tau_0$ is a system parameter.

Ant Colony (AC) Crossover
- Choose a random branch, B, from the root to a leaf in program tree $P_n$.
- For every edge (i, j) in B, the probability of choosing node i as root of subtree $S_n$, where i is the parent and j is a child node, is given by:
    $p(i, n) = \frac{\tau_{max}(n) - \tau_{min}(n) + \tau_{i,j}(n)}{T(n)}$
- Choose a random branch, B, from the root to a leaf in program tree $P_m$.
- For every edge (i, j) in B, the probability of choosing node i as root of subtree $S_m$ is:
    $p(i, m) = \frac{\tau_{max}(m) - \tau_{min}(m) + \tau_{i,j}(m)}{T(m)}$

AC Crossover Continued
where T(k) is given by:
    $T(k) = \sum_{(i,j) \in E(k)} \left( \tau_{max}(k) - \tau_{min}(k) + \tau_{i,j}(k) \right)$
and
    $\tau_{i,j}(k) = C(V(k, i), V(k, j))$
    $\tau_{max}(k) = \max_{(i,j) \in E(k)} \tau_{i,j}(k)$
    $\tau_{min}(k) = \min_{(i,j) \in E(k)} \tau_{i,j}(k)$
    $E(k) = \{ \text{edges in the } k\text{-th program subtree} \}$

AC Crossover Example
(figure not preserved)
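The selection probabilities can be sketched as follows. One assumption needs stating: the edge set E(k) is taken here to be the edges of the chosen branch, so that the probabilities sum to 1. The coupling matrix is a plain dict keyed by (parent symbol, child symbol), and all names and values are hypothetical.

```python
import random

def init_coupling(symbols, tau0):
    """Coupling matrix C: every (parent, child) pair starts at tau0."""
    return {(parent, child): tau0 for parent in symbols for child in symbols}

def crossover_point_probabilities(branch, coupling):
    """p(i) = (tau_max - tau_min + tau_ij) / T for each edge (i, j) on the branch."""
    taus = [coupling[(parent, child)] for parent, child in branch]
    spread = max(taus) - min(taus)
    weights = [spread + tau for tau in taus]
    total = sum(weights)  # T(k)
    if total == 0:        # all pheromone is zero: choose uniformly
        return [1 / len(branch)] * len(branch)
    return [w / total for w in weights]

C = init_coupling(["+", "*", "sin", "x"], tau0=0.1)
C[("*", "sin")] = 0.2
C[("sin", "x")] = 0.5

# Root-to-leaf branch of a program tree, as (parent, child) edges.
BRANCH = [("+", "*"), ("*", "sin"), ("sin", "x")]
probs = crossover_point_probabilities(BRANCH, C)
# The parent node of the drawn edge becomes the root of the exchanged subtree.
parent, child = random.choices(BRANCH, weights=probs)[0]
print(probs, parent)
```

Adding the spread between the largest and smallest pheromone to every edge keeps all probabilities strictly positive whenever any pheromone is present, so no node on the branch is ever ruled out completely.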


Experimental Parameters
- Parameters listed: initial pheromone ($\tau_0$), evaporation rate (ρ), best k programs used for evaluation, max program depth, min program depth, tournament size, crossover probability, mutation probability, number of generations (default).
- Values, in the order extracted (the pairing with the parameters is not preserved): 10-5, 0.9, 30, 15, 1.

Functions and Results
- F1: cos(x^2) + sin(x^2) + x^2
- F2: cos(x^2) + sin(x^2) + x^2 + cos(x) + sin(x)
- F3: sin(x)*x^4 + sin(x)*x^3 + sin(x)*x^2 + sin(x)*x
- Results table (values not preserved). Columns: Test, Population Size, GP Mean, GP STD, SB-GP Mean, SB-GP STD, P Value.

F3: Function Couplings
(figure not preserved)

Conclusions
- Statistically significant improvement in performance.
- Useful couplings are learnt.
- Number of successful trials increased.
- Couplings can saturate: use an ACS-style q mechanism to choose randomly some of the time.

An Ant Colony Based System for Data Mining: Applications to Medical Data

Rafael S. Parpinelli (1), Heitor S. Lopes (1), Alex A. Freitas (2)
(1) CEFET-PR, CPGEI, Av. Sete de Setembro, 3165, Curitiba - PR, 80230-901