Preprocessing and Feature Selection DWML, /16

Gwenda White
5 years ago

1 Preprocessing and Feature Selection

2 When features don't help
Data generated by a process described by a Bayesian network.
[Figure: Bayesian network over Class, A_1, A_2, A_3 with its conditional probability tables; table values not recovered.]
Attribute A_2 is just a duplicate of A_1.
Conditional class probability, for example: P(Class | A_1 = 1, A_2 = 1, A_3 = 0) = [class value and probability not recovered]

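3 The Naive Bayes model learned from data
[Figure: Naive Bayes model over Class, A_1, A_2, A_3 with the tables for Class and for each of A_1, A_2, A_3 given Class; table values not recovered.]
In the Naive Bayes model: P(Class | A_1 = 1, A_2 = 1, A_3 = 0) = [class value and probability not recovered]
Intuitively: the NB model double counts the information provided by A_1 and A_2.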
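A small numeric sketch of the double-counting effect. The prior and likelihood values are illustrative assumptions, not the numbers from the slide's tables (which were not recovered), and `nb_posterior` is a hypothetical helper name.

```python
def nb_posterior(prior_pos, likelihoods):
    """Posterior P(C=+ | evidence) under the Naive Bayes factorization.

    likelihoods: list of (P(a_i | C=+), P(a_i | C=-)) pairs,
    one pair per observed feature value.
    """
    score_pos = prior_pos
    score_neg = 1.0 - prior_pos
    for p_pos, p_neg in likelihoods:
        score_pos *= p_pos
        score_neg *= p_neg
    return score_pos / (score_pos + score_neg)

# Assumed numbers: A_1 = 1 is twice as likely under C=+ as under C=-.
a1 = (0.8, 0.4)
a3 = (0.5, 0.5)  # A_3 = 0 is assumed to be uninformative here

once = nb_posterior(0.5, [a1, a3])       # A_1 counted once
twice = nb_posterior(0.5, [a1, a1, a3])  # duplicate A_2 = A_1 counted again
```

Because A_2 carries no information beyond A_1, the posterior should not move when A_2 is added; under these assumed numbers the Naive Bayes posterior is pushed further towards the positive class instead.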

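4 The Naive Bayes model with selected features A_1 and A_3
[Figure: Naive Bayes model over Class, A_1, A_3 with the tables for Class and for A_1 and A_3 given Class; table values not recovered.]
In this Naive Bayes model: P(Class | A_1 = 1, A_3 = 0) = [class value and probability not recovered] (and all other posterior class probabilities are also the same as for the true model).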

5 Decision trees learned from the same data
[Figure: decision tree(s) testing A_1, A_2 and A_3; leaf class distributions shown: 0.66/0.33, 0.57/0.43, 0.46/0.54, 0.36/0.64.]
A decision tree does not test two equivalent variables twice on one branch (but might pick one or the other on different branches).

6 Problems
- Correlated features can skew prediction.
- Irrelevant features (not correlated to the class variable) cause an unnecessary blowup of the model space (search space).
- Irrelevant features can drown the information provided by informative features in noise (e.g. a distance function dominated by random values of many uninformative features).
- Irrelevant features in a model reduce its explanatory value (even when predictive accuracy is not reduced).
Methods of feature selection
- Define relevance of features, and filter out irrelevant features before learning (relevance is independent of the model used).
- Filter features based on model-specific criteria (e.g. eliminate highly correlated features for Naive Bayes).
- Wrapper approach: evaluate feature subsets by model performance.

7 Relevance
A possible definition: a feature A_i is irrelevant if for all a ∈ states(A_i) and c ∈ states(C):
P(C = c | A_i = a) = P(C = c).
Limitations of relevance-based filtering:
- Even if A_i is irrelevant, it may become relevant in the presence of another feature, i.e. for some A_j and a' ∈ states(A_j): P(C = c | A_i = a, A_j = a') ≠ P(C = c | A_j = a').
- For Naive Bayes: irrelevant features neither help nor hurt.
- Irrelevance does not capture redundancy.
Generally: it is difficult to say in a data- and method-independent way which features are useful.
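The first limitation can be illustrated with an XOR-style toy distribution. This distribution is an assumption chosen for illustration, not an example from the slides: A_i and A_j are each irrelevant by the definition above, yet together they determine the class.

```python
from fractions import Fraction
from itertools import product

# Assumed toy distribution: A_i, A_j uniform and independent, C = A_i XOR A_j.
# Each row is (a_i, a_j, c) and all rows are equally likely.
rows = [(ai, aj, ai ^ aj) for ai, aj in product([0, 1], repeat=2)]

def prob(event, given=lambda r: True):
    """P(event | given) under the uniform distribution over rows."""
    cond = [r for r in rows if given(r)]
    return Fraction(sum(1 for r in cond if event(r)), len(cond))

def is_pos(r):
    return r[2] == 1

p_c = prob(is_pos)                                   # P(C=1)
p_c_given_ai = prob(is_pos, lambda r: r[0] == 1)     # P(C=1 | A_i=1)
p_c_given_aj = prob(is_pos, lambda r: r[1] == 0)     # P(C=1 | A_j=0)
p_c_given_both = prob(is_pos, lambda r: r[0] == 1 and r[1] == 0)
```

Here conditioning on A_i alone or A_j alone leaves the class probability unchanged, while conditioning on both changes it, so a filter based only on single-feature relevance would discard both features.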

8 The Wrapper Approach [Kohavi, John 97]
- Search over possible feature subsets.
- Candidate feature subsets v are evaluated:
  - Construct a model from the features in v using the given learning method (= induction algorithm).
  - Evaluate the performance of the model using cross-validation.
  - Assign score f(v): average predictive accuracy in cross-validation.
- The best feature subset found is used to learn the final model.
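A minimal sketch of the scoring step f(v) in plain Python. The names `project`, `one_nn` and `wrapper_score` are hypothetical, the 1-nearest-neighbour learner is only a stand-in for whatever induction algorithm is being wrapped, and the fold assignment (index modulo number of folds) is an assumed simplification.

```python
def project(rows, subset):
    """Keep only the columns whose indices are listed in `subset`."""
    return [[row[i] for i in subset] for row in rows]

def one_nn(train_X, train_y, test_X):
    """Stand-in induction algorithm: 1-nearest-neighbour prediction."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    preds = []
    for x in test_X:
        nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
        preds.append(train_y[nearest])
    return preds

def wrapper_score(X, y, subset, learner=one_nn, folds=5):
    """f(v): average accuracy of `learner` over `folds` cross-validation
    folds, using only the features in `subset`."""
    Xv = project(X, subset)
    accuracies = []
    for k in range(folds):
        test_idx = [i for i in range(len(y)) if i % folds == k]
        train_idx = [i for i in range(len(y)) if i % folds != k]
        preds = learner([Xv[i] for i in train_idx],
                        [y[i] for i in train_idx],
                        [Xv[i] for i in test_idx])
        hits = sum(p == y[i] for p, i in zip(preds, test_idx))
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / folds
```

Passing the learner as a parameter keeps the scoring independent of the model, which is the point of the wrapper approach: the same search can be run around any induction algorithm.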

9 Feature Selection Search
The feature subset lattice for 4 attributes:
[Figure: lattice of feature subsets, each written as a 0/1 vector.]
E.g. (1,0,1,0) represents feature subset {A_1, A_3}.
The search space is too big for exhaustive search: n attributes give 2^n subsets.
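The bit-vector encoding of the lattice can be sketched as follows. The function names are hypothetical, and `children` assumes that lattice edges correspond to adding exactly one feature.

```python
from itertools import product

def subsets_as_bitvectors(n):
    """All 2**n feature subsets of n attributes, as 0/1 tuples."""
    return list(product([0, 1], repeat=n))

def decode(bits):
    """Names of the selected attributes for a 0/1 tuple."""
    return [f"A{i + 1}" for i, b in enumerate(bits) if b]

def children(bits):
    """Lattice successors: subsets obtained by adding exactly one feature."""
    return [bits[:i] + (1,) + bits[i + 1:]
            for i, b in enumerate(bits) if b == 0]
```

For 4 attributes the lattice has 16 nodes; for 30 attributes it already has 2**30 = 1,073,741,824, which is why the following slides turn to heuristic search.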

10-14 Greedy Search
[Figure sequence: step-by-step greedy expansion through the feature subset lattice; the first frame shows f(v) = 0.5, the scores in later frames were not recovered.]
Search terminates when no score improvement is obtained by expansion.
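A minimal sketch of greedy (hill-climbing) search with the termination rule stated above. Starting from the empty subset and expanding only by adding one feature are assumptions, and `toy_score` is a made-up scoring function standing in for the cross-validated f(v).

```python
def greedy_search(n_features, score):
    """Greedy forward selection over feature subsets.

    score: callable mapping a frozenset of feature indices to f(v).
    Returns the final subset and its score.
    """
    current = frozenset()
    current_score = score(current)
    while True:
        # Expand: every subset reachable by adding one more feature.
        candidates = [current | {i} for i in range(n_features)
                      if i not in current]
        if not candidates:
            return current, current_score
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score <= current_score:  # no improvement: terminate
            return current, current_score
        current, current_score = best, best_score

def toy_score(v):
    """Made-up f(v): features 0 and 2 help, features 1 and 3 hurt."""
    return (0.5 + 0.2 * (0 in v) + 0.1 * (2 in v)
            - 0.05 * (1 in v) - 0.05 * (3 in v))

subset, value = greedy_search(4, toy_score)
```

In practice `score` would wrap a cross-validation run, so each expansion costs one model evaluation per candidate subset.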

15-24 Best First Search
[Figure sequence: best-first expansion through the feature subset lattice, tracking the open list, the closed list and the current best subset; the first score shown is f(v) = 0.5, the remaining values were not recovered.]
Search continues until k consecutive expansions have not generated any score improvement for the feature subset.
Can also be used as an anytime algorithm: search continues indefinitely, and the current best is always available.
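A minimal sketch of best-first search with the stopping rule above. It makes several simplifying assumptions that may differ from the procedure in [Kohavi, John 97]: the search starts from the empty subset, nodes are expanded only by adding one feature, and the best subset is updated as soon as a child is scored.

```python
def best_first_search(n_features, score, k=5):
    """Best-first search over feature subsets with a patience of k.

    score: callable mapping a frozenset of feature indices to f(v).
    Stops after k consecutive expansions without improvement of the
    best score, or when the open list is empty.
    """
    start = frozenset()
    open_list = {start: score(start)}  # generated, not yet expanded
    closed = set()                     # already expanded
    best, best_score = start, open_list[start]
    stale = 0
    while open_list and stale < k:
        # Expand the highest-scoring open node.
        node = max(open_list, key=open_list.get)
        del open_list[node]
        closed.add(node)
        improved = False
        for i in range(n_features):
            child = node | {i}
            if child == node or child in closed or child in open_list:
                continue
            child_score = score(child)
            open_list[child] = child_score
            if child_score > best_score:
                best, best_score = child, child_score
                improved = True
        stale = 0 if improved else stale + 1
    return best, best_score
```

Unlike greedy search, a non-improving expansion does not end the search: lower-scoring nodes stay on the open list and can be expanded later, which lets the search escape some local maxima at the cost of more evaluations.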

25 Experimental Results: Accuracy
Results from [Kohavi, John 97]. Table gives accuracy in 10-fold cross-validation. Comparison of the algorithm using all features and the algorithm using selected features (-FSS). Here with greedy search.
[Table not recovered.]

26 Experimental Results: Number of Features
Results from [Kohavi, John 97]. Average number of features selected in 10-fold cross-validation.
[Table not recovered.]

27 Experimental Results: Accuracy
Results from [Kohavi, John 97]. Table gives accuracy in 10-fold cross-validation. Comparison of feature subset selection using hill climbing and best first search.
[Table not recovered.]

28 Experimental Results: Number of Features
Results from [Kohavi, John 97]. Average number of features selected in 10-fold cross-validation.
[Table not recovered.]

29 Feature Generation
Building new features:
- Discretization of continuous attributes
- Value grouping: e.g. reduce date of sale to month of sale
- Synthesize new features: e.g. from A_1, A_2 (continuous) compute A_new := A_1 / A_2
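The three kinds of generated features can be sketched as small helper functions. The function names, the cut points in the usage lines, and the convention of returning `None` when A_2 is zero are assumptions for illustration.

```python
from bisect import bisect_right
from datetime import date

def discretize(value, cut_points):
    """Discretization: map a continuous value to a bin index,
    given cut points sorted in increasing order."""
    return bisect_right(cut_points, value)

def month_of_sale(sale_date):
    """Value grouping: reduce a full date to its month (1-12)."""
    return sale_date.month

def ratio_feature(a1, a2):
    """Synthesized feature A_new := A_1 / A_2.
    Returns None when A_2 is 0 (an assumed convention)."""
    return a1 / a2 if a2 != 0 else None

age_bin = discretize(37.5, [18, 40, 65])
month = month_of_sale(date(2016, 3, 14))
ratio = ratio_feature(6.0, 4.0)
```

Generated features enlarge the candidate set, so they are typically passed through the same feature selection step as the original attributes.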
