COMP61011 Foundations of Machine Learning. Feature Selection

1 COMP61011 Foundations of Machine Learning: Feature Selection

2 Pattern Recognition: The Early Days Only 200 papers in the world! I wish!

3 Pattern Recognition: The Early Days "Using eight very simple measurements [...] a recognition rate of 95 percent on sampled and fresh material (using 50 specimens of each of the hand-printed letters A, B and C, and a self-organizing computer program based on the above considerations)." [Rutovitz, 1966] [Figure: scanned plot, not recoverable from the extraction. British Broadcasting Corporation]

4 Pattern Recognition: Then and Now Image recognition is still a major issue. But we've gone beyond 8x8 characters and dot-matrix printers! Then... Now!

5 Square Kilometre Array (due 2024) World's largest radio telescope array. 1 terabyte per second. Need to classify stellar objects in real time.

6 Supervised Learning Provided with N examples of the correct behavior: $D = \{X, Y\}$, where $X = \begin{pmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_d^{(N)} \end{pmatrix}$ and $Y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix}$. Some Terminology: D is the dataset. Each row of X is an example, a.k.a. datapoint/pattern. Each column of X is a feature, a.k.a. variable/input/attribute. Y is the vector of labels, a.k.a. targets/classes.

7 Supervised Learning Training data + labels Possibly high dimensional. Test input Model Label prediction

8 High Dimensional Data (this is real, on a USB stick on my desk: 41,672 features, 59 patients)

9 Supervised Learning Training data + labels Test input Model Label prediction

10 Supervised Learning + Feature Selection Training data + labels Select subset of features (i.e. columns) Test input Model Label prediction
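As a minimal sketch of what "select subset of features (i.e. columns)" means for the data matrix, the snippet below uses a made-up toy dataset. The values and the helper names `feature_column` and `select_features` are illustrative assumptions, not part of the course material.

```python
# Toy dataset (made-up values): N = 4 examples (rows), d = 3 features (columns).
X = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
    [5.8, 2.7, 5.1],
]
Y = [0, 0, 1, 1]  # one label per example (row of X)

N = len(X)     # number of examples
d = len(X[0])  # number of features


def feature_column(X, k):
    """Return feature k (0-based) as a list of its N values."""
    return [row[k] for row in X]


def select_features(X, subset):
    """Keep only the columns listed in `subset`; rows and labels are untouched."""
    return [[row[k] for k in subset] for row in X]


X_small = select_features(X, [0, 2])  # keep features 0 and 2, drop feature 1
print(X_small)
```

The same column subset has to be applied to any test input before it is passed to the model.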

11 The Wrapper approach You want to build a model, so just do it. Can we just do an exhaustive search? Encode each feature set as a bit string: a bit set to 1 means we use that feature, otherwise we do not [example bit string, involving 8 features, not recovered]. With M total features there are $2^M$ possible sets! 20 features: 1 million feature sets to check. 25 features: 33.5 million sets. 30 features: 1.1 billion sets. [Diagram loop: Try a feature set, Model, Evaluate the model]
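The bit-string encoding and the $2^M$ growth can be sketched as follows. This is only an illustration: `evaluate` stands in for "train the model on this feature set and score it", and the toy scoring function used here is a made-up assumption.

```python
from itertools import product


def all_feature_sets(M):
    """Yield every bit string of length M; bit k == 1 means feature k is used."""
    return product([0, 1], repeat=M)


def num_feature_sets(M):
    """Number of candidate feature sets for M features."""
    return 2 ** M


def exhaustive_search(M, evaluate):
    """Score every one of the 2**M bit strings and return the best one."""
    return max(all_feature_sets(M), key=evaluate)


def toy_evaluate(mask):
    # Hypothetical stand-in for a model's validation score:
    # features 0 and 2 help, and every feature used costs a little.
    return 2 * mask[0] + 2 * mask[2] - sum(mask)


best_mask = exhaustive_search(3, toy_evaluate)
print(best_mask)

for M in (20, 25, 30):
    print(M, num_feature_sets(M))
```

With a real model every call to `evaluate` is a full train-and-validate cycle, which is why the counts printed for 20, 25 and 30 features matter.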

12 The Wrapper approach You want to build a model, so just do it. Simplest strategy: greedy search. REPEAT: 1. Try out each of the remaining features with your model. 2. Add the best one. UNTIL satisfied with accuracy/error. [Diagram loop: Try a feature set, Model, Evaluate the model]
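The REPEAT/UNTIL loop above can be sketched as below. Two things are assumptions for the sake of a runnable example: the stopping rule (stop when no remaining feature improves the score) is one possible reading of "until satisfied", and `toy_evaluate` is a made-up stand-in for training and validating a real model.

```python
def greedy_forward_selection(M, evaluate, max_features=None):
    """Greedy forward search over feature indices 0..M-1.

    `evaluate(subset)` must return a score where higher is better,
    e.g. validation accuracy of a model trained on that subset.
    """
    selected = []
    remaining = list(range(M))
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        # 1. Try out each remaining feature together with the current set.
        scored = [(evaluate(selected + [k]), k) for k in remaining]
        score, k = max(scored)
        # UNTIL satisfied: here, stop as soon as nothing improves the score.
        if score <= best_score:
            break
        # 2. Add the best one.
        selected.append(k)
        remaining.remove(k)
        best_score = score
    return selected, best_score


def toy_evaluate(subset):
    # Hypothetical score: features 1 and 3 help, every other feature hurts a bit.
    useful = {1: 0.3, 3: 0.2}
    return sum(useful.get(k, -0.05) for k in subset)


selected, score = greedy_forward_selection(5, toy_evaluate)
print(selected, score)
```

Backward search is the mirror image: start with all features and repeatedly remove the one whose removal hurts least.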

13 [Plot: number of evaluations necessary against number of features, for exhaustive search and for forward/backward search]

14 Visualising the search space Greedy forward search evaluates $\frac{M(M+1)}{2}$ sets.
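The count follows from summing the work done in each greedy round, assuming the search runs until all M features have been added. This is a sketch of the arithmetic rather than something stated on the slides:

```latex
\[
\underbrace{M}_{\text{round 1}} + \underbrace{(M-1)}_{\text{round 2}} + \cdots + \underbrace{1}_{\text{round } M}
= \sum_{j=1}^{M} j = \frac{M(M+1)}{2}
\]
\[
M = 20:\quad \frac{20 \cdot 21}{2} = 210 \ \text{sets, versus}\ 2^{20} = 1{,}048{,}576 \ \text{for exhaustive search.}
\]
```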

15 Why can't we get a bigger computer? With M features there are $2^M$ possible feature subsets. Exhaustive enumeration is feasible only for small (M up to about 20) domains. Could use clever search (Genetic Algs, Simulated Annealing, etc.), but ultimately... NP-hard problem!

16 Maybe we cannot, or don't want to, build a classifier. How inherently useful is a feature?

17 Can we say how useful a feature is? Imagine you're trying to guess the price of a car. Relevant: engine size, age, mileage, presence of rust, ... Irrelevant: color of windscreen wipers, size of wheels, stickers on window, ... Redundant: age / mileage.

18 Filters A filter evaluates statistics of the data. Univariate filters evaluate each feature independently. Multivariate filters evaluate features in the context of others. Also... some data is ordered, e.g. 1, 2, 3. Some is not, e.g. dog, cat, sheep (i.e. categorical). A filter statistic must take this into account.

19 Relevancy = Correlation? How often have you heard the phrase "X is correlated with Y"?

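20 Pearson's Correlation Coefficient Feature: $x_k = \{x_k^{(1)}, \ldots, x_k^{(N)}\}^T$. Target: $y = \{y^{(1)}, \ldots, y^{(N)}\}^T$.
$$r(x, y) = \frac{\sum_{i=1}^{N} (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x^{(i)} - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y^{(i)} - \bar{y})^2}}$$
[Example scatter plots: r = +0.5, r = 0.0, r = -0.5]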

21 Pearson's Correlation Coefficient $x_k = \{x_k^{(1)}, \ldots, x_k^{(N)}\}$, $k = 1..M$; $y = \{y^{(1)}, \ldots, y^{(N)}\}$. The estimated utility for feature $X_k$ is: $J(X_k) = |r(x_k, y)|$ (i.e. absolute correlation with target). Algorithm: 10. Rank features in descending order by J. 20. Evaluate predictor on M nested subsets. 30. Choose subset with lowest validation error.
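Steps 10 and 20 of the algorithm can be sketched as below. The data is a made-up toy example, the function names are illustrative, and step 30 (validating a predictor on each nested subset) is left out because it depends on the model being used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # convention chosen here: a constant variable gets r = 0
    return sxy / sqrt(sxx * syy)


def rank_by_abs_correlation(X, y):
    """Step 10: score each column by J = |r| and sort in descending order."""
    d = len(X[0])
    scores = [abs(pearson_r([row[k] for row in X], y)) for k in range(d)]
    order = sorted(range(d), key=lambda k: scores[k], reverse=True)
    return order, scores


def nested_subsets(order):
    """Step 20: the M nested subsets {best}, {best two}, ..., {all}."""
    return [order[:m] for m in range(1, len(order) + 1)]


# Toy data (made up): 5 examples, 3 features.
X = [[2, -2, 3], [1, -4, 1], [4, -6, 2], [3, -8, 1], [5, -10, 3]]
y = [1, 2, 3, 4, 5]

order, scores = rank_by_abs_correlation(X, y)
print(order, scores)
print(nested_subsets(order))
```

Feature 1 is perfectly anti-correlated with the target, and it still ranks first because J uses the absolute value.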

22 All these have the same r [scatter plots and value not recovered]. Pearson only detects LINEAR relationships... and it is only for one feature ("univariate")... and it is assuming two real-valued variables.
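A small made-up example of the linearity limitation: below, one target is a linear function of x and the other is a quadratic function of x over a range symmetric around zero. Both are fully determined by x, but only the linear one gets a high correlation. The data and names are illustrative assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


x = [i / 10 for i in range(-10, 11)]  # -1.0, -0.9, ..., 1.0
y_linear = [2 * v + 1 for v in x]     # linear dependence on x
y_quadratic = [v * v for v in x]      # deterministic, but not linear

r_lin = pearson_r(x, y_linear)
r_quad = pearson_r(x, y_quadratic)
print(r_lin, r_quad)
```

Here `r_lin` comes out at 1 and `r_quad` at 0 (up to floating-point error), even though the quadratic target is a perfect function of x.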

24 How about a classification problem? Let's use a simple threshold on variable X. Each point is a person in your database. Green stars = good health. Red circles = bad health. [Figure: points along X, from Low to High] Useful feature. Discriminates very well.

25 How about a classification problem? Let's use a simple threshold on variable X. Each point is a person in your database. Green stars = good health. Red circles = bad health. [Figure: points along X, from Low to High] No useful threshold! Feature is not discriminative.

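26 Fisher Score $F = \frac{(m_1 - m_2)^2}{v_1 + v_2}$, where $m_1, m_2$ are the class means and $v_1, v_2$ the class variances of the feature. $(m_1 - m_2)^2$ is called the between-class scatter: BIG for good features. $v_1 + v_2$ is called the within-class scatter: SMALL for good features.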
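A minimal sketch of the Fisher score for one feature and two classes. It assumes labels 0 and 1 and uses the population variance (dividing by the class size); the slides do not say which variance estimator is intended, and the toy values are made up.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance (divides by n); an assumption, see lead-in."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def fisher_score(values, labels):
    """F = (m1 - m2)^2 / (v1 + v2) for a single feature and labels in {0, 1}."""
    class_1 = [v for v, l in zip(values, labels) if l == 0]
    class_2 = [v for v, l in zip(values, labels) if l == 1]
    m1, m2 = mean(class_1), mean(class_2)
    v1, v2 = variance(class_1), variance(class_2)
    # Note: undefined (division by zero) if both classes have zero variance.
    return (m1 - m2) ** 2 / (v1 + v2)


labels = [0, 0, 0, 1, 1, 1]
good_feature = [1, 2, 3, 7, 8, 9]  # classes well separated
bad_feature = [1, 5, 9, 2, 5, 8]   # class means coincide

f_good = fisher_score(good_feature, labels)
f_bad = fisher_score(bad_feature, labels)
print(f_good, f_bad)
```

The well-separated feature has a large between-class scatter and a small within-class scatter, so its score is high; the other feature has identical class means, so its score is zero.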

28 How useful is a single measurement? Imagine a feature... [Figure axis: Small value to Big value] Guyon & Elisseeff, Introduction to Feature Selection, Journal of Machine Learning Research 2004

29 Considering features together [Figure axes: Low to High, and Small value to Big value] Guyon & Elisseeff, Introduction to Feature Selection, Journal of Machine Learning Research 2004

30 Two irrelevant features may be relevant together Guyon & Elisseeff, Introduction to Feature Selection, Journal of Machine Learning Research 2004
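One way to see this effect is a discrete XOR problem, used here as a stand-in for the continuous figure in Guyon & Elisseeff (the data and the helper `best_lookup_accuracy` are illustrative assumptions). Each feature alone gives chance-level accuracy, while the pair determines the label exactly.

```python
from collections import Counter, defaultdict


def best_lookup_accuracy(rows, labels):
    """Accuracy of predicting the majority label for each distinct input row.

    This is the best any classifier can do on this (training) data
    when it only sees the given columns.
    """
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row)][label] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(labels)


# Noise-free XOR: the label is 1 exactly when the two features differ.
X = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
y = [a ^ b for a, b in X]

acc_x1 = best_lookup_accuracy([(a,) for a, _ in X], y)    # feature 1 alone
acc_x2 = best_lookup_accuracy([(b,) for _, b in X], y)    # feature 2 alone
acc_both = best_lookup_accuracy(X, y)                     # both together
print(acc_x1, acc_x2, acc_both)
```

A univariate filter would score both features as useless here, which is the motivation for measures that consider features in the context of others.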

31 How useful is a feature? Need some kind of dependency measure, e.g. Pearson's correlation... but assumes linearity. Fisher score... but assumes Gaussianity. And both ignore feature interactions.

32 Mutual Information X is relevant to Y if they are dependent, i.e. $p(y|x) \neq p(y)$, or... $p(xy) \neq p(x)p(y)$. So let's measure the KL-divergence between these distributions:
$$J(X_k) = I(X_k; Y) = \sum_{x \in X_k} \sum_{y \in Y} p(xy) \log \frac{p(xy)}{p(x)p(y)}$$
We rank features by their score J.

33 Mutual Information
$$J(X_k) = I(X_k; Y) = \sum_{x \in X_k} \sum_{y \in Y} p(xy) \log \frac{p(xy)}{p(x)p(y)}$$
Measures dependency of X, Y. Zero when independent. Maximal when identical.
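For discrete variables the formula can be estimated by replacing each probability with an empirical frequency. The sketch below does that; the choice of log base 2 (so the result is in bits) and the toy data are assumptions, since the slides do not fix a base.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits from two equal-length discrete lists."""
    n = len(x)
    count_xy = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    mi = 0.0
    # Only pairs that actually occur contribute (p(xy) > 0).
    for (a, b), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[a] / n
        p_y = count_y[b] / n
        mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi


y = [0, 0, 1, 1]
x_identical = [0, 0, 1, 1]    # same as y
x_independent = [0, 1, 0, 1]  # empirically independent of y

mi_identical = mutual_information(x_identical, y)
mi_independent = mutual_information(x_independent, y)
print(mi_identical, mi_independent)
```

The identical feature reaches 1 bit, which is the entropy of this balanced binary target, and the independent feature scores 0. Unlike Pearson's correlation, the same function works for categorical values such as "dog", "cat", "sheep".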

34 Filter methods: Three Ingredients 1. Dependency measure. 2. Search procedure. 3. Stopping criterion. [Diagram: X, Y, J(X;Y) = 0.6. Select / discard? Selected set S.] Iteratively add/remove features. Select most relevant features. Discard irrelevant features. J(X;Y) is the dependency criterion, e.g. Pearson's correlation, Fisher score, Mutual Information.

35 Filter Ranking using Mutual Information Rank features $X_k, \forall k$ by their values of $J = I(X_k; Y)$. Retain the highest ranked features, discard the lowest ranked. [Table with columns i and $J(X_k)$; entries not recovered] Cut-off point decided by user, e.g. $|S| = 5$, so $S = \{35, 42, 10, 654, 22\}$. ...but... what if I tell you features 42 and 10 are almost identical?!
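The ranking procedure, and the redundancy problem raised at the end of the slide, can be sketched on a made-up dataset where feature 1 is an exact copy of feature 0. The function names, the cut-off, and the data are illustrative assumptions, and the mutual information estimate is a simple plug-in one in bits.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits from two equal-length discrete lists."""
    n = len(x)
    count_xy, count_x, count_y = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(
        (c / n) * log2((c / n) / ((count_x[a] / n) * (count_y[b] / n)))
        for (a, b), c in count_xy.items()
    )


def rank_features_by_mi(X, y, S):
    """Score every column by J = I(X_k; Y) and keep the S highest ranked."""
    d = len(X[0])
    scores = [mutual_information([row[k] for row in X], y) for k in range(d)]
    order = sorted(range(d), key=lambda k: scores[k], reverse=True)
    return order[:S], scores


# Toy data (made up): 8 examples, 4 binary features.
# Feature 0 equals the label, feature 1 is a copy of feature 0,
# feature 2 is partly informative, feature 3 is independent of the label.
y = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
X = [list(row) for row in zip(*columns)]

selected, scores = rank_features_by_mi(X, y, S=2)
print(selected, scores)
```

With a cut-off of 2 the ranking keeps features 0 and 1, which carry exactly the same information, and discards feature 2. Scoring each feature on its own cannot notice this kind of redundancy.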

36 Feature Selection
Useful to:
- Reduce chance of overfitting
- Reduce computational complexity at test time
- Increase interpretability
Many methods:
- Wrappers vs Filters, pros and cons of each
- Many variants of filters.

37 This is the End of COMP61011. That's it. We're done. Exam in January; past papers on website. Projects due next Friday, 4pm.
