Feature selection
Javier Béjar (LSI - FIB)
Term 2011/2012



Outline
1. Dimensionality reduction
2. Projections
3. Attribute selection

1. Dimensionality reduction

High Dimensional Data
- Several problems arise from the dimensionality of a dataset:
  - A good hypothesis is more difficult to find
  - The quality of the data suffers (noise, irrelevant information)
  - The computational cost of processing the data grows (scalability of the algorithms)
- Two elements define the dimensionality of a dataset:
  - The number of examples
  - The number of attributes
- Having too many examples can usually be solved by sampling
- There are different approaches to reducing the number of attributes

Reducing attributes
- The number of attributes of the dataset usually has an impact on the performance of the algorithms:
  - Because of their poor scalability (cost is a function of the number of attributes)
  - Because of their inability to cope with irrelevant, noisy or redundant attributes
- There are two main methodologies to reduce the number of attributes of a dataset:
  - Transforming the data to a space of fewer dimensions that preserves the original data as far as possible (dimensionality reduction)
  - Eliminating the attributes that are not relevant for the goal task (feature subset selection)

Dimensionality reduction
- We are looking for a new dataset that preserves the information of the original dataset but has fewer attributes
- Many techniques have been developed for this purpose:
  - Projection to a space that preserves the statistical model of the data (PCA, ICA)
  - Projection to a space that preserves the distances among the data (Singular Value Decomposition, Multidimensional Scaling, Random Projection, Nonlinear Scaling)

2. Projections

Projections
- We transform the dataset to another feature space
- Principal Component Analysis (PCA):
  - We assume that the attributes follow a Gaussian distribution
  - Data is projected onto a set of orthogonal dimensions (components) that are linear combinations of the original attributes. Global variance is preserved.
  - The new dimensions are uncorrelated and can be ordered by the amount of the original information they preserve, so we can keep the subset that preserves the most information
- Independent Component Analysis (ICA):
  - We assume non-Gaussian data
  - The data is projected onto a set of statistically independent variables (all statistical moments are independent)

Principal Component Analysis
- We look for a projection of the original space onto a space with orthogonal (linearly independent) dimensions
[Figure: data plotted on the original axes X and Y, with the new axes w1*y + w2*x and w3*y + w4*x]

Principal Component Analysis
- Principal components are an ordered set of vectors that give the best linear approximation of the data:
  $f(\lambda) = \mu + V_q \lambda$
  where $\mu$ is a location vector in $\mathbb{R}^p$, $V_q$ is a $p \times q$ matrix of $q$ orthogonal unit vectors, and $\lambda$ is a vector of $q$ parameters
- We want to minimize the reconstruction error for the data (the quadratic loss):
  $\min_{\mu, \{\lambda_i\}, V_q} \sum_{i=1}^{n} \| x_i - \mu - V_q \lambda_i \|^2$

Principal Component Analysis
- Optimizing partially for $\mu$ and $\lambda_i$:
  $\mu = \bar{x}$
  $\lambda_i = V_q^T (x_i - \bar{x})$
- We can then obtain the matrix $V_q$ by minimizing:
  $\min_{V_q} \sum_{i=1}^{n} \| (x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x}) \|^2$
- This problem has many solutions
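A short sketch of why the partial optimization gives these values, assuming (as in the definition of $V_q$) that its columns are orthonormal, so that $V_q^T V_q = I_q$ and $V_q V_q^T$ is a symmetric idempotent projection matrix:

```latex
\begin{align*}
% Fix \mu and V_q: the loss is an ordinary least-squares problem in each \lambda_i
\frac{\partial}{\partial \lambda_i} \| x_i - \mu - V_q \lambda_i \|^2
  &= -2 V_q^T (x_i - \mu - V_q \lambda_i) = 0
  \;\Rightarrow\; \lambda_i = V_q^T (x_i - \mu) \\
% Substitute \lambda_i back: the residual becomes (I - V_q V_q^T)(x_i - \mu)
\frac{\partial}{\partial \mu} \sum_{i=1}^{n} \| (I - V_q V_q^T)(x_i - \mu) \|^2
  &= -2 (I - V_q V_q^T) \sum_{i=1}^{n} (x_i - \mu) = 0
\end{align*}
```

The second condition holds for $\mu = \bar{x}$, which is the conventional choice, and substituting it into the first line gives $\lambda_i = V_q^T (x_i - \bar{x})$.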

Principal Component Analysis
- Assuming $\bar{x} = 0$ we can rewrite the problem as:
  $\min_{V_q} \sum_{i=1}^{n} \| x_i - V_q V_q^T x_i \|^2$
- The projection matrix $H_q = V_q V_q^T$ can be obtained in two different ways (here $X$ is the $n \times p$ data matrix, one example per row):
  1. From the diagonalization of the covariance matrix $\frac{1}{n} X^T X = P D P^{-1}$, where the diagonal matrix $D$ holds the eigenvalues and the orthogonal matrix $P$ holds the eigenvectors. Each eigenvalue is the variance captured by its component, and the reconstruction error is proportional to the sum of the eigenvalues of the discarded components.
  2. From the SVD decomposition of the data matrix $X = U D V^T$, where $U$ is an $n \times p$ orthogonal matrix whose columns are the left singular vectors, $D$ is a $p \times p$ diagonal matrix whose ordered diagonal values are the singular values, and $V$ is a $p \times p$ orthogonal matrix whose columns are the right singular vectors. The columns of $UD$ are the principal components.
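The two routes to the projection matrix can be compared numerically. The following is a minimal NumPy sketch, not the course's implementation: the function names `pca_eig` and `pca_svd`, the synthetic data and the choice q = 2 are assumptions made for illustration, and the data matrix is taken to be n x p with one example per row.

```python
import numpy as np

def pca_eig(X, q):
    """PCA from the eigendecomposition of the covariance matrix (1/n) X^T X."""
    Xc = X - X.mean(axis=0)                  # center the data (mean 0 is assumed)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]        # reorder from largest to smallest
    return eigvecs[:, order[:q]], eigvals[order]

def pca_svd(X, q):
    """PCA from the SVD of the centered data matrix X = U D V^T."""
    Xc = X - X.mean(axis=0)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:q].T, d                       # first q right singular vectors

# Hypothetical data: 200 examples, 3 correlated attributes
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.2, 0.1]])
X = rng.normal(size=(200, 3)) @ mixing
q = 2

V_eig, eigvals = pca_eig(X, q)
V_svd, d = pca_svd(X, q)

# The individual vectors may differ in sign, but the projection matrices agree
H_eig = V_eig @ V_eig.T
H_svd = V_svd @ V_svd.T

# Reconstruction error of the rank-q projection
Xc = X - X.mean(axis=0)
recon_error = np.sum((Xc - Xc @ H_svd) ** 2)

print(np.allclose(H_eig, H_svd))
print(eigvals, d ** 2 / X.shape[0])
print(recon_error, X.shape[0] * eigvals[q:].sum())
```

The eigenvalues equal the squared singular values divided by n, and the reconstruction error equals n times the sum of the discarded eigenvalues, which is the relation between eigenvalue magnitude and reconstruction error.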

Multidimensional Scaling
- A transformation matrix transforms a dataset from M dimensions to N dimensions, preserving the pairwise distances among the data
[Figure: transformation matrix of size M x N]

Multidimensional Scaling
- Multidimensional Scaling projects a dataset to a space with fewer dimensions, preserving the pairwise distances among the data
- A projection matrix is obtained by optimizing a function of the pairwise distances (stress function)
- This means that the actual attributes are not used in the transformation
- There are different objective functions that can be used (least squares, Sammon mapping, classical scaling, ...)
- The optimization problem is solved by gradient descent
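As an illustration, here is a minimal NumPy sketch of classical scaling, one of the objective functions listed above. Classical scaling has a closed-form solution through an eigendecomposition, so this sketch does not use gradient descent; the least squares and Sammon variants would need an iterative optimizer. The function names and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, n_components):
    """Classical scaling: embed points using only the distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)     # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(lam)

# Hypothetical data: 30 points that are intrinsically 2-D, embedded in 5-D
rng = np.random.default_rng(0)
points_2d = rng.normal(size=(30, 2))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))     # random orthonormal basis
X = points_2d @ Q[:2]                            # 30 x 5, distances unchanged

D = pairwise_distances(X)
Y = classical_mds(D, n_components=2)             # only D is used, not X

print(np.allclose(pairwise_distances(Y), D, atol=1e-6))
```

Note that `classical_mds` receives only the distance matrix, which reflects the point that the actual attributes are not used in the transformation. Because the example data is exactly two-dimensional, the two-dimensional embedding reproduces the distances; on real data some distortion remains.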

Projections - Example
[Figure: four panels: Data, Classical MDS, ISOMAP, PCA]

3. Attribute selection

Attribute selection - Filters
- Filters: we assume that we have an evaluation measure that assesses, for each attribute, its relevance with respect to the target concept
- A ranking of the attributes is computed from their individual relevance to the target (computationally cheap)
- A cutting point is chosen in the ranking and the top-ranked attributes are selected
- Examples: Entropy (ID3), $\chi^2$ test, Relief
[Figure: a dataset with attributes A1, A2, A3, A4 and class C. The evaluation measure Ev gives Ev(A1,C) > Ev(A2,C) > Ev(A3,C) > Ev(A4,C), so the ranking is A1, A2, A3, A4]
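A filter of this kind can be sketched with entropy as the evaluation measure, using the information gain of each attribute with respect to the class (the criterion used by ID3). This is a minimal sketch for nominal attributes; the function names, the toy dataset and the cutting point of 2 are assumptions chosen for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, target):
    """Reduction in class entropy obtained by splitting on one attribute."""
    gain = entropy(target)
    for value in np.unique(attribute):
        mask = attribute == value
        gain -= mask.mean() * entropy(target[mask])
    return gain

def rank_attributes(X, y, names):
    """Rank attributes by their individual relevance to the target."""
    scores = [(name, information_gain(X[:, j], y)) for j, name in enumerate(names)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy dataset: A1 copies the class, A2 is partly informative, A3 is irrelevant
X = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

ranking = rank_attributes(X, y, ["A1", "A2", "A3"])
selected = [name for name, _ in ranking[:2]]   # cutting point: keep the top 2
print(ranking)
print(selected)
```

Each attribute is scored on its own, which is what makes the filter cheap, and also why it cannot detect attributes that are only useful in combination.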

Attribute selection - Filter - Relief-F
- Relief-F is a filter method based on k-nearest neighbors
- We assume that the best attributes are those able to classify an example correctly
- The accuracy of an attribute is estimated by local approximation (k-NN)
- For a subset of examples:
  - Retrieve the k nearest neighbors from the same class
  - Retrieve the k nearest neighbors from a different class
  - For each attribute, accumulate positive or negative weight according to whether its value coincides with that of the neighbors

Attribute selection - Filter - Relief-F
Procedure: Relief-F
Input: W, vector of feature weights initialized to 0
       X, random sample of the dataset
foreach x ∈ X do
    NH ← k nearest neighbors of x of the same class (near hits)
    NM ← k nearest neighbors of x of a different class (near misses)
    foreach n ∈ NM and all features i do
        if n_i = x_i then decrease the value of w_i
    foreach n ∈ NH and all features i do
        if n_i = x_i then increase the value of w_i
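The procedure can be sketched in NumPy for nominal attributes. This is a simplified illustration, not the published Relief-F algorithm in full: it assumes the weight update is driven by coincidences in attribute values (as described for the accumulation step), uses Hamming distance to find neighbors, and normalizes the weights by the number of updates. The function name `relief_f`, its parameters and the toy data are assumptions.

```python
import numpy as np

def relief_f(X, y, k=2, n_samples=None, seed=0):
    """Simplified Relief-F weights for nominal attributes."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_samples = n if n_samples is None else n_samples
    w = np.zeros(p)                                   # feature weights start at 0
    for idx in rng.choice(n, size=n_samples, replace=False):
        x = X[idx]
        dist = (X != x).sum(axis=1)                   # Hamming distance to x
        same = np.flatnonzero(y == y[idx])
        same = same[same != idx]                      # x is not its own neighbor
        other = np.flatnonzero(y != y[idx])
        near_hits = same[np.argsort(dist[same])[:k]]
        near_misses = other[np.argsort(dist[other])[:k]]
        # Agreeing with a near miss counts against the feature
        w -= (X[near_misses] == x).sum(axis=0)
        # Agreeing with a near hit counts in favor of the feature
        w += (X[near_hits] == x).sum(axis=0)
    return w / (n_samples * k)                        # scale to [-1, 1]

# Toy dataset: attribute 0 copies the class, the others are less informative
X = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

weights = relief_f(X, y, k=2)
print(weights)
```

Attribute 0 always agrees with its near hits and never with its near misses, so it receives the maximum weight of 1; the other attributes score lower. A ranking with a cutting point can then be applied to the weights as with any other filter.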

Attribute selection - Wrappers
- Wrappers: we take into account the interaction among attributes
- Subsets of features are evaluated until the most adequate one is found ($2^n$ subsets)
- A learning algorithm is used to assess the quality of each subset
- Exhaustive search is infeasible
- Local search: Hill-climbing, Simulated Annealing, Best First, Beam Search, Genetic algorithms, ...
- Two greedy search strategies: Forward Selection, Backward Elimination
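A greedy forward selection wrapper can be sketched as follows. The learning algorithm used to score each subset is an assumption made for the example (leave-one-out accuracy of a 1-nearest-neighbor classifier); any learner with an accuracy estimate could take its place. The function names and the synthetic data are also illustrative.

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)         # an example cannot be its own neighbor
    return float(np.mean(y[dist.argmin(axis=1)] == y))

def forward_selection(X, y, evaluate):
    """Start from the empty set and add the best attribute while the score improves."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        scored = [(evaluate(X[:, selected + [j]], y), j) for j in remaining]
        # Highest score wins; ties go to the lowest attribute index
        score, j = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best_score:
            break                          # no candidate improves the current subset
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score

# Hypothetical data: attribute 0 separates the classes, attributes 1-2 are noise
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
informative = 4.0 * y + rng.uniform(-0.5, 0.5, size=40)
noise = rng.normal(size=(40, 2))
X = np.column_stack([informative, noise])

selected, score = forward_selection(X, y, loo_1nn_accuracy)
print(selected, score)
```

Each step trains and evaluates one model per candidate attribute, so the search evaluates at most n(n+1)/2 subsets instead of all $2^n$, at the price of possibly missing the best subset.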

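Attribute selection - Wrappers
[Figure: backward elimination over attributes A1, A2, A3, A4 and class C. Each candidate subset is evaluated with a model M. The search starts at (A1,A2,A3,A4); the three-attribute subsets shown are (A1,A2,A3), (A1,A2,A4), (A1,A3,A4) and (A2,A3,A4); the two-attribute subsets shown are (A1,A2), (A1,A4) and (A2,A4); the single-attribute subsets shown are (A1) and (A2).]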
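Backward elimination can be sketched in the same spirit: start from the full attribute set and drop one attribute at a time as long as the model does not get worse. As before, the evaluation model (leave-one-out 1-nearest-neighbor accuracy), the function names and the synthetic four-attribute dataset are assumptions made for illustration.

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)         # an example cannot be its own neighbor
    return float(np.mean(y[dist.argmin(axis=1)] == y))

def backward_elimination(X, y, evaluate):
    """Start from all attributes and remove one while the score does not drop."""
    selected = list(range(X.shape[1]))
    best_score = evaluate(X[:, selected], y)
    while len(selected) > 1:
        scored = []
        for j in selected:
            subset = [f for f in selected if f != j]
            scored.append((evaluate(X[:, subset], y), j))
        # Best subset after one removal; ties go to the highest attribute index
        score, j = max(scored)
        if score < best_score:
            break                          # every removal hurts the model
        selected.remove(j)
        best_score = score
    return selected, best_score

# Hypothetical data: attribute 0 (A1) separates the classes, A2-A4 are noise
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
informative = 10.0 * y + rng.uniform(-0.5, 0.5, size=40)
noise = rng.uniform(-1.0, 1.0, size=(40, 3))
X = np.column_stack([informative, noise])

selected, score = backward_elimination(X, y, loo_1nn_accuracy)
print(selected, score)
```

Accepting removals that leave the score unchanged is a design choice that favors smaller subsets; requiring a strict improvement would instead stop at the full set in this example.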
