Data Preprocessing
Javier Béjar - Spring 2018 - CS - MAI
Introduction
Data representation
- Unstructured datasets: examples described by a flat set of attributes (attribute-value matrix)
- Structured datasets: individual examples described by attributes but with relations among them: sequences (time, spatial, ...), trees, graphs
- Sets of structured examples (sequences, graphs, trees)
Unstructured data
- Only one table of observations
- Each example represents an instance of the problem
- Each instance is represented by a set of attributes (discrete, continuous)

  A   B     C
  1   3.1   a
  1   5.7   b
  0  -2.2   b
  1  -9.0   c
  0   0.3   d
  1   2.1   a
  ...
Structured data
- One sequential relation among instances (time, strings):
  - Several instances with internal structure
  - Subsequences of unstructured instances
  - One large instance
- Several relations among instances (graphs, trees):
  - Several instances with internal structure
  - One large instance
Data Streams
- Endless sequence of data
- Several streams synchronized
- Unstructured instances
- Structured instances
- Static/Dynamic model
Data representation
- Most unsupervised learning algorithms are specifically fitted for unstructured data
- The data representation is equivalent to a database table (attribute-value pairs)
- Specialized algorithms have been developed for structured data: graph clustering, sequence mining, frequent substructures
- The representation of these types of data is sometimes algorithm dependent
Data Preprocessing
Data preprocessing
- Usually raw data is not directly adequate for analysis
- The usual reasons:
  - The quality of the data (noise/missing values/outliers)
  - The dimensionality of the data (too many attributes/too many examples)
- The first step of any data task is to assess the quality of the data
- The techniques used for data preprocessing are usually oriented to unstructured data
Outliers
- Outliers: examples with extreme values compared to the rest of the data
- Can be considered as examples with erroneous values
- Have an important impact on some algorithms
Outliers
- The exceptional values could appear in all or only a few attributes
- The usual way to correct this problem is to eliminate the examples
- If the exceptional values appear only in a few attributes, these can be treated as missing values
Parametric Outlier Detection
- Assumes a probabilistic distribution for the attributes
- Univariate: perform a Z-test or Student's t-test
- Multivariate:
  - Deviation method: reduction in data variance when the example is eliminated
  - Angle based: variance of the angles to other examples
  - Distance based: variation of the distance from the mean of the data in different dimensions
Nonparametric Outlier Detection
- Histogram based: define a multidimensional grid and discard cells with low density
- Distance based: the distances of outliers to their k nearest neighbors are larger
- Density based: approximate data density using kernel density estimation or heuristic measures (Local Outlier Factor, LOF)
Outliers: Local Outlier Factor
- LOF quantifies the outlierness of an example, adjusting for variation in data density
- Uses the distance to the k-th neighbor of an example, D_k(x), and the set of examples inside this distance, L_k(x)
- The reachability distance between two data points, R_k(x, y), is defined as the maximum of the distance dist(x, y) and y's k-th neighbor distance D_k(y)
Outliers: Local Outlier Factor
- The average reachability distance AR_k(x) of an example with respect to its neighborhood L_k(x) is defined as the average of the reachabilities of the example to its neighbors
- The LOF of an example is computed as the mean ratio between AR_k(x) and the average reachability of its k neighbors:

  LOF_k(x) = (1/k) * sum_{y in L_k(x)} AR_k(x) / AR_k(y)

- This value ranks all the examples
Outliers: Local Outlier Factor
[Figure: LOF outlier scores on two 2D datasets]
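A minimal sketch of LOF in practice, using scikit-learn's LocalOutlierFactor on made-up 2D data (the cloud and the isolated point are illustrative, not from the slides):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A dense 2D cloud plus one point far away from it (made-up data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # -1 marks detected outliers
scores = -lof.negative_outlier_factor_  # LOF_k(x); larger means more outlying
```

The isolated point receives a LOF score far above 1, while points inside the cloud score close to 1.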
Missing values
- Missing values appear because of errors or omissions during the gathering of the data
- They can be substituted to increase the quality of the dataset (value imputation):
  - Global constant for all the values
  - Mean or mode of the attribute (global central tendency)
  - Mean or mode of the attribute among the k nearest examples (local central tendency)
  - Learn a model for the data (regression, Bayesian) and use it to predict the values
- Problem: the statistical distribution of the data is changed
Missing values
[Figure: data with missing values vs. mean substitution vs. 1-neighbor substitution]
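The global and local central tendency strategies can be sketched with scikit-learn's imputers (the tiny dataset is made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# A tiny dataset with one missing value (made-up numbers)
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, 6.0],
              [5.0, 8.0]])

# Global central tendency: replace with the attribute mean, (1+3+5)/3 = 3.0
mean_imp = SimpleImputer(strategy='mean').fit_transform(X)

# Local central tendency: mean of the attribute over the 2 nearest examples
knn_imp = KNNImputer(n_neighbors=2).fit_transform(X)
```

Note how the two strategies fill the same cell differently: the global mean uses all examples, while the k-NN imputer only uses the closest ones.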
Normalization
- Normalizations are applied to quantitative attributes to eliminate the effect of different scales of measurement
- Range normalization: transform all the values of the attribute to a pre-established scale (e.g. [0,1], [-1,1]):

  x' = (x - x_min) / (x_max - x_min)

- Distribution normalization: transform the data to a specific statistical distribution with pre-established parameters (e.g. Gaussian N(0, 1)):

  x' = (x - mu_x) / sigma_x
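Both formulas above are one-liners in NumPy; a minimal sketch on a made-up attribute:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])   # made-up attribute values

# Range normalization to [0, 1]: x' = (x - x_min) / (x_max - x_min)
x_range = (x - x.min()) / (x.max() - x.min())

# Distribution normalization to zero mean and unit variance: x' = (x - mu) / sigma
x_std = (x - x.mean()) / x.std()
```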
Discretization
- Discretization transforms quantitative attributes into qualitative attributes
- Equal-size bins: pick the number of values and divide the range of the data into bins of equal size
- Equal-frequency bins: pick the number of values and divide the range of the data so each bin has the same number of examples (the sizes of the intervals will differ)
Discretization
- Distribution approximation: compute a histogram of the data and fit a kernel density estimate (KDE); the interval boundaries are placed where the function has its minima
- Other techniques: entropy-based measures, MDL, or even clustering
Discretization
[Figure: equal-size bins vs. equal-frequency bins vs. histogram-based discretization]
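The difference between equal-size and equal-frequency bins shows up clearly on skewed data; a sketch using scikit-learn's KBinsDiscretizer (the values are made up):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# A skewed attribute: one value much larger than the rest (made-up data)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])

# Equal-size bins: the range [1, 100] is split into bins of equal width
eq_width = KBinsDiscretizer(n_bins=2, encode='ordinal',
                            strategy='uniform').fit_transform(X)

# Equal-frequency bins: each bin receives the same number of examples
eq_freq = KBinsDiscretizer(n_bins=2, encode='ordinal',
                           strategy='quantile').fit_transform(X)
```

With equal-size bins the outlying value 100 gets a bin almost to itself, while equal-frequency bins split the examples 3/3.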
Python Notebooks
- These two Python notebooks show some examples of the effect of missing values imputation, data discretization, and normalization:
  - Missing Values notebook (click here to go to the URL)
  - Preprocessing notebook (click here to go to the URL)
- If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open them)
Dimensionality Reduction
The curse of dimensionality
- Problems due to the dimensionality of data:
  - The computational cost of processing the data
  - The quality of the data
- Elements that define the dimensionality of data:
  - The number of examples
  - The number of attributes
- Usually the problem of having too many examples can be solved by sampling
Reducing attributes
- The number of attributes has an impact on performance:
  - Poor scalability
  - Inability to cope with irrelevant/noisy/redundant attributes
- Methodologies to reduce the number of attributes:
  - Dimensionality reduction: transforming to a space with fewer dimensions
  - Feature subset selection: eliminating non-relevant attributes
Dimensionality reduction
- A new dataset is built that preserves the information of the original data but with fewer attributes
- Many techniques have been developed for this purpose:
  - Projection to a space that preserves the statistical distribution of the data (PCA, ICA)
  - Projection to a space that preserves distances among the data (multidimensional scaling, random projection, nonlinear scaling)
Component analysis
- Principal Component Analysis: data is projected onto a set of orthogonal dimensions (components) that are linear combinations of the original attributes
- The components are uncorrelated and are ordered by the information they carry
- We assume the data follows a Gaussian distribution
- Global variance is preserved
Principal Component Analysis
- Computes a projection matrix where the dimensions are orthogonal (linearly independent) and data variance is preserved
[Figure: 2D data with the two principal directions w1*y + w2*x and w3*y + w4*x]
Principal Component Analysis
- Principal components: vectors that give the best linear approximation of the data:

  f(lambda) = mu + V_q * lambda

  where mu is a location vector in R^p, V_q is a p x q matrix of q orthogonal unit vectors, and lambda is a q-vector of parameters
- The reconstruction error for the data is minimized:

  min_{mu, {lambda_i}, V_q} sum_{i=1}^{N} ||x_i - mu - V_q lambda_i||^2
Principal Component Analysis - Computation
- Optimizing partially for mu and lambda_i:

  mu = x_bar
  lambda_i = V_q^T (x_i - x_bar)

- We can obtain the matrix V_q by minimizing:

  min_{V_q} sum_{i=1}^{N} ||(x_i - x_bar) - V_q V_q^T (x_i - x_bar)||^2_2
Principal Component Analysis - Computation
- Assuming x_bar = 0, we can obtain the projection matrix H_q = V_q V_q^T by singular value decomposition of the data matrix X:

  X = U D V^T

- U is an N x p orthogonal matrix; its columns are the left singular vectors
- D is a p x p diagonal matrix whose ordered diagonal values are the singular values
- The columns of V (p x p, orthogonal) are the right singular vectors
Principal Component Analysis
- The columns of U D are the principal components
- The solution to the minimization problem is given by the first q principal components
- The singular values are proportional to the reconstruction error
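The SVD-based computation above can be sketched directly in NumPy (the data matrix is made up, with three attributes of very different variance):

```python
import numpy as np

# Made-up data: three attributes with very different variances
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([3.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)    # center the data (mu = sample mean)
U, D, Vt = np.linalg.svd(Xc, full_matrices=False)

q = 2
components = Vt[:q]          # first q rows of V^T: the principal directions
scores = Xc @ components.T   # projections of the data; equal to U[:, :q] * D[:q]
```

Keeping the first q columns of U D (equivalently, projecting onto the first q rows of V^T) gives the q-dimensional representation with minimal reconstruction error.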
Kernel PCA
- PCA is a linear transformation; this means that if the data is linearly separable, the reduced dataset will be linearly separable (given enough components)
- We can use the kernel trick to map the original attributes to a space where non-linearly-separable data becomes linearly separable
- Distances among examples are defined as a dot product that can be obtained using a kernel:

  d(x_i, x_j) = Phi(x_i)^T Phi(x_j) = K(x_i, x_j)
Kernel PCA
- Different kernels can be used to perform the transformation to the feature space (polynomial, Gaussian, ...)
- The computation of the components is equivalent to PCA, but performing the eigendecomposition of the covariance matrix computed for the transformed examples:

  C = (1/M) sum_{j=1}^{M} Phi(x_j) Phi(x_j)^T
Kernel PCA
- Pro: helps to discover patterns that are not linearly separable in the original space
- Con: does not give a weight/importance for the new components
Kernel PCA
[Figure: Kernel PCA example]
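A sketch of kernel PCA on the classic concentric-circles example (the dataset and the gamma value are chosen by hand for illustration, using scikit-learn's KernelPCA):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original 2D space
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.repeat([1.0, 3.0], 100)          # inner and outer radius
X = np.c_[r * np.cos(theta), r * np.sin(theta)]

# RBF (Gaussian) kernel PCA; gamma tuned by hand for this toy example
Z = KernelPCA(n_components=2, kernel='rbf', gamma=1.0).fit_transform(X)
```

In the transformed space the two circles typically become separable along the leading components, which a linear PCA of the raw coordinates cannot achieve.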
Sparse PCA
- PCA transforms the data into a space of the same dimensionality (all eigenvalues are nonzero)
- An alternative is to solve the minimization problem posed by the reconstruction error using regularization
- A penalization term proportional to the norm of the components matrix is added to the objective function:

  min_{U,V} ||X - U V||^2_2 + alpha * ||V||_1

- The l1-norm regularization encourages sparse solutions (zero loadings in the components)
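A minimal sketch with scikit-learn's SparsePCA, which solves this l1-regularized formulation (the data and the alpha value are made up):

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))   # made-up data

# alpha weights the l1 penalty: larger values give sparser components
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
Z = spca.fit_transform(X)
```

Inspecting `spca.components_` shows that, unlike plain PCA, many loadings are driven exactly to zero by the penalty.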
Multidimensional Scaling
- A transformation matrix (M x N) transforms a dataset from M dimensions to N dimensions preserving pairwise data distances
Multidimensional Scaling
- Multidimensional scaling projects the data to a space with fewer dimensions, preserving the pairwise distances among the data
- A projection matrix is obtained by optimizing a function of the pairwise distances (stress function)
- The actual attributes are not used in the transformation
- Different objective functions can be used (least squares, Sammon mapping, classical scaling, ...)
Multidimensional Scaling
- Least squares multidimensional scaling (MDS)
- The distortion is defined as the squared difference between the original distances and the distances of the new data:

  S_D(z_1, z_2, ..., z_n) = sum_{i != i'} (d_ii' - ||z_i - z_i'||)^2

- The problem is defined as:

  argmin_{z_1, z_2, ..., z_n} S_D(z_1, z_2, ..., z_n)
Multidimensional Scaling
- Several optimization strategies can be used
- If the distance matrix is Euclidean, it can be solved using eigendecomposition, just like PCA
- In other cases gradient descent can be used, with the derivative of S_D(z_1, z_2, ..., z_n) and a step alpha, in the following fashion:
  1. Begin with a guess for Z
  2. Repeat until convergence:
     Z^(k+1) = Z^(k) - alpha * grad S_D(Z)
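The gradient-descent loop above can be sketched directly in NumPy (a toy implementation, not the course's code; step size and iteration count are chosen by hand):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mds_gd(D, dim=2, alpha=5e-3, iters=2000, seed=0):
    """Gradient descent on the least-squares stress S_D (illustrative sketch)."""
    n = D.shape[0]
    Z = np.random.default_rng(seed).normal(size=(n, dim))  # initial guess for Z
    for _ in range(iters):
        diff = Z[:, None, :] - Z[None, :, :]   # pairwise differences z_i - z_i'
        dz = np.linalg.norm(diff, axis=-1)     # ||z_i - z_i'||
        np.fill_diagonal(dz, 1.0)              # avoid division by zero on the diagonal
        # gradient of sum_{i,i'} (d_ii' - ||z_i - z_i'||)^2 with respect to each z_i
        grad = (-4 * ((D - dz) / dz)[:, :, None] * diff).sum(axis=1)
        Z -= alpha * grad                      # Z(k+1) = Z(k) - alpha * grad S_D(Z)
    return Z

# Recover a 2D configuration from its own distance matrix
X = np.random.default_rng(1).normal(size=(20, 2))
D = squareform(pdist(X))
Z = mds_gd(D)
```

After optimization the pairwise distances of Z approximate D much better than those of the random initial configuration.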
Multidimensional Scaling - Other functions
- Sammon mapping (emphasis on smaller distances):

  S_D(z_1, ..., z_n) = sum_{i != i'} (d_ii' - ||z_i - z_i'||)^2 / d_ii'

- Classical scaling (similarity instead of distance):

  S_D(z_1, ..., z_n) = sum_{i != i'} (s_ii' - <z_i - z_bar, z_i' - z_bar>)^2

- Nonmetric MDS (assumes a ranking among the distances, non-Euclidean space):

  S_D(z_1, ..., z_n) = sum_{i,i'} [theta(||z_i - z_i'||) - d_ii']^2 / sum_{i,i'} d_ii'^2
Random Projection
- A random transformation matrix is generated:
  - Rectangular matrix N x d
  - Columns must have unit length
  - Elements are generated from a Gaussian distribution
- A matrix generated this way is almost orthogonal
- The projection will preserve the relative distances among examples
- The effectiveness usually depends on the number of dimensions
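A sketch of distance preservation under a Gaussian random projection, using scikit-learn's GaussianRandomProjection on made-up high-dimensional data:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))   # made-up high-dimensional data

# Project from 1000 dimensions to 50 with a Gaussian random matrix
Z = GaussianRandomProjection(n_components=50, random_state=0).fit_transform(X)

# Ratio of projected to original pairwise distances: concentrated around 1
ratio = pdist(Z) / pdist(X)
```

The spread of the ratio shrinks as the number of target dimensions grows, which is the sense in which effectiveness depends on the dimensionality.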
Nonnegative Matrix Factorization (NMF)
- This formulation assumes that the data is a sum of unknown positive latent variables
- NMF performs an approximation of a matrix as the product of two matrices:

  V = W H

- The main difference with PCA is that the values of the matrices are constrained to be positive
- The positiveness assumption helps to interpret the result. E.g., in text mining, a document is an aggregation of topics
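A minimal sketch of the factorization V ≈ W H with scikit-learn's NMF (the nonnegative data matrix is made up):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((20, 10))   # made-up nonnegative data

model = NMF(n_components=4, init='random', random_state=0, max_iter=500)
W = model.fit_transform(V)   # examples expressed as mixtures of 4 latent factors
H = model.components_        # the latent factors themselves
```

Both factors are elementwise nonnegative, which is what makes them readable as additive parts (e.g. topic proportions in text mining).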
Nonlinear scaling
- The previous methods perform a linear transformation between the original space and the final space
- For some datasets this kind of transformation is not enough to maintain the information of the original data
- Nonlinear transformation methods:
  - ISOMAP
  - Local Linear Embedding
  - Local MDS
ISOMAP
- Assumes a low dimensional dataset embedded in a larger number of dimensions
- The geodesic distance is used instead of the Euclidean distance
- The relation of an instance with its immediate neighbors is more representative of the structure of the data
- The transformation generates a new space that preserves neighborhood relationships
ISOMAP
[Figure: Euclidean vs. geodesic distance between two points a and b]
ISOMAP - Algorithm
1. For each data point find its k closest neighbors (points at minimal Euclidean distance)
2. Build a graph where each point has an edge to its closest neighbors
3. Approximate the geodesic distance for each pair of points by the shortest path in the graph
4. Apply an MDS algorithm to the distance matrix of the graph
ISOMAP - Example
[Figure: original data and its transformation; points a and b]
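The four steps above are implemented in scikit-learn's Isomap; a sketch on a made-up 1D curve (a spiral) embedded in 2D:

```python
import numpy as np
from sklearn.manifold import Isomap

# A 1D curve (spiral) embedded in 2D; its intrinsic coordinate is t
t = np.linspace(0, 3 * np.pi, 200)
X = np.c_[t * np.cos(t), t * np.sin(t)]

# A small k keeps the graph edges on the curve, so shortest paths
# approximate geodesic (along-the-curve) distances
Z = Isomap(n_neighbors=5, n_components=1).fit_transform(X)
```

The 1D embedding essentially recovers the intrinsic coordinate t (up to sign and scale), which a linear projection of the spiral cannot do.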
Local Linear Embedding
- Performs a transformation that preserves local structure
- Assumes that each instance can be reconstructed by a linear combination of its neighbors (weights)
- From these weights, a new set of data points that preserves the reconstruction is computed in a lower number of dimensions
- Different variants of the algorithm exist
Local Linear Embedding - Algorithm
1. For each data point find the K nearest neighbors in the original space of dimension p, N(i)
2. Approximate each point by a mixture of its neighbors:

   min_{W} ||x_i - sum_{k in N(i)} w_ik x_k||^2

   where w_ik = 0 if k not in N(i), sum_k w_ik = 1, and K < p
3. Find points y_i in a space of dimension d < p that minimize:

   sum_{i=1}^{N} ||y_i - sum_{k=1}^{N} w_ik y_k||^2
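This algorithm is available as scikit-learn's LocallyLinearEmbedding; a sketch on the same kind of made-up spiral used for ISOMAP:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# A 1D curve embedded in 2D (made-up toy data)
t = np.linspace(0, 3 * np.pi, 200)
X = np.c_[t * np.cos(t), t * np.sin(t)]

# Reconstruct each point from 10 neighbors, then embed in 1D
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=1)
Z = lle.fit_transform(X)
```

The `method` parameter of this class selects among the variants mentioned above (standard, modified, Hessian, LTSA).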
Local MDS
- Performs a transformation that preserves the locality of closer points and puts non-neighbor points farther away
- Given a set of pairs of points N, where a pair (i, i') belongs to the set if i is among the K neighbors of i' or vice versa, minimize the function:

  S_L(z_1, ..., z_N) = sum_{(i,i') in N} (d_ii' - ||z_i - z_i'||)^2 - tau * sum_{(i,i') not in N} ||z_i - z_i'||

- The parameter tau controls how much the non-neighbors are scattered
t-SNE
- t-Stochastic Neighbor Embedding (t-SNE)
- Used as a visualization tool
- Assumes distances define a probability distribution
- Obtains a low dimensional space with the closest distribution (in terms of Kullback-Leibler divergence)
- Tricky to use (see this link)
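A sketch of t-SNE as a visualization tool, using scikit-learn's TSNE on two made-up well-separated clusters (perplexity is the parameter that usually needs the most care):

```python
import numpy as np
from sklearn.manifold import TSNE

# Two well-separated Gaussian clusters in 10 dimensions (made-up data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 10)),
               rng.normal(8.0, 1.0, (50, 10))])

# Embed in 2D for visualization; perplexity must be smaller than n_samples
Z = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```

The two clusters remain clearly separated in the 2D map; note that t-SNE distances between clusters are not meaningful in general, which is part of why it is tricky to use.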
Wheelchair control characterization
- Wheelchair with shared control (patient/computer)
- Recorded trajectories of several patients in different situations
- Angle/distance to the goal, angle/distance to the nearest obstacle around the chair (210 degrees)
- Characterization of how the computer helps patients with different handicaps
- Is there any structure in the trajectory data?
[Figures: wheelchair control data projected with PCA, SparsePCA, MDS, and ISOMAP (Kn=3 and Kn=10)]
Unsupervised Attribute Selection
- The goal is to eliminate from the dataset all the redundant or irrelevant attributes
- The original attributes are preserved
- Less developed than supervised attribute selection
- Problem: an attribute can be relevant or not depending on the goal of the discovery process
- There are mainly two techniques for attribute selection: wrapping and filtering
Attribute selection - Wrappers
- A model evaluates the relevance of subsets of attributes
- In supervised learning this is easy; in unsupervised learning it is very difficult
- Results depend on the chosen model and on how well this model captures the actual structure of the data
Attribute selection - Wrapper Methods
- Clustering algorithms that compute weights for the attributes based on probability distributions
- Clustering algorithms with an objective function that penalizes the size of the model
- Consensus clustering
Attribute selection - Filters
- A measure evaluates the relevance of each attribute individually
- This kind of measure is difficult to obtain for unsupervised tasks
- The idea is to obtain a measure that evaluates the capacity of each attribute to reveal the structure of the data (class separability, similarity of instances in the same class)
Attribute selection - Filter Methods
- Measures of properties of the spatial structure of the data (entropy, PCA, Laplacian matrix)
- Measures of the relevance of the attributes with respect to the inherent structure of the data
- Measures of attribute correlation
Laplacian Score
- The Laplacian Score is a filter method that ranks features with respect to their ability to preserve the natural structure of the data
- This method uses the spectral matrix of the graph computed from the nearest neighbors of the examples
Laplacian Score
- Similarity is usually computed using a Gaussian kernel; edges not present in the neighborhood graph have a value of 0:

  S_ij = exp(-||x_i - x_j||^2 / sigma)

- The Laplacian matrix is computed from the similarity matrix S and the degree matrix D as L = D - S
Laplacian Score
- The score first computes, for each attribute r with values f_r, the transformation:

  f~_r = f_r - ((f_r^T D 1) / (1^T D 1)) * 1

- Then the score L_r is computed as:

  L_r = (f~_r^T L f~_r) / (f~_r^T D f~_r)

- This gives a ranking for the relevance of the attributes
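The two formulas above translate almost directly into NumPy; a sketch (an illustrative implementation, with made-up demo data where feature 0 carries the cluster structure and feature 1 is pure noise):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_score(X, k=5, sigma=1.0):
    """Laplacian Score per feature: lower = better preserves local structure."""
    n = X.shape[0]
    # k-NN graph with Gaussian-kernel weights; absent edges get weight 0
    W = kneighbors_graph(X, k, mode='distance').toarray()
    W = np.where(W > 0, np.exp(-W**2 / sigma), 0.0)
    W = np.maximum(W, W.T)                  # symmetrize the graph
    D = np.diag(W.sum(axis=1))              # degree matrix
    L = D - W                               # graph Laplacian
    ones = np.ones(n)
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f = f - (f @ D @ ones) / (ones @ D @ ones)   # the transformation f~_r
        scores.append((f @ L @ f) / (f @ D @ f))     # the score L_r
    return np.array(scores)

# Feature 0 separates two clusters; feature 1 is pure noise (made-up data)
rng = np.random.default_rng(0)
X = np.c_[np.r_[rng.normal(0, 0.1, 50), rng.normal(5, 0.1, 50)],
          rng.normal(0, 1, 100)]
scores = laplacian_score(X)
```

The structured feature gets a much lower (better) score than the noise feature, which is exactly the ranking the filter is meant to produce.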
Python Notebooks
- These two Python notebooks show some examples of dimensionality reduction and feature selection:
  - Dimensionality reduction and feature selection notebook (click here to go to the URL)
  - Linear and nonlinear dimensionality reduction notebook (click here to go to the URL)
- If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open them)
Similarity functions
Similarity/Distance functions
- Unsupervised algorithms use similarity/distance to compare data
- This comparison is obtained using functions of the attributes of the data
- We assume that the data is embedded in an N-dimensional space where a similarity/distance can be defined
- There are domains where this assumption is not true, so other kinds of functions will be needed to represent data relationships
Properties of Similarity/Distance functions
- The properties of a similarity function are:
  1. s(p, q) = 1 <=> p = q
  2. For all p, q: s(p, q) = s(q, p)
- The properties of a distance function are:
  1. For all p, q: d(p, q) >= 0, and d(p, q) = 0 <=> p = q
  2. For all p, q: d(p, q) = d(q, p)
  3. For all p, q, r: d(p, r) <= d(p, q) + d(q, r)
Distance functions
- Minkowski metrics (Manhattan, Euclidean):

  d(i, k) = (sum_{j=1}^{d} |x_ij - x_kj|^r)^(1/r)

- Mahalanobis distance:

  d(i, k) = (x_i - x_k)^T Phi^-1 (x_i - x_k)

  where Phi is the covariance matrix of the attributes
Distance functions
- Chebyshev distance:

  d(i, k) = max_j |x_ij - x_kj|

- Canberra distance:

  d(i, k) = sum_{j=1}^{d} |x_ij - x_kj| / (|x_ij| + |x_kj|)
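All of these distances are available in scipy.spatial.distance; a quick sketch on two made-up vectors (note that y = 2x, so the cosine similarity of the next slide is 1 and the cosine distance is 0):

```python
import numpy as np
from scipy.spatial.distance import (cityblock, euclidean, chebyshev,
                                    canberra, cosine)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # y = 2x: parallel to x

d_manhattan = cityblock(x, y)   # Minkowski with r = 1: |1| + |2| + |3| = 6
d_euclidean = euclidean(x, y)   # Minkowski with r = 2: sqrt(1 + 4 + 9)
d_chebyshev = chebyshev(x, y)   # max coordinate difference: 3
d_canberra = canberra(x, y)     # 1/3 + 2/6 + 3/9 = 1
d_cosine = cosine(x, y)         # cosine *distance* = 1 - cosine similarity
```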
Similarity functions
- Cosine similarity:

  s(i, k) = x_i^T x_k / (||x_i|| ||x_k||)

- Pearson correlation measure:

  s(i, k) = (x_i - x_bar_i)^T (x_k - x_bar_k) / (||x_i - x_bar_i|| ||x_k - x_bar_k||)
Similarity functions (binary attributes)
- Coincidence coefficient:

  s(i, k) = (a_00 + a_11) / d

- Jaccard coefficient:

  s(i, k) = a_11 / (a_01 + a_10 + a_11)

  where a_uv counts the attributes with value u in example i and value v in example k
Python Notebooks
- This Python notebook shows examples of using different distance functions:
  - Distance functions notebook (click here to go to the URL)
- If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open them)
Python Code
- In the code from the repository, inside the subdirectory DimReduction, you have the Authors python script
- The code uses the datasets in the directory Data/authors:
  - Auth1 has fragments of books that are novels or philosophy works
  - Auth2 has fragments of books written in English and books translated to English
- The code transforms the text to attribute vectors and applies different dimensionality reduction algorithms
- Modifying the code you can process one of the datasets and choose how the text is transformed into vectors