
Data Preprocessing
Javier Béjar
URL - Spring 2018 - CS - MAI

Introduction

Data representation

Unstructured datasets: examples described by a flat set of attributes (attribute-value matrix).
Structured datasets: individual examples described by attributes but with relations among them: sequences (time, spatial, ...), trees, graphs.
Sets of structured examples (sequences, graphs, trees).

Unstructured data

Only one table of observations.
Each example represents an instance of the problem.
Each instance is represented by a set of attributes (discrete, continuous).

    A   B     C
    1   3.1   a
    1   5.7   b
    0  -2.2   b
    1  -9.0   c
    0   0.3   d
    1   2.1   a
    ...

Structured data

One sequential relation among instances (time, strings):
  Several instances with internal structure
  Subsequences of unstructured instances
  One large instance
Several relations among instances (graphs, trees):
  Several instances with internal structure
  One large instance

Data Streams

Endless sequence of data
Several streams, synchronized
Unstructured instances
Structured instances
Static/dynamic model

Data representation

Most unsupervised learning algorithms are specifically suited to unstructured data.
The data representation is equivalent to a database table (attribute-value pairs).
Specialized algorithms have been developed for structured data: graph clustering, sequence mining, frequent substructures.
The representation of these types of data is sometimes algorithm dependent.

Data Preprocessing

Data preprocessing

Usually raw data is not directly adequate for analysis. The usual reasons:
  The quality of the data (noise/missing values/outliers)
  The dimensionality of the data (too many attributes/too many examples)
The first step of any data task is to assess the quality of the data.
The techniques used for data preprocessing are usually oriented to unstructured data.

Outliers

Outliers: examples with extreme values compared to the rest of the data.
Can be considered as examples with erroneous values.
Have an important impact on some algorithms.

Outliers

The exceptional values may appear in all attributes or only in a few of them.
The usual way to correct this problem is to eliminate the examples.
If the exceptional values appear only in a few attributes, these can be treated as missing values.

Parametric Outlier Detection

Assumes a probabilistic distribution for the attributes.
Univariate: perform a Z-test or Student's t-test.
Multivariate:
  Deviation method: reduction in data variance when the example is eliminated
  Angle based: variance of the angles to other examples
  Distance based: variation of the distance from the mean of the data in different dimensions

Non-parametric Outlier Detection

Histogram based: define a multidimensional grid and discard cells with low density.
Distance based: the distances of outliers to their k nearest neighbors are larger.
Density based: approximate the data density using kernel density estimation or heuristic measures (Local Outlier Factor, LOF).

Outliers: Local Outlier Factor

LOF quantifies the outlierness of an example, adjusting for variation in data density.
It uses the distance D_k(x) to the k-th neighbor of an example and the set L_k(x) of examples that lie within this distance.
The reachability distance between two data points, R_k(x, y), is defined as the maximum of the distance dist(x, y) and y's k-th neighbor distance D_k(y).

Outliers: Local Outlier Factor

The average reachability distance AR_k(x) with respect to an example's neighborhood L_k(x) is defined as the average of the reachability distances from the example to its neighbors.
The LOF of an example is computed as the mean ratio between AR_k(x) and the average reachability of its k neighbors:

  LOF_k(x) = \frac{1}{k} \sum_{y \in L_k(x)} \frac{AR_k(x)}{AR_k(y)}

This value ranks all the examples.
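As an illustration (not from the course notebooks), a minimal sketch of LOF with scikit-learn's LocalOutlierFactor; the toy 2D data and the choice n_neighbors=20 are assumptions for the example:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # dense cluster
X_out = rng.uniform(low=-6, high=6, size=(5, 2))    # a few scattered points
X_all = np.vstack([X, X_out])

# n_neighbors plays the role of k; fit_predict labels outliers as -1
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X_all)

# negative_outlier_factor_ stores -LOF_k(x); larger LOF = stronger outlier
scores = -lof.negative_outlier_factor_
print(np.argsort(scores)[-5:])                      # indices of the 5 highest LOF values
```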

Outliers: Local Outlier Factor

[Figure: two scatter plots of a 2D dataset illustrating LOF scores]

Missing values

Missing values appear because of errors or omissions during the gathering of the data.
They can be substituted to increase the quality of the dataset (value imputation):
  A global constant for all the values
  The mean or mode of the attribute (global central tendency)
  The mean or mode of the attribute computed over only the k nearest examples (local central tendency)
  Learn a model for the data (regression, Bayesian) and use it to predict the values
Problem: the statistical distribution of the data is changed.
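A minimal sketch (not the course notebook code) of global and local imputation with scikit-learn: SimpleImputer for the attribute mean, KNNImputer for the mean over the k nearest examples. The toy matrix is an assumption:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 3.1], [1.0, 5.7], [0.0, np.nan],
              [1.0, -9.0], [0.0, 0.3], [1.0, np.nan]])

# global central tendency: replace each NaN with the attribute mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# local central tendency: replace each NaN with the mean of the k nearest examples
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_mean[2], X_knn[2])
```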

Missing values

[Figure: data with missing values, mean substitution, and 1-neighbor substitution]

Normalization

Normalizations are applied to quantitative attributes in order to eliminate the effect of having different measurement scales.
Range normalization: transform all the values of the attribute to a pre-established scale (e.g. [0,1], [-1,1]):

  \frac{x - x_{min}}{x_{max} - x_{min}}

Distribution normalization: transform the data to a specific statistical distribution with pre-established parameters (e.g. Gaussian N(0, 1)):

  \frac{x - \mu_x}{\sigma_x}
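Both normalizations are available in scikit-learn; a minimal sketch on assumed toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 500.0]])

# range normalization: (x - x_min) / (x_max - x_min), here to [0, 1]
X_range = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# distribution normalization: (x - mu_x) / sigma_x, i.e. zero mean and unit variance
X_std = StandardScaler().fit_transform(X)
```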

Discretization

Discretization transforms quantitative attributes into qualitative attributes.
Equal size bins: pick the number of values and divide the range of the data into equal sized bins.
Equal frequency bins: pick the number of values and divide the range of the data so that each bin has the same number of examples (the sizes of the intervals will differ).

Discretization

Discretization transforms quantitative attributes into qualitative attributes.
Distribution approximation: compute a histogram of the data and fit a kernel density function (KDE); the intervals are placed where the function has its minima.
Other techniques: apply entropy based measures, MDL, or even clustering.
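A minimal sketch of equal size and equal frequency binning with scikit-learn's KBinsDiscretizer (toy data assumed; KDE or entropy based splits would need extra code):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.random.default_rng(0).normal(size=(200, 1))

# equal size bins: intervals of the same width over the range of the data
eq_width = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
# equal frequency bins: each interval holds (roughly) the same number of examples
eq_freq = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")

X_w = eq_width.fit_transform(X)
X_f = eq_freq.fit_transform(X)
print(np.bincount(X_w.ravel().astype(int)), np.bincount(X_f.ravel().astype(int)))
```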

Discretization

[Figure: discretization with equal size bins, equal frequency bins, and histogram/KDE based intervals]

Python Notebooks

These two Python notebooks show some examples of the effect of missing values imputation, data discretization and normalization:
  Missing Values Notebook (click here to go to the url)
  Preprocessing Notebook (click here to go to the url)
If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open them).

Dimensionality Reduction

The curse of dimensionality

Problems due to the dimensionality of the data:
  The computational cost of processing the data
  The quality of the data
Elements that define the dimensionality of the data:
  The number of examples
  The number of attributes
Usually the problem of having too many examples can be solved using sampling.

Reducing attributes

The number of attributes has an impact on performance:
  Poor scalability
  Inability to cope with irrelevant/noisy/redundant attributes
Methodologies to reduce the number of attributes:
  Dimensionality reduction: transforming to a space with fewer dimensions
  Feature subset selection: eliminating irrelevant attributes

Dimensionality reduction

Goal: a new dataset that preserves the information of the original data but with fewer attributes.
Many techniques have been developed for this purpose:
  Projection to a space that preserves the statistical distribution of the data (PCA, ICA)
  Projection to a space that preserves distances among the data (multidimensional scaling, random projection, nonlinear scaling)

Component analysis

Principal Component Analysis: data is projected onto a set of orthogonal dimensions (components) that are linear combinations of the original attributes.
The components are uncorrelated and are ordered by the amount of information they carry.
We assume the data follows a Gaussian distribution.
Global variance is preserved.

Principal Component Analysis

Computes a projection matrix where the dimensions are orthogonal (linearly independent) and the data variance is preserved.

[Figure: 2D data with original axes X and Y and two orthogonal projection directions, w1*y + w2*x and w3*y + w4*x]

Principal Component Analysis

Principal components: vectors that are the best linear approximation of the data,

  f(\lambda) = \mu + V_q \lambda

where \mu is a location vector in R^p, V_q is a p x q matrix of q orthogonal unit vectors, and \lambda is a q-vector of parameters.
The reconstruction error for the data is minimized:

  \min_{\mu, \{\lambda_i\}, V_q} \sum_{i=1}^{N} \| x_i - \mu - V_q \lambda_i \|^2

Principal Component Analysis - Computation

Optimizing partially for \mu and \lambda_i:

  \mu = \bar{x}
  \lambda_i = V_q^T (x_i - \bar{x})

We can then obtain the matrix V_q by minimizing:

  \min_{V_q} \sum_{i=1}^{N} \| (x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x}) \|_2^2

Principal Component Analysis - Computation

Assuming \bar{x} = 0, we can obtain the projection matrix H_q = V_q V_q^T through the Singular Value Decomposition of the data matrix X:

  X = U D V^T

U is an N x p orthogonal matrix; its columns are the left singular vectors.
D is a p x p diagonal matrix with ordered diagonal values, the singular values.
V is a p x p orthogonal matrix; its columns are the right singular vectors.

Principal Component Analysis

The columns of UD are the principal components.
The solution to the minimization problem is given by the first q principal components.
The singular values are proportional to the reconstruction error.
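As an illustration (not from the course notebooks), a minimal numpy sketch of PCA via the SVD route just described, on assumed random toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy data: 100 examples, 5 attributes

Xc = X - X.mean(axis=0)                # center the data (mu = mean)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # X = U D V^T

q = 2
directions = Vt[:q]                    # rows of V^T = the q projection directions V_q
scores = U[:, :q] * d[:q]              # columns of UD = the principal components
explained = d**2 / np.sum(d**2)        # proportion of variance per component
print(scores.shape, explained)
```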

Kernel PCA

PCA is a linear transformation; this means that if the data is linearly separable, the reduced dataset will be linearly separable (given enough components).
We can use the kernel trick to map the original attributes to a space where non linearly separable data becomes linearly separable.
Distances among examples are defined as a dot product that can be obtained using a kernel:

  d(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) = K(x_i, x_j)

Kernel PCA

Different kernels can be used to perform the transformation to the feature space (polynomial, Gaussian, ...).
The computation of the components is equivalent to PCA, but performing the eigendecomposition of the covariance matrix computed for the transformed examples:

  C = \frac{1}{M} \sum_{j=1}^{M} \Phi(x_j) \Phi(x_j)^T

Kernel PCA

Pro: helps to discover patterns that are not linearly separable in the original space.
Con: does not give a weight/importance for the new components.
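A minimal sketch with scikit-learn's KernelPCA on a dataset that is not linearly separable (two concentric circles); the RBF kernel and gamma=10 are choices made for illustration:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Gaussian (RBF) kernel; gamma is the kernel width
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)     # the two circles become linearly separable here

X_pca = PCA(n_components=2).fit_transform(X)   # linear PCA cannot separate them
```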

Kernel PCA

[Figure: Kernel PCA example]

Sparse PCA

PCA transforms the data to a space of the same dimensionality (all eigenvalues are non zero).
An alternative is to solve the minimization problem posed by the reconstruction error using regularization.
A penalization term proportional to the norm of the components matrix is added to the objective function:

  \min_{U,V} \| X - UV \|_2^2 + \alpha \| V \|_1

The l1-norm regularization encourages sparse solutions (zero-valued components).
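A minimal sketch with scikit-learn's SparsePCA, where alpha is the l1 penalization weight (toy data assumed):

```python
import numpy as np
from sklearn.decomposition import SparsePCA

X = np.random.default_rng(0).normal(size=(100, 10))

spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
X_sp = spca.fit_transform(X)

# many entries of the components are driven exactly to zero by the l1 penalty
print(np.mean(spca.components_ == 0))
```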

Multidimensional Scaling

A transformation matrix transforms a dataset from M dimensions to N dimensions, preserving pairwise data distances.

[Figure: an M x N transformation matrix applied to the dataset]

Multidimensional Scaling

Multidimensional Scaling projects the data to a space with fewer dimensions, preserving the pairwise distances among the data.
A projection is obtained by optimizing a function of the pairwise distances (stress function).
The actual attributes are not used in the transformation.
Different objective functions can be used (least squares, Sammon mapping, classical scaling, ...).

Multidimensional Scaling

Least Squares Multidimensional Scaling (MDS)
The distortion is defined as the squared difference between the original distance matrix and the distance matrix of the new data:

  S_D(z_1, z_2, \ldots, z_n) = \sum_{i \neq i'} (d_{ii'} - \| z_i - z_{i'} \|)^2

The problem is defined as:

  \arg\min_{z_1, z_2, \ldots, z_n} S_D(z_1, z_2, \ldots, z_n)

Multidimensional Scaling

Several optimization strategies can be used.
If the distance matrix is Euclidean, the problem can be solved using eigendecomposition, just like PCA.
In other cases gradient descent can be used, with the gradient of S_D(z_1, z_2, ..., z_n) and a step size \alpha, in the following fashion:
1. Begin with a guess for Z
2. Repeat until convergence:  Z^{(k+1)} = Z^{(k)} - \alpha \nabla S_D(Z)
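A minimal sketch with scikit-learn's MDS, which minimizes a least-squares stress with an iterative procedure (SMACOF) rather than the plain gradient descent shown above; the digits subsample is an assumption for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import MDS

X, _ = load_digits(return_X_y=True)
X = X[:200]                        # subsample to keep the example fast

# metric MDS: fits 2D coordinates whose pairwise distances match the originals
mds = MDS(n_components=2, metric=True, random_state=0)
Z = mds.fit_transform(X)
print(Z.shape, mds.stress_)        # embedded coordinates and final stress value
```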

Multidimensional Scaling - Other functions

Sammon mapping (emphasis on smaller distances):

  S_D(z_1, z_2, \ldots, z_n) = \sum_{i \neq i'} \frac{(d_{ii'} - \| z_i - z_{i'} \|)^2}{d_{ii'}}

Classical scaling (similarity instead of distance):

  S_D(z_1, z_2, \ldots, z_n) = \sum_{i,i'} (s_{ii'} - \langle z_i - \bar{z}, z_{i'} - \bar{z} \rangle)^2

Non-metric MDS (assumes a ranking among the distances, non-Euclidean space):

  S_D(z_1, z_2, \ldots, z_n) = \frac{\sum_{i,i'} [\theta(\| z_i - z_{i'} \|) - d_{ii'}]^2}{\sum_{i,i'} d_{ii'}^2}

Random Projection

A random transformation matrix is generated:
  Rectangular matrix N x d
  Columns must have unit length
  Elements are generated from a Gaussian distribution
A matrix generated this way is almost orthogonal.
The projection will preserve the relative distances among examples.
The effectiveness usually depends on the number of dimensions.
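A minimal sketch with scikit-learn's GaussianRandomProjection (toy high-dimensional data assumed); the distance-ratio check illustrates the approximate preservation of relative distances:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.random_projection import GaussianRandomProjection

X = np.random.default_rng(0).normal(size=(100, 1000))

# project to 50 dimensions with a random (almost orthogonal) Gaussian matrix
grp = GaussianRandomProjection(n_components=50, random_state=0)
X_p = grp.fit_transform(X)

# relative pairwise distances are approximately preserved
ratio = pdist(X_p) / pdist(X)
print(ratio.mean(), ratio.std())
```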

Nonnegative Matrix Factorization (NMF)

This formulation assumes that the data is a sum of unknown positive latent variables.
NMF approximates a matrix as the product of two matrices:

  V = W H

The main difference with PCA is that the values of the matrices are constrained to be positive.
The positiveness assumption helps to interpret the result.
E.g., in text mining, a document is an aggregation of topics.
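A minimal sketch with scikit-learn's NMF on a small term-document style matrix of counts (the toy matrix and the choice of two components are assumptions):

```python
import numpy as np
from sklearn.decomposition import NMF

# rows = documents, columns = term counts (nonnegative by construction)
V = np.array([[2, 0, 1, 0],
              [3, 1, 0, 0],
              [0, 2, 0, 3],
              [0, 1, 1, 4]], dtype=float)

model = NMF(n_components=2, init="nndsvd", random_state=0)
W = model.fit_transform(V)          # document-topic weights (nonnegative)
H = model.components_               # topic-term weights (nonnegative)

print(np.linalg.norm(V - W @ H))    # reconstruction error of V ~ W H
```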

Nonlinear scaling

The previous methods perform a linear transformation between the original space and the final space.
For some datasets this kind of transformation is not enough to maintain the information of the original data.
Nonlinear transformation methods:
  ISOMAP
  Local Linear Embedding
  Local MDS

ISOMAP

Assumes a low dimensional dataset embedded in a space with a larger number of dimensions.
The geodesic distance is used instead of the Euclidean distance.
The relation of an instance with its immediate neighbors is more representative of the structure of the data.
The transformation generates a new space that preserves neighborhood relationships.

ISOMAP

[Figure: ISOMAP example]

ISOMAP

[Figure: Euclidean vs. geodesic distance between two points a and b on a manifold]

ISOMAP - Algorithm

1. For each data point, find its k closest neighbors (points at minimal Euclidean distance).
2. Build a graph where each point has an edge to each of its closest neighbors.
3. Approximate the geodesic distance for each pair of points by the shortest path in the graph.
4. Apply an MDS algorithm to the distance matrix of the graph.
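A minimal sketch with scikit-learn's Isomap, which implements these steps (k-neighbor graph, shortest paths, MDS); the Swiss roll dataset and n_neighbors=10 are assumptions for illustration:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=1000, random_state=0)

# n_neighbors is the k used to build the neighborhood graph
iso = Isomap(n_neighbors=10, n_components=2)
X_iso = iso.fit_transform(X)     # 2D coordinates that unroll the manifold
```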

ISOMAP - Example

[Figure: original data and transformed data, with two points a and b marked in both]

Local Linear Embedding

Performs a transformation that preserves local structure.
Assumes that each instance can be reconstructed as a linear combination of its neighbors (weights).
From these weights, a new set of data points that preserves the reconstruction is computed in a lower number of dimensions.
Different variants of the algorithm exist.

Local Linear Embedding

[Figure: Local Linear Embedding example]

Local Linear Embedding - Algorithm

1. For each data point, find the K nearest neighbors in the original space of dimension p (N(i)).
2. Approximate each point by a mixture of its neighbors:

  \min_{W_{ik}} \left\| x_i - \sum_{k \in N(i)} w_{ik} x_k \right\|^2

  where w_{ik} = 0 if k \notin N(i), \sum_{k=1}^{N} w_{ik} = 1 and K < p.
3. Find points y_i in a space of dimension d < p that minimize:

  \sum_{i=1}^{N} \left\| y_i - \sum_{k=1}^{N} w_{ik} y_k \right\|^2
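A minimal sketch with scikit-learn's LocallyLinearEmbedding, which implements the two steps above (neighborhood weights, then the low dimensional points); the Swiss roll and the parameter values are assumptions:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# K nearest neighbors and target dimension d < p
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="standard")
Y = lle.fit_transform(X)
print(lle.reconstruction_error_)
```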

Local MDS

Performs a transformation that preserves the locality of close points and pushes non-neighbor points farther away.
Given a set N of pairs of points, where a pair (i, i') belongs to the set if i is among the K neighbors of i' or vice versa, minimize the function:

  S_L(z_1, z_2, \ldots, z_N) = \sum_{(i,i') \in N} (d_{ii'} - \| z_i - z_{i'} \|)^2 - \tau \sum_{(i,i') \notin N} \| z_i - z_{i'} \|

The parameter \tau controls how much the non-neighbors are scattered.

t-SNE

t-Stochastic Neighbor Embedding (t-SNE)
Used as a visualization tool.
Assumes that distances define a probability distribution.
Obtains a low dimensional space with the closest distribution (in terms of the Kullback-Leibler divergence).
Tricky to use (see this link).
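A minimal sketch with scikit-learn's TSNE for visualization; the digits dataset and the perplexity value are assumptions, and results are sensitive to these choices:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# 2D embedding for plotting; perplexity and initialization strongly affect the result
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)
```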

Wheelchair control characterization

Wheelchair with shared control (patient/computer).
Recorded trajectories of several patients in different situations.
Attributes: angle/distance to the goal, angle/distance to the nearest obstacle around the chair (210 degrees).
Characterization of how the computer helps patients with different handicaps.
Is there any structure in the trajectory data?

Wheelchair control characterization

[Two figure-only slides illustrating the wheelchair control data]

Wheelchair control (PCA) [Figure]

Wheelchair control (SparsePCA) [Figure]

Wheelchair control (MDS) [Figure]

Wheelchair control (ISOMAP, Kn=3) [Figure]

Wheelchair control (ISOMAP, Kn=10) [Figure]

Unsupervised Attribute Selection

Goal: eliminate from the dataset all the redundant or irrelevant attributes.
The original attributes are preserved.
Less developed than supervised attribute selection.
Problem: an attribute can be relevant or not depending on the goal of the discovery process.
There are mainly two techniques for attribute selection: wrappers and filters.

Attribute selection - Wrappers

A model evaluates the relevance of subsets of attributes.
In supervised learning this is easy; in unsupervised learning it is very difficult.
Results depend on the chosen model and on how well this model captures the actual structure of the data.

Attribute selection - Wrapper Methods

Clustering algorithms that compute weights for the attributes based on probability distributions
Clustering algorithms with an objective function that penalizes the size of the model
Consensus clustering

Attribute selection - Filters

A measure evaluates the relevance of each attribute individually.
This kind of measure is difficult to obtain for unsupervised tasks.
The idea is to obtain a measure that evaluates the capacity of each attribute to reveal the structure of the data (class separability, similarity of instances in the same class).

Attribute selection - Filter Methods

Measures of properties of the spatial structure of the data (entropy, PCA, Laplacian matrix)
Measures of the relevance of the attributes with respect to the inherent structure of the data
Measures of attribute correlation

Laplacian Score

The Laplacian Score is a filter method that ranks the features with respect to their ability to preserve the natural structure of the data.
This method uses the spectral matrix of the graph computed from the nearest neighbors of the examples.

Laplacian Score

Similarity is usually computed using a Gaussian kernel, and all edges that are not present have a value of 0:

  S_{ij} = e^{-\frac{\| x_i - x_j \|^2}{\sigma}}

The Laplacian matrix is computed from the similarity matrix S and the degree matrix D as L = D - S.

Laplacian Score

The score first computes, for each attribute r with values f_r, the transformation \tilde{f}_r:

  \tilde{f}_r = f_r - \frac{f_r^T D \mathbf{1}}{\mathbf{1}^T D \mathbf{1}} \mathbf{1}

and then the score L_r is computed as:

  L_r = \frac{\tilde{f}_r^T L \tilde{f}_r}{\tilde{f}_r^T D \tilde{f}_r}

This gives a ranking for the relevance of the attributes.
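An illustrative numpy sketch of the Laplacian Score as defined above, assuming a k-nearest-neighbor graph with Gaussian kernel weights; this is a sketch under those assumptions, not the original authors' implementation:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_score(X, k=5, sigma=1.0):
    # similarity matrix on the k-nearest-neighbor graph with Gaussian kernel weights
    dist = kneighbors_graph(X, n_neighbors=k, mode="distance").toarray()
    S = np.where(dist > 0, np.exp(-dist**2 / sigma), 0.0)
    S = np.maximum(S, S.T)                  # symmetrize the graph
    D = np.diag(S.sum(axis=1))              # degree matrix
    L = D - S                               # graph Laplacian
    ones = np.ones(X.shape[0])
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f_t = f - (f @ D @ ones) / (ones @ D @ ones) * ones   # transformed feature
        scores.append((f_t @ L @ f_t) / (f_t @ D @ f_t))      # Laplacian score L_r
    return np.array(scores)

X = np.random.default_rng(0).normal(size=(100, 6))
print(laplacian_score(X))   # lower scores = features that better preserve the structure
```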

Python Notebooks

These two Python notebooks show some examples of dimensionality reduction and feature selection:
  Dimensionality reduction and feature selection Notebook (click here to go to the url)
  Linear and non linear dimensionality reduction Notebook (click here to go to the url)
If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open them).

Similarity functions

Similarity/Distance functions

Unsupervised algorithms use similarity/distance to compare data.
This comparison is obtained using functions of the attributes of the data.
We assume that the data are embedded in an N-dimensional space where a similarity/distance can be defined.
There are domains where this assumption does not hold, so other kinds of functions are needed to represent data relationships.

Properties of Similarity/Distance functions

The properties of a similarity function are:
1. s(p, q) = 1 \iff p = q
2. \forall p, q: s(p, q) = s(q, p)

The properties of a distance function are:
1. \forall p, q: d(p, q) \geq 0, and d(p, q) = 0 \iff p = q
2. \forall p, q: d(p, q) = d(q, p)
3. \forall p, q, r: d(p, r) \leq d(p, q) + d(q, r)

Distance functions

Minkowski metrics (Manhattan, Euclidean):

  d(i, k) = \left( \sum_{j=1}^{d} |x_{ij} - x_{kj}|^r \right)^{\frac{1}{r}}

Mahalanobis distance:

  d(i, k) = (x_i - x_k)^T \phi^{-1} (x_i - x_k)

where \phi is the covariance matrix of the attributes.

Distance functions

Chebyshev distance:

  d(i, k) = \max_j |x_{ij} - x_{kj}|

Canberra distance:

  d(i, k) = \sum_{j=1}^{d} \frac{|x_{ij} - x_{kj}|}{|x_{ij}| + |x_{kj}|}

Similarity functions

Cosine similarity:

  s(i, j) = \frac{x_i^T x_j}{\| x_i \| \, \| x_j \|}

Pearson correlation measure:

  s(i, j) = \frac{(x_i - \bar{x}_i)^T (x_j - \bar{x}_j)}{\| x_i - \bar{x}_i \| \, \| x_j - \bar{x}_j \|}

Similarity functions

For binary attributes, a_{uv} denotes the number of attributes where example i takes value u and example k takes value v, and d is the total number of attributes.

Coincidence (simple matching) coefficient:

  s(i, k) = \frac{a_{00} + a_{11}}{d}

Jaccard coefficient:

  s(i, k) = \frac{a_{11}}{a_{11} + a_{01} + a_{10}}
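Most of these functions are available in scipy; a minimal sketch computing pairwise distances/similarities on assumed toy data (note scipy returns distances, so the similarities are obtained as 1 minus the corresponding distance):

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.random.default_rng(0).normal(size=(5, 4))

d_euclid   = cdist(X, X, metric="euclidean")    # Minkowski with r = 2
d_manhat   = cdist(X, X, metric="cityblock")    # Minkowski with r = 1
d_cheby    = cdist(X, X, metric="chebyshev")
d_canberra = cdist(X, X, metric="canberra")
d_mahal    = cdist(X, X, metric="mahalanobis", VI=np.linalg.inv(np.cov(X.T)))
s_cosine   = 1.0 - cdist(X, X, metric="cosine")        # cosine similarity
s_pearson  = 1.0 - cdist(X, X, metric="correlation")   # Pearson correlation

B = np.random.default_rng(1).integers(0, 2, size=(5, 4))   # binary data
s_jaccard  = 1.0 - cdist(B, B, metric="jaccard")            # Jaccard coefficient
```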

Python Notebooks

This Python notebook shows examples of using different distance functions:
  Distance functions Notebook (click here to go to the url)
If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open them).

Python Code

In the code from the repository, inside the subdirectory DimReduction, you have the Authors python script.
The code uses the datasets in the directory Data/authors:
  Auth1 has fragments of books that are novels or philosophy works
  Auth2 has fragments of books written in English and books translated to English
The code transforms the text to attribute vectors and applies different dimensionality reduction algorithms.
By modifying the code you can process one of the datasets and choose how the text is transformed into vectors.