1 Data Preprocessing. Javier Béjar. CS - MAI, Spring 2018.

2 Introduction

3 Data representation. Unstructured datasets: examples described by a flat set of attributes (an attribute-value matrix). Structured datasets: individual examples described by attributes but with relations among them: sequences (time, spatial, ...), trees, graphs; or sets of structured examples (sequences, graphs, trees).

4 Unstructured data. Only one table of observations. Each example represents an instance of the problem, and each instance is represented by a set of attributes (discrete, continuous). (Figure: a small attribute-value table with attributes A, B, C.)

5 Structured data. One sequential relation among instances (time, strings): several instances with internal structure, subsequences of unstructured instances, or one large instance. Several relations among instances (graphs, trees): several instances with internal structure, or one large instance.

6 Data Streams. An endless sequence of data, possibly several synchronized streams; unstructured or structured instances; static or dynamic model.

7 Data representation. Most unsupervised learning algorithms are specifically fitted for unstructured data; the data representation is equivalent to a database table (attribute-value pairs). Specialized algorithms have been developed for structured data: graph clustering, sequence mining, frequent substructures. The representation of these types of data is sometimes algorithm dependent.

8 Data Preprocessing

9 Data preprocessing. Usually raw data is not directly adequate for analysis. The usual reasons: the quality of the data (noise/missing values/outliers) and the dimensionality of the data (too many attributes/too many examples). The first step of any data task is to assess the quality of the data. The techniques used for data preprocessing are usually oriented to unstructured data.

10 Outliers. Outliers are examples with extreme values compared to the rest of the data. They can be considered as examples with erroneous values and have an important impact on some algorithms.

11 Outliers. The exceptional values could appear in all or only a few attributes. The usual way to correct this problem is to eliminate the examples; if the exceptional values are only in a few attributes, these could be treated as missing values.

12 Parametric Outlier Detection. Assumes a probabilistic distribution for the attributes. Univariate: perform a Z-test or Student's t-test. Multivariate: deviation method (reduction in data variance when the example is eliminated), angle based (variance of the angles to other examples), distance based (variation of the distance from the mean of the data in different dimensions).

13 Non-parametric Outlier Detection. Histogram based: define a multidimensional grid and discard cells with low density. Distance based: the distances of outliers to their k nearest neighbors are larger. Density based: approximate the data density using Kernel Density Estimation or heuristic measures (Local Outlier Factor, LOF).

14 Outliers: Local Outlier Factor. LOF quantifies the outlierness of an example adjusting for variation in data density. It uses the distance to the k-th neighbor of an example, $D_k(x)$, and the set of examples that are inside this distance, $L_k(x)$. The reachability distance between two data points is defined as the maximum of the distance $dist(x, y)$ and $y$'s k-th neighbor distance: $R_k(x, y) = \max(dist(x, y), D_k(y))$.

15 Outliers: Local Outlier Factor. The average reachability distance $AR_k(x)$ of an example with respect to its neighborhood $L_k(x)$ is defined as the average of the reachabilities of the example to its neighbors. The LOF of an example is computed as the mean ratio between $AR_k(x)$ and the average reachability of its k neighbors: $LOF_k(x) = \frac{1}{k} \sum_{y \in L_k(x)} \frac{AR_k(x)}{AR_k(y)}$. This value ranks all the examples.
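
A minimal sketch of ranking examples by LOF using scikit-learn's LocalOutlierFactor (this example and its parameter values are not part of the slides; n_neighbors plays the role of k):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy data: a dense cluster plus a few far-away points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(8, 0.5, size=(5, 2))])

# n_neighbors plays the role of k in the LOF definition
lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)             # -1 marks detected outliers
scores = -lof.negative_outlier_factor_  # larger score = more outlying

# Rank all examples by their LOF score
ranking = np.argsort(-scores)
print(ranking[:5], scores[ranking[:5]])
```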

16 Outliers: Local Outlier Factor (illustrative figure).

17 Missing values. Missing values appear because of errors or omissions during the gathering of the data. They can be substituted to increase the quality of the dataset (value imputation): a global constant for all the values; the mean or mode of the attribute (global central tendency); the mean or mode of the attribute computed only over the k nearest examples (local central tendency); or learning a model for the data (regression, Bayesian) and using it to predict the values. Problem: the statistical distribution of the data is changed.
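
A small sketch of the global and local central-tendency strategies using scikit-learn's imputers (the choice of SimpleImputer/KNNImputer and of k = 2 is an illustrative assumption, not part of the slides):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [8.0, 9.0]])

# Global central tendency: replace each missing value by the attribute mean
mean_imp = SimpleImputer(strategy="mean")
X_mean = mean_imp.fit_transform(X)

# Local central tendency: mean of the k nearest examples (here k = 2)
knn_imp = KNNImputer(n_neighbors=2)
X_knn = knn_imp.fit_transform(X)

print(X_mean)
print(X_knn)
```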

18 Missing values (figures: data with missing values, mean substitution, 1-neighbor substitution).

19 Normalization. Normalizations are applied to quantitative attributes in order to eliminate the effect of having different measurement scales. Range normalization: transform all the values of the attribute to a preestablished scale (e.g. [0,1], [-1,1]): $\frac{x - x_{min}}{x_{max} - x_{min}}$. Distribution normalization: transform the data to a specific statistical distribution with preestablished parameters (e.g. a Gaussian $\mathcal{N}(0, 1)$): $\frac{x - \mu_x}{\sigma_x}$.
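
A minimal NumPy sketch of the two normalizations above, applied column-wise to an attribute matrix (the toy data is only for illustration):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Range normalization to [0, 1]: (x - min) / (max - min), per attribute
X_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Distribution (z-score) normalization: (x - mean) / std, per attribute
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_range)
print(X_std)
```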

20 Discretization. Discretization transforms quantitative attributes into qualitative attributes. Equal size bins: pick the number of values and divide the range of the data into equal sized bins. Equal frequency bins: pick the number of values and divide the range of the data so that each bin has the same number of examples (the size of the intervals will be different).
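
A small NumPy sketch of equal-size and equal-frequency binning for a single attribute (the bin count of 4 is an arbitrary illustrative choice):

```python
import numpy as np

x = np.array([1.0, 1.2, 1.5, 2.0, 2.2, 5.0, 7.5, 8.0, 9.0, 20.0])
n_bins = 4

# Equal-size bins: split [min, max] into n_bins equal-width intervals
width_edges = np.linspace(x.min(), x.max(), n_bins + 1)
equal_width = np.digitize(x, width_edges[1:-1])   # labels 0 .. n_bins-1

# Equal-frequency bins: cut points at the quantiles, so each bin holds
# roughly the same number of examples (interval sizes differ)
freq_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
equal_freq = np.digitize(x, freq_edges[1:-1])

print(equal_width)
print(equal_freq)
```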

21 Discretization. Discretization transforms quantitative attributes into qualitative attributes. Distribution approximation: calculate a histogram of the data and fit a kernel density estimate (KDE); the interval boundaries are placed where the function has its minima. Other techniques: entropy based measures, MDL, or even clustering.

22 Discretization (figures: equal size bins, equal frequency bins, histogram/KDE based bins).

23 Python Notebooks. These two Python notebooks show some examples of the effect of missing values imputation, data discretization and normalization: the Missing Values notebook and the Preprocessing notebook (links in the original slides). If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open them).

24 Dimensionality Reduction

25 The curse of dimensionality. Problems due to the dimensionality of the data: the computational cost of processing the data and the quality of the data. Elements that define the dimensionality of the data: the number of examples and the number of attributes. Usually the problem of having too many examples can be solved using sampling.

26 Reducing attributes. The number of attributes has an impact on performance: poor scalability and inability to cope with irrelevant/noisy/redundant attributes. Methodologies to reduce the number of attributes: dimensionality reduction (transforming to a space with fewer dimensions) and feature subset selection (eliminating non-relevant attributes).

27 Dimensionality reduction. A new dataset is built that preserves the information of the original data but with fewer attributes. Many techniques have been developed for this purpose: projection to a space that preserves the statistical distribution of the data (PCA, ICA), and projection to a space that preserves distances among the data (multidimensional scaling, random projection, nonlinear scaling).

28 Component analysis. Principal Component Analysis: data is projected onto a set of orthogonal dimensions (components) that are linear combinations of the original attributes. The components are uncorrelated and are ordered by the information they carry. We assume the data follows a Gaussian distribution. Global variance is preserved.

29 Principal Component Analysis. Computes a projection matrix where the dimensions are orthogonal (linearly independent) and data variance is preserved. (Figure: the new axes are linear combinations of the original X and Y axes, e.g. w1*y + w2*x and w3*y + w4*x.)

30 Principal Component Analysis. Principal components: vectors that are the best linear approximation of the data, $f(\lambda) = \mu + V_q \lambda$, where $\mu$ is a location vector in $\mathbb{R}^p$, $V_q$ is a $p \times q$ matrix of $q$ orthogonal unit vectors and $\lambda$ is a $q$-vector of parameters. The reconstruction error for the data is minimized: $\min_{\mu, \{\lambda_i\}, V_q} \sum_{i=1}^{N} \|x_i - \mu - V_q \lambda_i\|^2$.

31 Principal Component Analysis - Computation. Optimizing partially for $\mu$ and $\lambda_i$ gives $\hat{\mu} = \bar{x}$ and $\hat{\lambda}_i = V_q^T (x_i - \bar{x})$. We can then obtain the matrix $V_q$ by minimizing $\min_{V_q} \sum_{i=1}^{N} \|(x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x})\|^2$.

32 Principal Component Analysis - Computation. Assuming $\bar{x} = 0$, we can obtain the projection matrix $H_q = V_q V_q^T$ from the Singular Value Decomposition of the data matrix $X$: $X = U D V^T$, where $U$ is an $N \times p$ orthogonal matrix whose columns are the left singular vectors, $D$ is a $p \times p$ diagonal matrix with ordered diagonal values called the singular values, and $V$ is a $p \times p$ orthogonal matrix whose columns are the right singular vectors.

33 Principal Component Analysis. The columns of $UD$ are the principal components. The solution to the minimization problem is given by the first $q$ principal components. The singular values are proportional to the reconstruction error.
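
A minimal NumPy sketch of PCA computed through the SVD of the centered data matrix, following the derivation above (the data and q = 2 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # N x p data matrix
q = 2                                    # number of components to keep

Xc = X - X.mean(axis=0)                  # center the data (x_bar = 0)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:q]                      # the first q directions (rows of V_q^T)
scores = Xc @ Vt[:q].T                   # projections: first q columns of UD
explained = d[:q] ** 2 / np.sum(d ** 2)  # fraction of variance per component

print(scores.shape, explained)
```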

34 Kernel PCA. PCA is a linear transformation; this means that if the data is linearly separable, the reduced dataset will be linearly separable (given enough components). We can use the kernel trick to map the original attributes to a space where non linearly separable data becomes linearly separable. Similarities among examples are defined as a dot product that can be obtained using a kernel: $d(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) = K(x_i, x_j)$.

35 Kernel PCA. Different kernels can be used to perform the transformation to the feature space (polynomial, Gaussian, ...). The computation of the components is equivalent to PCA, but performing the eigendecomposition of the covariance matrix computed for the transformed examples: $C = \frac{1}{M} \sum_{j=1}^{M} \Phi(x_j) \Phi(x_j)^T$.

36 Kernel PCA. Pro: helps to discover patterns that are non linearly separable in the original space. Con: does not give a weight/importance for the new components.
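
A short usage sketch of kernel PCA with scikit-learn (the concentric-circles dataset, the RBF kernel and the gamma value are illustrative assumptions, not from the slides):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

# Two concentric circles: not linearly separable in the original space
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

X_lin = PCA(n_components=2).fit_transform(X)   # linear PCA keeps the rings mixed
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
X_kpca = kpca.fit_transform(X)                 # the two rings are pulled apart

print(X_lin.shape, X_kpca.shape)
```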

37 Kernel PCA (illustrative figure).

38 Sparse PCA. PCA transforms data to a space of the same dimensionality (all eigenvalues are non-zero). An alternative is to solve the minimization problem posed by the reconstruction error using regularization: a penalization term proportional to the norm of the components matrix is added to the objective function, $\min_{U,V} \|X - UV\|_2 + \alpha \|V\|_1$. The $\ell_1$-norm regularization encourages sparse solutions (components with zero weights).
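
A short usage sketch of this regularized formulation with scikit-learn's SparsePCA, where alpha weights the $\ell_1$ penalty (the data and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# alpha is the weight of the l1 penalty; larger values give sparser components
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
Z = spca.fit_transform(X)

# Many loadings are driven exactly to zero
print(spca.components_)
print((spca.components_ == 0).mean())
```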

39 Multidimensional Scaling. A transformation matrix (M x N) transforms a dataset from M dimensions to N dimensions, preserving pairwise data distances.

40 Multidimensional Scaling. Multidimensional Scaling projects the data to a space with fewer dimensions, preserving the pairwise distances among the data. A projection is obtained by optimizing a function of the pairwise distances (stress function); the actual attribute values are not used in the transformation. Different objective functions can be used (least squares, Sammon mapping, classical scaling, ...).

41 Multidimensional Scaling. Least Squares Multidimensional Scaling (MDS): the distortion is defined as the squared difference between the original distances and the distances of the new data, $S_D(z_1, z_2, \ldots, z_n) = \sum_{i \neq i'} (d_{ii'} - \|z_i - z_{i'}\|)^2$. The problem is defined as $\arg\min_{z_1, z_2, \ldots, z_n} S_D(z_1, z_2, \ldots, z_n)$.

42 Multidimensional Scaling. Several optimization strategies can be used. If the distance matrix is Euclidean, it can be solved using an eigendecomposition, just like PCA. In other cases gradient descent can be used, with the derivative of $S_D(z_1, z_2, \ldots, z_n)$ and a step $\alpha$, in the following fashion: 1. Begin with a guess for $Z$; 2. Repeat until convergence: $Z^{(k+1)} = Z^{(k)} - \alpha \nabla S_D(Z)$.
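
A minimal NumPy sketch of least-squares MDS by gradient descent on the stress function, following the update rule above (the step size, iteration count and toy data are illustrative assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mds_gd(D, dim=2, alpha=0.001, n_iter=1000, seed=0):
    """Minimize sum_{i != i'} (d_ii' - ||z_i - z_i'||)^2 by gradient descent."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(D.shape[0], dim))        # 1. begin with a guess for Z
    for _ in range(n_iter):                       # 2. repeat until convergence
        diff = Z[:, None, :] - Z[None, :, :]      # pairwise differences z_i - z_i'
        dist = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(dist, 1.0)               # avoid division by zero
        # gradient of the stress with respect to each z_i
        grad = -4.0 * np.sum(((D - dist) / dist)[:, :, None] * diff, axis=1)
        Z = Z - alpha * grad                      # Z^(k+1) = Z^(k) - alpha * grad
    return Z

X = np.random.default_rng(0).normal(size=(50, 5))
D = squareform(pdist(X))                          # original pairwise distances
Z = mds_gd(D)
print(Z.shape)
```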

43 Multidimensional Scaling - Other functions. Sammon mapping (emphasis on smaller distances): $S_D(z_1, \ldots, z_n) = \sum_{i \neq i'} \frac{(d_{ii'} - \|z_i - z_{i'}\|)^2}{d_{ii'}}$. Classical scaling (similarity instead of distance): $S_D(z_1, \ldots, z_n) = \sum_{i, i'} (s_{ii'} - \langle z_i - \bar{z}, z_{i'} - \bar{z} \rangle)^2$. Non-metric MDS (assumes a ranking among the distances, non-Euclidean space): $S_D(z_1, \ldots, z_n) = \frac{\sum_{i, i'} [\theta(\|z_i - z_{i'}\|) - d_{ii'}]^2}{\sum_{i, i'} d_{ii'}^2}$.

44 Random Projection. A random transformation matrix is generated: a rectangular $N \times d$ matrix whose columns have unit length, with elements generated from a Gaussian distribution. A matrix generated this way is almost orthogonal. The projection will preserve the relative distances among examples. The effectiveness usually depends on the number of dimensions.
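
A minimal NumPy sketch of a Gaussian random projection with unit-length columns (the original and target dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))   # 100 examples, 1000 attributes
d_new = 50                          # target number of dimensions

# Random matrix with Gaussian entries, columns rescaled to unit length
R = rng.normal(size=(X.shape[1], d_new))
R /= np.linalg.norm(R, axis=0)

X_proj = X @ R                      # projection approximately preserving
                                    # relative distances among examples
print(X_proj.shape)
```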

45 Nonnegative Matrix Factorization (NMF). This formulation assumes that the data is a sum of unknown positive latent variables. NMF performs an approximation of a matrix as the product of two matrices, $V = W H$. The main difference with PCA is that the values of the matrices are constrained to be positive. The positiveness assumption helps to interpret the result, e.g. in text mining a document is an aggregation of topics.
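
A short usage sketch of NMF with scikit-learn, factoring a nonnegative matrix V into W and H (the rank, initialization and toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((100, 20))            # nonnegative data matrix

nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(V)             # nonnegative example/component weights
H = nmf.components_                  # nonnegative component/attribute weights

print(W.shape, H.shape)
print(np.linalg.norm(V - W @ H))     # reconstruction error
```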

46 Nonlinear scaling. The previous methods perform a linear transformation between the original space and the final space. For some datasets this kind of transformation is not enough to maintain the information of the original data. Nonlinear transformation methods: ISOMAP, Local Linear Embedding, Local MDS.

47 ISOMAP. Assumes a low dimensional dataset embedded in a space with a larger number of dimensions. The geodesic distance is used instead of the Euclidean distance: the relation of an instance with its immediate neighbors is more representative of the structure of the data. The transformation generates a new space that preserves neighborhood relationships.

48 ISOMAP (illustrative figure).

49 ISOMAP (figure: Euclidean vs. geodesic distance between two points a and b).

50 ISOMAP - Algorithm. 1. For each data point find its k closest neighbors (points at minimal Euclidean distance). 2. Build a graph where each point has an edge to its closest neighbors. 3. Approximate the geodesic distance for each pair of points by the shortest path in the graph. 4. Apply an MDS algorithm to the distance matrix of the graph.
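
A short usage sketch of this algorithm with scikit-learn's Isomap, which builds the k-neighbor graph, computes the shortest-path distances and applies the embedding internally (the S-curve dataset and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# A 2D manifold embedded in 3 dimensions
X, _ = make_s_curve(n_samples=1000, random_state=0)

iso = Isomap(n_neighbors=10, n_components=2)   # k-neighbor graph + geodesic MDS
X_iso = iso.fit_transform(X)

print(X_iso.shape)
```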

51 ISOMAP - Example (figure: original space vs. transformed space, points a and b).

52 Local Linear Embedding. Performs a transformation that preserves local structure. Assumes that each instance can be reconstructed by a linear combination of its neighbors (weights). From these weights, a new set of data points that preserves the reconstruction is computed in a lower number of dimensions. Different variants of the algorithm exist.

53 Local Linear Embedding (illustrative figure).

54 Local Linear Embedding - Algorithm. 1. For each data point find the K nearest neighbors in the original space of dimension $p$, $N(i)$. 2. Approximate each point by a mixture of its neighbors: $\min_{W} \|x_i - \sum_{k \in N(i)} w_{ik} x_k\|^2$, where $w_{ik} = 0$ if $k \notin N(i)$, $\sum_{k=1}^{N} w_{ik} = 1$ and $K < p$. 3. Find points $y_i$ in a space of dimension $d < p$ that minimize $\sum_{i=1}^{N} \|y_i - \sum_{k=1}^{N} w_{ik} y_k\|^2$.
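
A short usage sketch of this algorithm with scikit-learn's LocallyLinearEmbedding (the swiss-roll dataset, K and the target dimension are illustrative assumptions):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)

print(X_lle.shape, lle.reconstruction_error_)
```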

55 Local MDS. Performs a transformation that preserves the locality of closer points and puts non-neighbor points farther away. Given a set of pairs of points $\mathcal{N}$, where a pair $(i, i')$ belongs to the set if $i$ is among the K neighbors of $i'$ or vice versa, minimize the function $S_L(z_1, z_2, \ldots, z_N) = \sum_{(i,i') \in \mathcal{N}} (d_{ii'} - \|z_i - z_{i'}\|)^2 - \tau \sum_{(i,i') \notin \mathcal{N}} \|z_i - z_{i'}\|$. The parameter $\tau$ controls how much the non-neighbors are scattered.

56 t-SNE. t-Stochastic Neighbor Embedding (t-SNE) is mostly used as a visualization tool. It assumes that distances define a probability distribution, and it obtains a low dimensional space with the closest distribution (in terms of the Kullback-Leibler divergence). Tricky to use (see the article linked in the original slides).
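
A short usage sketch with scikit-learn's TSNE; results are sensitive to the perplexity parameter, which is one reason the method is tricky to use (the digits dataset and parameter values are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Perplexity roughly controls the size of the neighborhood considered;
# different values can give very different pictures of the same data.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)
```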

57 Wheelchair control characterization. A wheelchair with shared control (patient/computer). Recorded trajectories of several patients in different situations: angle/distance to the goal, angle/distance to the nearest obstacle around the chair (210 degrees). Characterization of how the computer helps patients with different handicaps. Is there any structure in the trajectory data?

58 Wheelchair control characterization (illustrative figure).

59 Wheelchair control characterization (illustrative figure).

60 Wheelchair control (PCA) - projection figure.

61 Wheelchair control (SparsePCA) - projection figure.

62 Wheelchair control (MDS) - projection figure.

63 Wheelchair control (ISOMAP, Kn=3) - projection figure.

64 Wheelchair control (ISOMAP, Kn=10) - projection figure.

65 Unsupervised Attribute Selection. The goal is to eliminate from the dataset all the redundant or irrelevant attributes; the original attributes are preserved. It is less developed than supervised attribute selection. Problem: an attribute can be relevant or not depending on the goal of the discovery process. There are mainly two techniques for attribute selection: wrapping and filtering.

66 Attribute selection - Wrappers. A model evaluates the relevance of subsets of attributes. In supervised learning this is easy; in unsupervised learning it is very difficult. Results depend on the chosen model and on how well this model captures the actual structure of the data.

67 Attribute selection - Wrapper Methods. Clustering algorithms that compute weights for the attributes based on probability distributions; clustering algorithms with an objective function that penalizes the size of the model; consensus clustering.

68 Attribute selection - Filters. A measure evaluates the relevance of each attribute individually. This kind of measure is difficult to obtain for unsupervised tasks. The idea is to obtain a measure that evaluates the capacity of each attribute to reveal the structure of the data (class separability, similarity of instances in the same class).

69 Attribute selection - Filter Methods. Measures of properties of the spatial structure of the data (entropy, PCA, Laplacian matrix); measures of the relevance of the attributes with respect to the inherent structure of the data; measures of attribute correlation.

70 Laplacian Score. The Laplacian Score is a filter method that ranks the features with respect to their ability to preserve the natural structure of the data. This method uses the spectral matrix of the graph computed from the nearest neighbors of the examples.

71 Laplacian Score. Similarity is usually computed using a Gaussian kernel, $S_{ij} = e^{-\frac{\|x_i - x_j\|^2}{\sigma}}$, and all edges not present in the neighborhood graph have a value of 0. The Laplacian matrix is computed from the similarity matrix $S$ and the degree matrix $D$ as $L = D - S$.

72 Laplacian Score. The score first computes, for each attribute $r$ with values $f_r$, the transformation $\tilde{f}_r = f_r - \frac{f_r^T D \mathbf{1}}{\mathbf{1}^T D \mathbf{1}} \mathbf{1}$, and then the score $L_r$ is computed as $L_r = \frac{\tilde{f}_r^T L \tilde{f}_r}{\tilde{f}_r^T D \tilde{f}_r}$. This gives a ranking of the relevance of the attributes.
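
A minimal NumPy sketch of the Laplacian Score as defined above, using Gaussian similarities over a k-nearest-neighbor graph (the choices of k, sigma and the toy data are illustrative assumptions; lower scores indicate more relevant attributes):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_score(X, k=5, sigma=1.0):
    # k-nearest-neighbor graph with Gaussian similarities; absent edges are 0
    dist = kneighbors_graph(X, n_neighbors=k, mode="distance").toarray()
    S = np.where(dist > 0, np.exp(-dist ** 2 / sigma), 0.0)
    S = np.maximum(S, S.T)                  # make the graph symmetric
    D = np.diag(S.sum(axis=1))              # degree matrix
    L = D - S                               # graph Laplacian
    ones = np.ones(X.shape[0])
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f_t = f - (f @ D @ ones) / (ones @ D @ ones) * ones
        scores.append((f_t @ L @ f_t) / (f_t @ D @ f_t))
    return np.array(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
print(laplacian_score(X))
```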

73 Python Notebooks. These two Python notebooks show some examples of dimensionality reduction and feature selection: the Dimensionality reduction and feature selection notebook and the Linear and non linear dimensionality reduction notebook (links in the original slides). If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open them).

74 Similarity functions

75 Similarity/Distance functions. Unsupervised algorithms use similarity/distance to compare data. This comparison is obtained using functions of the attributes of the data. We assume that the data are embedded in an N-dimensional space where a similarity/distance can be defined. There are domains where this assumption is not true, so other kinds of functions will be needed to represent data relationships.

76 Properties of Similarity/Distance functions. The properties of a similarity function are: 1. $s(p, q) = 1 \iff p = q$; 2. $\forall p, q:\; s(p, q) = s(q, p)$. The properties of a distance function are: 1. $\forall p, q:\; d(p, q) \geq 0$ and $d(p, q) = 0 \iff p = q$; 2. $\forall p, q:\; d(p, q) = d(q, p)$; 3. $\forall p, q, r:\; d(p, r) \leq d(p, q) + d(q, r)$.

77 Distance functions. Minkowski metrics (Manhattan, Euclidean): $d(i, k) = \left( \sum_{j=1}^{d} |x_{ij} - x_{kj}|^r \right)^{1/r}$. Mahalanobis distance: $d(i, k) = (x_i - x_k)^T \phi^{-1} (x_i - x_k)$, where $\phi$ is the covariance matrix of the attributes.

78 Distance functions. Chebyshev distance: $d(i, k) = \max_j |x_{ij} - x_{kj}|$. Canberra distance: $d(i, k) = \sum_{j=1}^{d} \frac{|x_{ij} - x_{kj}|}{|x_{ij}| + |x_{kj}|}$.

79 Similarity functions. Cosine similarity: $s(i, k) = \frac{x_i^T x_k}{\|x_i\| \|x_k\|}$. Pearson correlation measure: $s(i, k) = \frac{(x_i - \bar{x}_i)^T (x_k - \bar{x}_k)}{\|x_i - \bar{x}_i\| \|x_k - \bar{x}_k\|}$.
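
A short sketch of the distance and similarity functions above using scipy.spatial.distance (the toy data is an illustrative assumption; note that scipy's cosine and correlation return dissimilarities, i.e. 1 minus the similarity):

```python
import numpy as np
from scipy.spatial.distance import (minkowski, chebyshev, canberra,
                                    mahalanobis, cosine, correlation)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
u, v = X[0], X[1]

print(minkowski(u, v, p=1))        # Manhattan (r = 1)
print(minkowski(u, v, p=2))        # Euclidean (r = 2)
print(chebyshev(u, v))
print(canberra(u, v))

VI = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance of attributes
print(mahalanobis(u, v, VI))

print(1 - cosine(u, v))            # cosine similarity
print(1 - correlation(u, v))       # Pearson correlation
```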

80 Similarity functions. Coincidence (simple matching) coefficient: $s(i, k) = \frac{a_{11} + a_{00}}{d}$. Jaccard coefficient: $s(i, k) = \frac{a_{11}}{a_{11} + a_{01} + a_{10}}$, where $a_{uv}$ counts the attributes with value $u$ in example $i$ and value $v$ in example $k$, and $d$ is the total number of attributes.

81 Python Notebooks. This Python notebook shows examples of using different distance functions: the Distance functions notebook (link in the original slides). If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open them).

82 Python Code. In the code from the repository, inside the subdirectory DimReduction, you have the Authors python script. The code uses the datasets in the directory Data/authors: Auth1 has fragments of books that are novels or philosophy works; Auth2 has fragments of books written in English and books translated to English. The code transforms the text into attribute vectors and applies different dimensionality reduction algorithms. By modifying the code you can process one of the datasets and choose how the text is transformed into vectors.
