Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Exploratory data analysis tasks: Examine the data in search of structures that may indicate deeper relationships between instances or attributes. A data-driven hypothesis generation process. Key point: how to describe the data properly so as to reveal its characteristics? Data description methods: summarizing the data with some statistics; portraying the distribution information of the data set.
Summarizing data: Location. Mean: the average value of a collection of values. Median: the value that has an equal number of data points above it and below it. Quartile: the n-th quartile is the value that is greater than n quarter(s) of the data points (n = 1, 2, 3). Percentile: the n-th percentile is the value that is greater than n% of the data points (n = 1, ..., 99).
Summarizing data: other properties. Mode (location): the most common value(s). Variance (dispersion): the average of the squared differences between the mean and the individual values. Interquartile range (dispersion): the difference between the 3rd quartile and the 1st quartile. Skewness (shape): whether the distribution has a single long tail.
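As a rough illustration, these summary statistics can be computed with NumPy (a sketch; the sample values below are made up):

    import numpy as np
    from scipy import stats

    x = np.array([2.0, 3.5, 3.5, 4.0, 5.5, 7.0, 21.0])   # hypothetical sample

    mean = x.mean()                          # location: average value
    median = np.median(x)                    # location: middle value
    q1, q3 = np.percentile(x, [25, 75])      # 1st and 3rd quartiles
    p90 = np.percentile(x, 90)               # 90th percentile
    values, counts = np.unique(x, return_counts=True)
    mode = values[counts.argmax()]           # most common value
    variance = x.var(ddof=1)                 # dispersion: sample variance
    iqr = q3 - q1                            # dispersion: interquartile range
    skewness = stats.skew(x)                 # shape: asymmetry of the distribution

    print(mean, median, q1, q3, p90, mode, variance, iqr, skewness)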
Data visualization: a first step toward data mining. Data visualization is an important way of identifying deep relationships. Pros: straightforward; usually interactive; ideal for sifting through data to find unexpected relations. Cons: requires skilled people to read the results in order to find the unexpected relations; might not work well for large data sets, since too many details may obscure the interesting patterns.
Widely used methods for visualization: displaying a single attribute; displaying the relationship between two attributes; displaying the relationships between multiple attributes; displaying the important structure of the data in a reduced number of dimensions.
Tools for displaying a single attribute (I): Histogram. Shows the number of occurrences of each value of a nominal attribute, or the number of values of a numerical attribute that lie in consecutive intervals. Random fluctuations and alternative choices of interval ends may affect the diagram if the data set is small.
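A minimal matplotlib sketch (the data and the number of bins are arbitrary assumptions):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.random.randn(200)                     # hypothetical numerical attribute
    plt.hist(x, bins=10, edgecolor="black")      # counts of values in 10 consecutive intervals
    plt.xlabel("value"); plt.ylabel("count")
    plt.show()
    # With a small sample, changing `bins` or the interval ends can noticeably change the picture.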
Tools for displaying a single attribute (II): Kernel estimates (density estimation). Spread the contribution of each observed data point over the whole range: f_hat(x) = (1/(n h)) * sum_i K((x - x_i)/h), where K(.) is called the kernel function. K(.) is usually a smooth unimodal function with a peak at 0; the most commonly used form is the Gaussian kernel K(t) = (1/sqrt(2*pi)) * exp(-t^2/2). The quality of the estimate depends more on the bandwidth h than on the shape of K.
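A sketch of the one-dimensional Gaussian kernel estimate written out explicitly (the sample and the bandwidth h are assumptions):

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.random.randn(100)                  # observed data points (hypothetical)
    h = 0.4                                      # bandwidth: controls smoothness

    def gaussian_kernel(t):
        return np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)

    grid = np.linspace(data.min() - 1, data.max() + 1, 400)
    # f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h): each point spreads its mass over the range
    f_hat = gaussian_kernel((grid[:, None] - data[None, :]) / h).sum(axis=1) / (len(data) * h)

    plt.plot(grid, f_hat)
    plt.show()

Varying h changes the resulting curve far more than swapping the Gaussian for another smooth unimodal kernel.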
Tools for displaying a single attribute (III): Box plots. Graphically depict groups of numerical data through their five-number summaries. Five-number summary: (Min, Q1, Median, Q3, Max); the box spans Q1 to Q3 with a line at the median, and the whiskers extend up to 1.5*(Q3-Q1) beyond the box. Functionality: identifying possible outliers.
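A box plot sketch with matplotlib (the sample and the injected outliers are hypothetical):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.concatenate([np.random.randn(100), [4.5, -5.0]])  # two artificial outliers
    plt.boxplot(x, whis=1.5)   # whiskers reach up to 1.5*(Q3-Q1); points beyond are drawn as outliers
    plt.show()

    # Five-number summary for comparison: (Min, Q1, Median, Q3, Max)
    print(np.percentile(x, [0, 25, 50, 75, 100]))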
Tools for displaying a pair of attributes (I): Scatterplot. It is essentially a binary (0-1) plot, where each pair of values belonging to the same instance is treated as a 2-d coordinate. Functionality: revealing the correlation between the two variables. Might not be useful for large data sets (especially with long-tailed distributions), since the binary (0-1) representation ignores how often a given coordinate appears.
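A scatterplot sketch (the two correlated variables are simulated for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    n = 500
    x = np.random.randn(n)
    y = 0.8 * x + 0.3 * np.random.randn(n)        # y depends on x plus noise

    # Each instance becomes one 2-d point, no matter how often the same coordinate occurs;
    # translucency (alpha) partly compensates for the 0-1 nature of the plot.
    plt.scatter(x, y, s=8, alpha=0.4)
    plt.xlabel("x"); plt.ylabel("y")
    plt.show()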
Tools for displaying a pair of attributes (II): Contour plot. It plots 2-d density contours with respect to the two variables of interest, where the density is estimated from the observed data points. Functionality: revealing the correlation of the two variables in terms of their joint distribution.
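One possible sketch: estimate the joint density with a 2-d kernel estimate and draw its contours (the use of SciPy's gaussian_kde and the simulated data are assumptions):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    n = 1000
    x = np.random.randn(n)
    y = 0.6 * x + 0.5 * np.random.randn(n)

    kde = gaussian_kde(np.vstack([x, y]))          # 2-d density estimated from the points
    gx, gy = np.meshgrid(np.linspace(x.min(), x.max(), 100),
                         np.linspace(y.min(), y.max(), 100))
    density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)

    plt.contour(gx, gy, density)                   # contour lines of the joint density
    plt.show()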
Tools for displaying a pair of attributes (III): Loess curve. A loess curve is a local regression fitted to the points of a scatterplot. Functionality: providing a better perception of the pattern of dependence.
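A sketch in which statsmodels' lowess smoother stands in for the local regression (a tooling assumption, with simulated data):

    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.nonparametric.smoothers_lowess import lowess

    n = 300
    x = np.random.uniform(0, 10, n)
    y = np.sin(x) + 0.3 * np.random.randn(n)

    smoothed = lowess(y, x, frac=0.3)        # local regression; frac = fraction of points per local fit
    plt.scatter(x, y, s=8, alpha=0.4)
    plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")   # fitted curve over the scatterplot
    plt.show()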
Tools for displaying multiple attributes (I): Scatterplot matrix (pseudo-multivariate). Aligns scatterplots for every pair of attributes. Functionality: revealing the correlation of any two attributes. A pseudo-multivariate tool, since it is merely a collection of bivariate views.
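A scatterplot-matrix sketch using pandas (the data frame is simulated):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    df = pd.DataFrame(np.random.randn(200, 4), columns=["a", "b", "c", "d"])
    df["b"] += df["a"]                        # make one pair of attributes correlated

    scatter_matrix(df, figsize=(6, 6))        # one scatterplot for every pair of attributes
    plt.show()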
Tools for displaying multiple attributes (II): Trellis plot. Fixing a particular pair of attributes to be displayed produces a series of scatterplots conditioned on the levels of one or more other attributes. Functionality: revealing the correlation of any two attributes while taking the values of other attributes into account. Considers the multivariate structure to some extent. Any type of graph can be used besides scatterplots.
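A trellis-style sketch: one fixed pair of attributes plotted separately for each level of a conditioning attribute (seaborn's FacetGrid is only one possible tool, and the grouping variable is made up):

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    n = 300
    df = pd.DataFrame({
        "x": np.random.randn(n),
        "group": np.random.choice(["low", "mid", "high"], n),   # conditioning attribute
    })
    slope = df["group"].map({"low": 0.2, "mid": 1.0, "high": 2.0})
    df["y"] = slope * df["x"] + 0.3 * np.random.randn(n)

    g = sns.FacetGrid(df, col="group")        # one panel per level of the conditioning attribute
    g.map(plt.scatter, "x", "y", s=8)
    plt.show()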
Tools for displaying multiple attributes (III): Icon plots. Each instance is represented as a multidimensional symbol (icon). Functionality: providing a view of individual instances that reveals their multivariate characteristics. Cons: more difficult to read; fails to apply to large data sets; reveals individual characteristics instead of global distributional information.
Tools for displaying multiple attributes (IV): Parallel coordinates plot. Each instance is represented as a piecewise linear plot connecting the measured values of that instance. Functionality: providing a view of individual instances that reveals their multivariate characteristics. Cons: more difficult to read; fails to apply to large data sets; reveals individual characteristics instead of global distributional information.
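A parallel-coordinates sketch using pandas (the Iris data set is just a convenient stand-in):

    import matplotlib.pyplot as plt
    from pandas.plotting import parallel_coordinates
    from sklearn.datasets import load_iris

    iris = load_iris(as_frame=True)
    df = iris.frame.copy()
    df["target"] = iris.target_names[iris.target]   # class column used only to color the lines

    parallel_coordinates(df, "target")              # one piecewise-linear line per instance
    plt.show()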
Discovering important structures from data. Problems with multivariate plots: difficult to read; fail to capture the global distributional information; fail to scale to large data sets; too many details are represented, which obscure the interesting patterns in the data. That's it: find the interesting structures in the data while eliminating unnecessary details.
Dimensionality reduction. Reduce the number of dimensions (attributes) of the data while preserving its intrinsic characteristics. Try to identify some hidden aspects that determine the characteristics of the data. Linear methods: principal component analysis (PCA), linear discriminant analysis (LDA), factor analysis. Non-linear methods: KPCA, KLDA, multidimensional scaling, manifold learning methods.
Principal component analysis (PCA). What is the interesting structure of the data? What hidden aspect controls the data? Principal component: the spreading tendency of the data.
Principal component analysis (PCA). Principal component analysis seeks a space of lower dimensionality in which the variance of the projected data is maximized. It tries to model how the data are spread by maximizing the variance of the projected data: find d orthogonal vectors that represent the spreading tendency of the data (d << D); these d vectors are used as the basis of the new space.
Principal component analysis (PCA). Data representation: the data set X = {x_1, ..., x_n}, where each x_i is a D-dimensional vector. Projection of a data point onto a direction w: z_i = w^T x_i. Projection of the data set: z = (w^T x_1, ..., w^T x_n).
Principal component analysis (PCA). Variance in the projected space (the score function): Var(w) = w^T S w, where S is the covariance matrix of the centered data. Maximization: max_w w^T S w subject to w^T w = 1. Solution: (1) introduce a Lagrange multiplier: L(w, lambda) = w^T S w - lambda (w^T w - 1), and setting the derivative with respect to w to zero gives S w = lambda w; (2) solve this equation by eigendecomposition.
Principal component analysis (PCA). Find d orthogonal principal components: it can be shown that the d orthogonal principal components correspond to the d eigenvectors with the d largest eigenvalues. Intuitions: each eigenvalue is the variance of the data projected onto the corresponding eigenvector; the eigenvectors are mutually orthogonal. Determining d: discard the eigenvectors with small eigenvalues, limiting the loss of information to an acceptable level (usually 5%).
Principal component analysis (PCA). Conducting PCA: 1. Shift the data set to have zero mean. 2. Compute the covariance matrix S. 3. Conduct an eigendecomposition of S, and rank the eigenvectors according to their eigenvalues. 4. Determine the number of dimensions d of the new space. 5. Select the d eigenvectors with the d largest eigenvalues as the basis of the new space.
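A NumPy sketch of these five steps (the data matrix X is assumed to hold one instance per row; the example data is synthetic):

    import numpy as np

    def pca(X, d):
        """Project the rows of X onto the d principal components."""
        X_centered = X - X.mean(axis=0)              # 1. shift to zero mean
        S = np.cov(X_centered, rowvar=False)         # 2. covariance matrix
        eigvals, eigvecs = np.linalg.eigh(S)         # 3. eigendecomposition (S is symmetric)
        order = np.argsort(eigvals)[::-1]            #    rank eigenvectors by eigenvalue
        W = eigvecs[:, order[:d]]                    # 4-5. keep the d largest as the new basis
        return X_centered @ W, eigvals[order]

    X = np.random.randn(200, 5) @ np.random.randn(5, 5)   # hypothetical correlated data
    Z, variances = pca(X, d=2)
    print(variances[:2].sum() / variances.sum())     # fraction of variance retained by 2 components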
Applications of PCA. Visualizing the projected data: the data are projected into a 2D or 3D subspace using PCA, and the projected data are visualized for domain experts. Preprocessing for subsequent data mining algorithms, e.g., face recognition (eigenfaces).
Factor analysis. It seeks some latent (unobserved) factors whose linear combination can recover the observed variables. It assumes that the data are determined by these factors (through a linear combination) plus some random noise. The solution is not unique; additional constraints should be imposed to obtain a unique solution, e.g., a widely used constraint forces the elements of Λ (the loading matrix) to be close to 1 or 0, which makes it possible to group attributes.
Multidimensional scaling. It seeks coordinates in a low-dimensional space (usually 2D or 3D) that preserve the similarity relationships between data points. Starts with a similarity (distance) matrix instead of the data points themselves. Minimizes some pre-defined difference between the similarities in the original space and those in the new space. The new space can be obtained by a mathematical projection or by direct optimization over the coordinates. Remarks: Pros: more freedom to make non-linear mappings. Cons: only limitedly non-linear, since it tries to preserve the similarity relationships between each instance and all other instances.
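A sketch using scikit-learn's MDS starting from a precomputed distance matrix (the tooling and the random data are assumptions):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.manifold import MDS

    X = np.random.randn(100, 10)                     # hypothetical high-dimensional data
    D = squareform(pdist(X))                         # pairwise distance (dissimilarity) matrix

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    Z = mds.fit_transform(D)                         # 2-d coordinates chosen to minimize the stress
    print(Z.shape, mds.stress_)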
Isometric feature mapping (ISOMAP). It seeks a low-dimensional embedding of the data such that distances in the new space roughly equal the geodesic distances constructed through neighborhoods of instances. Key step: find the shortest path between any two points on a neighborhood graph.
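A sketch using scikit-learn's Isomap on the classic Swiss-roll data (the data set and parameter choices are assumptions):

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import Isomap

    X, _ = make_swiss_roll(n_samples=1000, random_state=0)

    # Build a k-nearest-neighbor graph; shortest paths on it approximate geodesic distances,
    # which the 2-d embedding then tries to preserve.
    iso = Isomap(n_neighbors=10, n_components=2)
    Z = iso.fit_transform(X)
    print(Z.shape)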
Let's move to Chapter 4.