Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University


Exploratory data analysis tasks
Examine the data in search of structures that may indicate deeper relationships between instances or attributes: a data-driven hypothesis-generation process.
Key point: how do we describe the data so as to reveal its characteristics?
Data description methods:
- Summarizing the data with statistics
- Portraying the distribution of the data set

Summarizing data: Location
- Mean: the average of a collection of values
- Median: the value with an equal number of data points above and below it
- Quartile: the n-th quartile is the value that is greater than n quarters of the data points (n = 1, 2, 3)
- Percentile: the n-th percentile is the value that is greater than n% of the data points (n = 1, ..., 99)
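
A quick sketch of these location summaries, assuming NumPy is available (the sample data is made up for illustration):

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=1000)  # made-up sample

print(np.mean(x))                      # mean: the average value
print(np.median(x))                    # median: equal mass above and below
print(np.percentile(x, [25, 50, 75]))  # 1st, 2nd, 3rd quartiles
print(np.percentile(x, 90))            # 90th percentile
```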

Summarizing data: other properties
- Mode (quantity): the most common value
- Variance (dispersion): the average squared difference between the individual values and the mean
- Interquartile range (dispersion): the difference between the 3rd quartile and the 1st quartile
- Skewness (shape): whether the distribution has a single long tail
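
A similar sketch for these properties, assuming NumPy and SciPy (the samples are again invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                        # made-up sample

print(stats.mode(rng.integers(0, 5, 100)).mode)  # mode of a discrete sample
print(np.var(x))                                 # variance
q1, q3 = np.percentile(x, [25, 75])
print(q3 - q1)                                   # interquartile range
print(stats.skew(x))                             # skewness: sign shows the long tail's side
```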

Data visualization: a first step toward data mining
Data visualization is an important way of identifying deep relationships.
Pros:
- straightforward
- usually interactive
- ideal for sifting through data to find unexpected relationships
Cons:
- requires trained people to interpret the results
- may not work well for large data sets: too many details can obscure the interesting patterns

Widely used methods for visualization
- Displaying a single attribute
- Displaying the relationship between two attributes
- Displaying the relationships among multiple attributes
- Displaying the important structure of the data in a reduced number of dimensions

Tools for displaying a single attribute (I): Histogram
- Shows the number of occurrences of each value of a nominal attribute, or the number of values of a numerical attribute that lie in consecutive intervals
- If the data set is small, random fluctuations and alternative choices of bin boundaries may noticeably change the diagram
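
A small sketch with matplotlib showing how the choice of bins can change the picture for a small sample (data invented):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.default_rng(0).normal(size=100)   # a small, made-up sample

# With few points, the shape of the histogram shifts with the bin boundaries.
fig, axes = plt.subplots(1, 2, sharey=True)
axes[0].hist(x, bins=8)
axes[1].hist(x, bins=25)
plt.show()
```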

Tools for displaying a single attribute (II): Kernel estimates (density estimation)
- Spread the contribution of each observed data point over the whole range:
  f_hat(x) = (1/(n*h)) * sum_{i=1..n} K((x - x_i)/h)
  where K(.) is called the kernel function, usually a smooth unimodal function with a peak at 0
- The most commonly used form is the Gaussian kernel: K(t) = (1/sqrt(2*pi)) * exp(-t^2/2)
- The quality of the estimate depends more on the bandwidth h than on the shape of K
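
A from-scratch sketch of the Gaussian kernel estimate above, varying the bandwidth h (the helper function and data are illustrative, not from the lecture):

```python
import numpy as np
import matplotlib.pyplot as plt

def kde(data, grid, h):
    """f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h) with a Gaussian kernel K."""
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

data = np.random.default_rng(0).normal(size=100)     # made-up sample
grid = np.linspace(-4, 4, 400)
for h in (0.1, 0.4, 1.0):                            # the bandwidth h drives the quality
    plt.plot(grid, kde(data, grid, h), label=f"h = {h}")
plt.legend()
plt.show()
```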

Tools for displaying a single attribute (III): Box plots
- Graphically depict groups of numerical data through their five-number summaries
- Five-number summary: (Min, Q1, Median, Q3, Max)
- Whiskers extend at most 1.5*(Q3-Q1) beyond the box; points beyond them are drawn individually
- Functionality: identifying possible outliers
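
A sketch of the five-number summary and the 1.5*IQR rule, assuming NumPy and matplotlib (the skewed sample is invented):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.default_rng(0).lognormal(size=200)     # skewed sample, so outliers appear

q1, med, q3 = np.percentile(x, [25, 50, 75])
lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)  # whisker fences
print("five-number summary:", x.min(), q1, med, q3, x.max())
print("possible outliers:", x[(x < lo) | (x > hi)])

plt.boxplot(x)   # matplotlib's default whiskers use the same 1.5*IQR rule
plt.show()
```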

Tools for displaying a pair of attributes (I): Scatterplot
- Each pair of values belonging to the same instance is treated as a 2-d coordinate
- Functionality: revealing the correlation of the two variables
- Might not be useful for large data sets (especially with long-tailed distributions): the 0-1 representation (a point is either drawn or not) ignores how often a given coordinate appears
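
A sketch of a scatterplot on a larger sample; lowering the point opacity is one common workaround for the 0-1 overplotting problem (an addition here, not from the slide):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=50_000)                 # made-up large sample
y = 0.7 * x + rng.normal(size=50_000)

plt.scatter(x, y, s=2, alpha=0.05)          # low opacity lets point density show through
plt.show()
```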

Tools for displaying a pair of attributes (II): Contour plot
- Plots 2-d density contours of the two variables concerned, where the density is estimated from the observed data points
- Functionality: revealing the correlation of the two variables in terms of their joint distribution
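
A sketch of a contour plot over a density estimated with scipy.stats.gaussian_kde (data invented):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.normal(size=500)                              # made-up sample
y = 0.6 * x + rng.normal(scale=0.5, size=500)

density = gaussian_kde(np.vstack([x, y]))             # estimate the joint density
gx, gy = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
z = density(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
plt.contour(gx, gy, z)                                # density contours of the pair
plt.show()
```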

Tools for displaying a pair of attributes (III): Loess curve
- A loess curve is a local regression of the points in a scatterplot
- Functionality: providing a better perception of the pattern of dependence
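
A sketch using the lowess smoother from statsmodels as a stand-in for a loess curve (data invented):

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))                  # made-up sample
y = np.sin(x) + rng.normal(scale=0.3, size=200)

smoothed = lowess(y, x, frac=0.3)                     # (x, fitted) pairs, sorted by x
plt.scatter(x, y, s=8)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red") # the local-regression curve
plt.show()
```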

Tools for displaying multiple attributes (I): Scatterplot matrix (pseudo-multivariate)
- Aligns scatterplots of every pair of attributes
- Functionality: revealing the correlation of any two attributes
- A pseudo-multivariate tool: it is a collection of bivariate views rather than a truly multivariate one
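
A sketch using pandas' scatter_matrix (the random DataFrame is illustrative):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))  # made-up data

scatter_matrix(df, diagonal="hist")   # one scatterplot per pair of attributes
plt.show()
```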

Tools for displaying multiple attributes (II): Trellis plot
- Fix a particular pair of attributes to display, then produce a series of scatterplots conditioned on the levels of one or more other attributes
- Functionality: revealing the correlation of two attributes while taking the values of other attributes into account
- Multivariate to some extent
- Any type of graph can be used in the panels besides scatterplots
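
A sketch of a trellis-style display using seaborn's FacetGrid on its bundled "tips" example data (seaborn is an assumption here, not the lecture's tool):

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                      # bundled example data
g = sns.FacetGrid(tips, col="time", row="smoker")    # condition on two other attributes
g.map(plt.scatter, "total_bill", "tip", s=10)        # same pair plotted in every panel
plt.show()
```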

Tools for displaying multiple attributes (III): Icon plots
- Represent each instance as a multidimensional symbol (icon)
- Functionality: allowing instances to be compared through their multivariate profiles
- More difficult to read; fails to scale to large data sets
- Reveals individual characteristics rather than global distributional information

Tools for displaying multiple attributes (IV): Parallel coordinates plot
- Represents each instance as a piecewise-linear curve connecting the measured values of that instance across parallel axes
- Functionality: allowing instances to be compared through their multivariate profiles
- More difficult to read; fails to scale to large data sets
- Reveals individual characteristics rather than global distributional information
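
A sketch using pandas' parallel_coordinates on the iris data (loaded via seaborn purely for convenience):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import parallel_coordinates

iris = sns.load_dataset("iris")         # bundled example data
parallel_coordinates(iris, "species")   # one piecewise-linear curve per instance
plt.show()
```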

Discovering important structures from data
Problems with multivariate plots:
- Difficult to read
- Fail to capture the global distributional information
- Fail to scale to large data sets
- Too many details are represented, which obscures the interesting patterns in the data
That's it! Find the interesting structures in the data while eliminating unnecessary details.

Dimensionality reduction
- Reduce the number of dimensions (attributes) of the data while preserving its intrinsic characteristics
- Try to identify hidden aspects that determine the characteristics of the data
Linear methods:
- Principal component analysis (PCA)
- Linear discriminant analysis (LDA)
- Factor analysis
Non-linear methods:
- KPCA, KLDA
- Multidimensional scaling
- Manifold learning methods

Principal component analysis (PCA)
- What is the hidden, interesting aspect (structure) that controls the data?
- Principal component: the spreading tendency of the data

Principal component analysis (PCA)
- PCA seeks a space of lower dimensionality in which the variance of the projected data is maximized
- It models how the data are spread by maximizing the variance of the projected data
- It finds d orthogonal vectors that represent the spreading tendency of the data (d << D)
- These d vectors are used as the basis of the new space

Principal component analysis (PCA)
Data representation:
- A data set of n points x_1, ..., x_n, each x_i in R^D, stacked as a matrix X in R^{n x D}
- Projection of a data point onto a unit vector w: z_i = w^T x_i
- Projection of the data set: z = X w

Principal component analysis (PCA)
Variance in the projected space (the score function): w^T S w, where S is the covariance matrix of the data
Maximization:
  max_w w^T S w   s.t.  w^T w = 1
Solution:
(1) Introduce a Lagrange multiplier lambda, giving the condition S w = lambda w
(2) Solve this equation by eigendecomposition

Principal component analysis (PCA)
Find d orthogonal principal components:
- It can be shown that the d orthogonal principal components are the d eigenvectors with the d largest eigenvalues
Intuitions:
- An eigenvalue is the variance of the data projected onto the corresponding eigenvector
- Eigenvectors are mutually orthogonal
Determine d:
- What to discard: the eigenvectors with small eigenvalues
- Limit the loss of information to an acceptable level (often 5%)

Principal component analysis (PCA)
Conducting PCA:
1. Shift the data set to have zero mean
2. Compute the covariance matrix S
3. Conduct an eigendecomposition of S, and rank the eigenvectors by their eigenvalues
4. Determine the number of dimensions d of the new space
5. Select the d eigenvectors with the d largest eigenvalues as the basis of the new space
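
A from-scratch sketch that follows these five steps with NumPy (the function name and data are illustrative):

```python
import numpy as np

def pca(X, d):
    Xc = X - X.mean(axis=0)                  # 1. shift to zero mean
    S = np.cov(Xc, rowvar=False)             # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # 3. eigendecomposition (S is symmetric)
    order = np.argsort(eigvals)[::-1]        #    rank eigenvectors by eigenvalue
    W = eigvecs[:, order[:d]]                # 4-5. top-d eigenvectors as the new basis
    return Xc @ W, eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # made-up correlated data
Z, variances = pca(X, d=2)
print(variances / variances.sum())           # fraction of variance along each component
```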

Applications of PCA
Visualizing the projected data:
- Data are projected into a 2D or 3D subspace using PCA
- Visualize the projected data for domain experts
For subsequent data mining algorithms:
- e.g., face recognition with eigenfaces

Factor analysis
- Seeks latent (unobserved) factors whose linear combination recovers the observed variables
- Assumes the data are determined by these factors (through a linear combination) plus some random noise
- The solution is not unique: additional constraints must be imposed to obtain a unique solution
- e.g., a widely used constraint forces the elements of the loading matrix Lambda to be close to 1 or 0, which makes it possible to group attributes

Multidimensional scaling (MDS)
- Seeks coordinates in a low-dimensional space (usually 2D or 3D) that preserve the similarity relationships between data points
- Starts with a similarity (distance) matrix instead of the data points themselves
- Minimizes some pre-defined difference between the similarities in the original space and those in the new space
- The new space can be obtained by a mathematical projection or by direct optimization over the coordinates
Remarks:
- Pros: more freedom to produce a non-linear mapping
- Cons: only limited non-linearity, since it tries to preserve the similarity relationships between each instance and all other instances
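
A sketch using scikit-learn's MDS, starting from a precomputed distance matrix as described above (data invented):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

X = np.random.default_rng(0).normal(size=(50, 10))   # made-up data
D = squareform(pdist(X))                             # start from a distance matrix

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Z = mds.fit_transform(D)                             # 2-d coordinates preserving D
print(Z.shape)                                       # (50, 2)
```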

Isometric feature mapping (ISOMAP)
- Seeks a low-dimensional embedding of the data such that distances in the new space roughly equal the geodesic distances constructed through neighborhoods of instances
- Key step: find the shortest path between any two points on a neighborhood graph
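
A sketch using scikit-learn's Isomap on the classic swiss-roll manifold (the parameter values are illustrative):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)       # a curved 2-d sheet in 3-d
Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)  # geodesics via a k-NN graph
print(Z.shape)                                               # (1000, 2)
```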

Let's move to Chapter 4.