Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 1st, 2018


Exploratory Data Analysis & Feature Construction How to explore a dataset Understanding the variables (values, ranges, and empirical distribution) Finding easy relationships between variables Building features from data How to better represent the data for various data mining tasks 2

Data exploration and visualization

Visualization
- The human eye and brain have evolved powerful methods to detect structure in nature
- Display data in ways that exploit human pattern recognition abilities
- Limitation: can be difficult to apply if the data is large (in number of dimensions or instances)

Exploratory data analysis
A data analysis approach that employs a number of (mostly graphical) techniques to:
- Maximize insight into data
- Uncover underlying structure
- Identify important variables
- Detect outliers and anomalies
- Test underlying modeling assumptions
- Develop parsimonious models
- Generate hypotheses from data

Visualizing/summarizing data
Low-dimensional data:
- Summarizing data with simple statistics
- Plotting raw data (1D, 2D, 3D)
Higher-dimensional data:
- Principal component analysis
- Multidimensional scaling
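
Not part of the slides: both higher-dimensional techniques are available in base R. A minimal sketch, using the built-in USArrests dataset as an illustrative choice:

# principal component analysis on standardized features
pca <- prcomp(USArrests, scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
# classical multidimensional scaling from pairwise distances
mds <- cmdscale(dist(scale(USArrests)), k = 2)
plot(mds[, 1], mds[, 2], xlab = "Coordinate 1", ylab = "Coordinate 2")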

Data summarization
Measures of location:
- Mean: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x(i)$
- Median: value with 50% of points above and below
- First (third) quartile: value with 25% (75%) of points below
- Mode: most common value

Data summarization
Measures of dispersion or variability:
- Variance: $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x(i) - \hat{\mu})^2$
- Standard deviation: $\hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x(i) - \hat{\mu})^2}$
- Range: difference between the max and min points
- Interquartile range: difference between the 1st and 3rd quartiles
- Skew: $\frac{\sum_{i=1}^{n} (x(i) - \hat{\mu})^3}{\left(\sum_{i=1}^{n} (x(i) - \hat{\mu})^2\right)^{3/2}}$
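
All of these summaries are one-liners in base R. A quick sketch on a hypothetical sample vector x (the data values are illustrative, not from the slides):

x <- c(2, 4, 4, 4, 5, 5, 7, 9)                      # hypothetical sample
mean(x); median(x); quantile(x)                     # location: mean, median, quartiles
var(x); sd(x)                                       # dispersion (note: R divides by n-1, the slide by n)
diff(range(x)); IQR(x)                              # range and interquartile range
sum((x - mean(x))^3) / sum((x - mean(x))^2)^(3/2)   # skew, as defined above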

Histograms (1D)
Most common plot for univariate data: split the data range into equal-sized bins and count the number of data points that fall into each bin
Graphically shows:
- Center (location)
- Spread (scale)
- Skew
- Outliers
- Multiple modes

Problem with Standard Histograms
Test the following in R:

hist(c(0,0,0,2,3), breaks=c(0, 1-1e-2, 2, 3))
hist(c(0,0,0,2,3), breaks=c(0, 0.5, 1-1e-2, 2, 3))
hist(c(0,0,0,2,3), breaks=c(0, 0.5, 1-1e-2, 1.5, 2, 2.5, 3))

Standard histograms give inconsistent results for continuous data: the visualization is highly dependent on the binning

Quantile Histogram (better)
[Figure: "Histogram of DEM Batches (Average Donations)"; x-axis: Average Amount (USD), 1500-3000; y-axis: Density]
Each bin has approximately the same number of data points
(Probability and Statistics for Engineering and the Sciences, Jay Devore, 9th Ed., Brooks Cole)

Quantile Histogram (better)
[Figure: two "Histogram of DEM Batches (Average Donations)" panels, quantile histogram vs. standard histogram; x-axis: Average Amount (USD), 1500-3000; y-axis: Density]

R code (for the example)

dem = read.csv("dem_donations_2012.csv", header=TRUE)
gop = read.csv("gop_donations_2012.csv", header=TRUE)
demdonations = as.matrix(dem$donation[dem$donation > 0])
gopdonations = as.matrix(gop$donation[gop$donation > 0])

N = round(nrow(demdonations)/200)
# randomly split DEMs into N groups of average length nrow(demdonations)/N
demsplit = as.list(split(demdonations, sample(1:N, nrow(demdonations), replace=TRUE)))
# find average of each bin
dem_bin_averages = unlist(lapply(demsplit, mean))
# find 20 quantiles of the batch averages
quantiles_dem = quantile(dem_bin_averages, probs = seq(0, 1, 1/20))
# plot histogram
hist(dem_bin_averages, main="Histogram of DEM Batches (Average Donations)",
     freq=FALSE, breaks=quantiles_dem, xlim=c(1500,3000))

N = round(nrow(gopdonations)/200)
# randomly split GOP into N groups of average length nrow(gopdonations)/N
gopsplit = split(gopdonations, sample(1:N, nrow(gopdonations), replace=TRUE))
# find average of each bin
gop_bin_averages = unlist(lapply(gopsplit, mean))
# find 20 quantiles of the batch averages
quantiles_gop = quantile(gop_bin_averages, probs = seq(0, 1, 1/20))
# plot histogram
hist(gop_bin_averages, main="Histogram of GOP Batches (Average Donations)",
     freq=FALSE, breaks=quantiles_gop, xlim=c(1500,3000))

Alternative to Histograms: Density plots
The estimated density is a sum of kernels, one centered at each data point:
$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x(i)}{h}\right)$
Two parameters:
- Kernel function K (e.g., Gaussian, Epanechnikov)
- Bandwidth h
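
In base R, density() computes this estimator. A minimal sketch; the dataset (built-in Old Faithful eruption times) and the bandwidth value are illustrative choices, not from the slides:

x <- faithful$eruptions                        # built-in eruption-duration data
plot(density(x, kernel = "gaussian", bw = 0.3),
     main = "Kernel density estimate")         # try bw = 0.05 or 1.0 to see its effect
rug(x)                                         # mark each data point on the axis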

Bar plots

Box plot (2D)
- Displays the relationship between a discrete and a continuous variable
- For each discrete value x, calculate the quartiles and range of the associated y values
Ack: João Elias Vidueira Ferreira
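
A one-line sketch in base R, using the built-in InsectSprays data (an illustrative choice) as the discrete/continuous pair:

# one box per discrete value of spray, summarizing the associated counts
boxplot(count ~ spray, data = InsectSprays,
        xlab = "Spray (discrete)", ylab = "Count (continuous)")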

Scatter plot (2D)
Most common plot for bivariate data
- Horizontal X axis: the suspected independent variable
- Vertical Y axis: the suspected dependent variable
Graphically shows:
- Whether X and Y are related
- Linear or non-linear relationship
- Whether the variation in Y depends on X
- Outliers

No relationship

Linear relationship

Non-linear relationship

Homoskedastic (equal variance)

Heteroskedastic (unequal variance)

Scatterplot limitations
Overprinting: many points drawn at the same location hide each other
Solution: jitter the points, as in the sketch below
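
A minimal base-R sketch of jittering, on hypothetical integer-valued data:

# hypothetical discrete data: many (x, y) pairs coincide exactly
x <- sample(1:5, 200, replace = TRUE)
y <- sample(1:5, 200, replace = TRUE)
plot(x, y)                     # overprinting: at most 25 visible points
plot(jitter(x), jitter(y))     # small random noise separates coincident points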

Contour plot (3D)
- Represents a 3D surface by plotting constant-z slices (contours) in a 2D format
- Can overcome some limitations of the 2D scatterplot (e.g., when there is too much data to discern the relationship)
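
In base R, contour() draws these constant-z slices. A short sketch with the built-in volcano elevation matrix (an illustrative choice):

contour(volcano)           # constant-elevation slices of a 3D surface
filled.contour(volcano)    # the same slices with color-filled levels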

http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html

Constructing Features from Data

Whitening (Normalization)
For numerical features (not categorical), it is common to whiten each feature by subtracting its mean and dividing by its standard deviation
Which dimension to whiten over depends on the application:
- Netflix & Amazon: user ratings should be whitened within a user: subtract the same user's average rating and divide by his/her standard deviation
- Images should be whitened with respect to the entire image
- Other features, such as age, should be whitened across all users
For regularization, this helps all the features be penalized in the same units; that is, we are assuming they all have the same variance σ²
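
Base R's scale() performs exactly this centering and scaling, column by column. A minimal sketch on a hypothetical feature matrix (the features and their distributions are made up for illustration):

# hypothetical 100 x 3 feature matrix with very different scales
X <- cbind(age    = runif(100, 18, 90),
           income = rnorm(100, 50000, 15000),
           visits = rpois(100, 3))
Xw <- scale(X)        # subtract each column's mean, divide by its standard deviation
colMeans(Xw)          # ~0 for every feature
apply(Xw, 2, sd)      # 1 for every feature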

The Curse of Dimensionality
- For classification tasks with universal approximators (e.g., deep neural networks and decision trees), high-dimensional data is OK
- For simpler, linear discriminators that rely on inner products between data points (e.g., logistic and linear regression, SVMs), high-dimensional data is a curse
- Data in only one dimension is relatively packed
- Adding a dimension stretches the points across that dimension, moving them further apart
- Adding more dimensions moves the points further apart still: high-dimensional data is extremely sparse
- Distance measures become meaningless (see the sketch below)
(ack Evimaria Terzi, Parsons et al.)
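
A small R experiment (my own illustration, not from the slides) showing how distances concentrate as the dimension grows: the nearest and farthest pairs become nearly equidistant.

set.seed(1)
for (d in c(1, 10, 100, 1000)) {
  X <- matrix(runif(100 * d), nrow = 100)   # 100 random points in [0,1]^d
  D <- as.vector(dist(X))                   # all pairwise Euclidean distances
  cat(sprintf("d = %4d   max/min distance ratio = %.2f\n", d, max(D) / min(D)))
}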

Kernels (for linear discriminators)
A kernel K is a form of similarity function (it replaces the inner product)
K(u,v) > 0 is the similarity between vectors u, v ∈ X
Mercer's theorem: for every continuous symmetric positive semi-definite kernel K there is a feature vector function φ such that K(u,v) = φ(u)·φ(v)
Fig ack Tommi Jaakkola

Some Common Kernels
- Polynomials of degree up to d: $K(u, v) = (u^T v + c)^d$
- Gaussian/Radial kernels (polynomials of all orders; the projected space has infinite dimension): $K(u, v) = \exp\!\left(-\frac{\|u - v\|^2}{2\sigma^2}\right)$
- Sigmoid: $K(u, v) = \tanh(a\, u^T v + c)$
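
Each of these is a one-liner in R. A sketch with illustrative default parameters (the values of c, d, a, and sigma are free choices, not from the slides):

poly_kernel    <- function(u, v, c = 1, d = 3) (sum(u * v) + c)^d          # polynomial
rbf_kernel     <- function(u, v, sigma = 1)    exp(-sum((u - v)^2) / (2 * sigma^2))  # Gaussian
sigmoid_kernel <- function(u, v, a = 1, c = 0) tanh(a * sum(u * v) + c)    # sigmoid

u <- c(1, 2); v <- c(2, 0.5)
poly_kernel(u, v); rbf_kernel(u, v); sigmoid_kernel(u, v)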

Other Forms of Dimensionality Reduction
Dataset X consisting of n points in a d-dimensional space
Data point x_i ∈ R^d (a d-dimensional real vector): x_i = [x_{i1}, x_{i2}, ..., x_{id}]
Dimensionality reduction methods:
- Feature selection: choose a subset of the features
- Feature extraction: create new features by combining the existing ones

Dimensionality reduction
Dimensionality reduction methods:
- Feature selection: choose a subset of the features
- Feature extraction: create new features by combining the existing ones
Both methods map each vector x_i ∈ R^d to a vector y_i ∈ R^k (k << d):
F: R^d → R^k

Random Projections
It is also possible to learn models & classifiers over random projections of the data
Johnson-Lindenstrauss Lemma: given a set S ⊆ R^n, if we perform an orthogonal projection of those points onto a random d-dimensional subspace, then d = O(γ⁻² log |S|) is sufficient so that, with high probability, all pairwise distances are preserved up to a factor of 1 ± γ

Finding random projections
Vectors x_i ∈ R^d are projected onto a k-dimensional space (k << d)
Random projections can be represented by a linear transformation matrix R:
y_i = R x_i
What is the matrix R?

Finding matrix R
Elements $R_{i,j}$ can be Gaussian distributed
Achlioptas* has shown that the Gaussian distribution can be replaced by
$R_{i,j} = \sqrt{3} \times \begin{cases} +1 & \text{with prob } 1/6 \\ \phantom{+}0 & \text{with prob } 2/3 \\ -1 & \text{with prob } 1/6 \end{cases}$
All zero-mean, unit-variance distributions for $R_{i,j}$ give a mapping that satisfies the Johnson-Lindenstrauss lemma (a sketch of the construction follows below)
* Achlioptas, Journal of Computer and System Sciences, 2003
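
A minimal R sketch of this construction. The 1/√k rescaling is the usual normalization for approximately preserving distances; it is an assumption on my part, not written on the slide:

# project an n x d data matrix X down to k dimensions with a sparse Achlioptas matrix
random_projection <- function(X, k) {
  d <- ncol(X)
  entries <- sqrt(3) * sample(c(1, 0, -1), size = k * d,
                              replace = TRUE, prob = c(1/6, 2/3, 1/6))
  R <- matrix(entries, nrow = k, ncol = d)   # zero-mean, unit-variance entries
  t(R %*% t(X)) / sqrt(k)                    # row i is y_i = R x_i (rescaled)
}

X <- matrix(rnorm(100 * 1000), nrow = 100)   # 100 points in R^1000
Y <- random_projection(X, k = 50)
cor(as.vector(dist(X)), as.vector(dist(Y)))  # pairwise distances roughly preserved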