Data Mining - Data
Dr. Jean-Michel RICHER, 2018
jean-michel.richer@univ-angers.fr

Outline
1. Introduction
2. Data preprocessing
3. PCA with R
4. Exercise and homework

1. Introduction

What we will cover
- preprocessing of data (normalization, standardization)
- sampling to train and test classifiers
- feature selection with PCA

Raw data collection
- datasets are not always in good shape: different sources of information (database, Excel, ...), missing data, errors
- a first pre-processing step is needed, for example scaling, translation, and rotation of images
- rearrange the data, remove rows with empty data, or impute values, i.e. replace missing values using certain statistics (mean, most common value, ...)

2. Data preprocessing

Normalization
- normalization and other feature scaling techniques are often mandatory to make comparisons between attributes measured on different scales (e.g., temperatures in Kelvin and Celsius)
- the term normalization is often used as a synonym for Min-Max scaling: rescaling attributes into a certain range, e.g., 0 to 1:

  X_norm = (X - X_min) / (X_max - X_min)
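The min-max formula above can be sketched in plain Python (a minimal illustration with made-up values, not tied to any library):

```python
def min_max_scale(values):
    """Rescale a list of numbers into the range [0, 1] using min-max scaling."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

# hypothetical temperature readings in Kelvin
temps = [250.0, 273.15, 300.0, 373.15]
scaled = min_max_scale(temps)
# the smallest value maps to 0.0 and the largest to 1.0
```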

Standardization
- another common approach is standardization, or scaling to unit variance: subtract the attribute's mean µ from every value and divide by the standard deviation σ, i.e. z = (x - µ) / σ
- attributes then have the properties of a standard normal distribution (µ = 0 and σ = 1)
- this is done so that no single attribute has more influence than the others
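Standardization can be sketched the same way in plain Python (population standard deviation assumed; the input values are made up):

```python
from statistics import mean, pstdev

def standardize(values):
    """Center values on their mean and scale to unit variance (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

z = standardize([2.0, 4.0, 6.0, 8.0])
# z now has mean 0 and (population) standard deviation 1
```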

Sampling
- split the dataset into a training and a test dataset to assess the precision of the method (classifier)
- the training dataset is used to train the model
- the test dataset evaluates the performance of the final model
- be aware of overfitting: classifiers that perform well on training data but do not generalize well
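The train/test split described above can be sketched in plain Python (the 70/30 ratio and the fixed seed are arbitrary choices for the example):

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle rows and split them into a training and a test subset."""
    rows = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

# with 150 samples (the size of the Iris dataset) and a 30% test ratio
train, test = train_test_split(list(range(150)))
# len(train) == 105, len(test) == 45, and together they cover all samples
```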

Cross-Validation
- one of the most useful techniques to evaluate different combinations of:
  - feature (property) selection (take only the interesting features)
  - dimensionality reduction (reduce for efficiency)
  - learning algorithms
- there are multiple flavors of cross-validation; the most common is k-fold cross-validation

K-fold cross-validation
- the original training dataset is split into k different subsets (the so-called folds)
- one fold is retained as the test set; the other k - 1 are used for training the model
- calculate the average error rate (and standard deviation) over the folds
- use the matrix of statistics to see if some fold differs from the others
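The k-fold procedure above can be sketched in plain Python (round-robin fold assignment is one simple choice; shuffling first would be the usual refinement):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train_idx, test_idx

# 10 samples, 5 folds: each sample is held out exactly once
splits = list(k_fold_indices(10, 5))
```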

Feature Selection and Dimensionality Reduction
- the key difference between the terms:
  - feature selection: keeps the original feature axes
  - dimensionality reduction: usually involves a transformation technique
- the main goals are:
  - noise reduction
  - increased computational efficiency by retaining only useful (discriminatory) information
  - avoiding overfitting

Feature Selection
- retain only those features that are meaningful and help build a good classifier, thereby reducing the number of attributes (features, properties)
- Linear Discriminant Analysis (LDA): supervised; computes the directions (linear discriminants) that maximize the separation between multiple classes
- Principal Component Analysis (PCA): unsupervised; ignores class labels and finds the principal components that maximize the variance in the dataset

LDA in a nutshell
- project a feature space (a dataset of k-dimensional samples) onto a smaller subspace of dimension h (where h ≤ k - 1) while maintaining the class-discriminatory information

PCA (Principal Component Analysis)
- reduce the dimensionality of a dataset consisting of a large number of interrelated variables
- retain as much as possible of the variation in the dataset
- achieved by transforming to a new set of uncorrelated variables, the principal components (PC), ordered so that the first few retain most of the variation present in all of the original variables

PCA Algorithm (1/2)
1. Take the whole dataset of k-dimensional samples, ignoring the class labels
2. Compute the k-dimensional mean vector (i.e., the mean of every dimension of the whole dataset)
3. Compute the scatter matrix (alternatively, the covariance matrix) of the whole dataset
4. Compute the eigenvectors (e_1, e_2, ..., e_k) and corresponding eigenvalues (λ_1, λ_2, ..., λ_k)

PCA Algorithm (2/2)
5. Sort the eigenvectors by decreasing eigenvalue and choose the h eigenvectors with the largest eigenvalues to form a k × h matrix W (where every column represents an eigenvector)
6. Use this k × h eigenvector matrix to transform the samples onto the new subspace: y = W^T x (where x is a k × 1 vector representing one sample and y is the transformed h × 1 sample in the new subspace)
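The steps above can be sketched end-to-end in plain Python for the 2-D case, where the covariance matrix is 2×2 and its eigendecomposition has a closed form (the sample points are made up; a real implementation would use a linear-algebra library):

```python
from math import sqrt

def pca_2d(points):
    """PCA on 2-D samples: center, build the covariance matrix, eigendecompose."""
    n = len(points)
    mx = sum(x for x, _ in points) / n        # step 2: mean vector
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    a = sum(x * x for x, _ in centered) / n   # step 3: covariance matrix
    c = sum(y * y for _, y in centered) / n   #   [[a, b],
    b = sum(x * y for x, y in centered) / n   #    [b, c]]
    # step 4: eigenvalues of a symmetric 2x2 matrix via the quadratic formula
    d = sqrt((a - c) ** 2 + 4 * b * b)
    l1, l2 = (a + c + d) / 2, (a + c - d) / 2
    # leading eigenvector (valid when b != 0), normalized to unit length
    vx, vy = b, l1 - a
    norm = sqrt(vx * vx + vy * vy)
    return (l1, l2), (vx / norm, vy / norm)

# points lying roughly on the line y = x: almost all variance falls on PC1
(l1, l2), pc1 = pca_2d([(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)])
```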

Types of PCA
- general: apply the algorithm to the data (no transformation of the data)
- centered: subtract the mean from the initial values
- reduced: first center, then reduce (if the variances are significantly different, divide each variable by its standard deviation)

3. PCA with R

Example of PCA with R
To have a better understanding of PCA we will use the Iris flowers dataset:
- 150 individuals (50 Setosa, 50 Versicolour, 50 Virginica)
- 5 properties:
  - petal length and width
  - sepal length and width (the sepal protects the flower and supports the petals when in bloom)
  - class (Setosa, Versicolour, Virginica)

Example of PCA with R
View the data (the dataset is already present in R):

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Example of PCA with R

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Example of PCA with R
Plot the data to get a better understanding (with hundreds of attributes this is not feasible):

> attach(iris)
> plot(iris[,1:4], pch=16)

What can we really see?

Example of PCA with R
Plot as a function of the classes (note that the column is iris$Species, with a capital S):

> attach(iris)
> pairs(iris[1:4], pch = 21,
        bg = c("red", "green3", "blue")[unclass(iris$Species)])

Example of PCA with R
In this case, if you compare Sepal.Length to Sepal.Width, the red dots are separated from the blue and green dots, but it will be difficult to differentiate blue from green. If you compare Petal.Length to Petal.Width, the differentiation is simple.

Example of PCA with R
Correlation matrix values:

> cor(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

Then plot the correlation matrix:

> library(corrplot)
> corrplot(cor(iris[,1:4]), method="color")

Example of PCA with R
Correlation explanation: if the correlation value is
- close to 1 (or -1), then the two vectors of values are strongly (positively or negatively) correlated
- close to 0, then they are not correlated
In this example Sepal.Length and Petal.Length are strongly correlated, so it is interesting to remove one of them.
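The values in the correlation matrix are Pearson coefficients, which can be sketched in plain Python (the input lists are made up for the example):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly correlated: r == 1.0
```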

Example of PCA with R
What we want to know:
- is there a difference between the species?
- which properties can explain those differences?

Example of PCA with R
Perform the PCA with R:

> pca = prcomp(iris[,1:4])

The principal components are obtained with pca$rotation.

Example of PCA with R
Principal components:

> pca$rotation
                     PC1         PC2         PC3        PC4
Sepal.Length  0.36138659 -0.65658877  0.58202985  0.3154872
Sepal.Width  -0.08452251 -0.73016143 -0.59791083 -0.3197231
Petal.Length  0.85667061  0.17337266 -0.07623608 -0.4798390
Petal.Width   0.35828920  0.07548102 -0.54583143  0.7536574

PC1 = 0.86 Petal.Length + 0.36 Sepal.Length + 0.36 Petal.Width - 0.08 Sepal.Width

Example of PCA with R
Principal component information: pca$sdev gives the standard deviation of each component; the percentage of variance explained is derived from pca$sdev^2:

> pca$sdev
[1] 2.0562689 0.4926162 0.2796596 0.1543862
> 100 * pca$sdev^2 / sum(pca$sdev^2)
[1] 92.4618723  5.3066483  1.7102610  0.5212184
> sum(100 * (pca$sdev^2)[1:1] / sum(pca$sdev^2))
[1] 92.46187
> sum(100 * (pca$sdev^2)[1:2] / sum(pca$sdev^2))
[1] 97.76852
> sum(100 * (pca$sdev^2)[1:3] / sum(pca$sdev^2))
[1] 99.47878

Already 92% of the variance is explained by the first component!
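The explained-variance arithmetic can be replayed in plain Python (the sdev values are copied from the pca$sdev output above):

```python
sdev = [2.0562689, 0.4926162, 0.2796596, 0.1543862]   # pca$sdev from R
var = [s * s for s in sdev]                           # variances = sdev^2
pct = [100 * v / sum(var) for v in var]               # % of total variance
# pct[0] is about 92.46; pct[0] + pct[1] is about 97.77
```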

Example of PCA with R
Plot the individuals as a function of the principal components:

> plot(pca$x[,1], rep(0, nrow(iris)), col=iris[,5])

or:

> library(ggfortify)
> df <- iris[c(1, 2, 3, 4)]
> autoplot(prcomp(df), data = iris, colour = 'Species')
> autoplot(prcomp(df), data = iris, colour = 'Species',
           label = TRUE, label.size = 3)

Example of PCA with R
Final interpretation:
- setosa (red) can be identified clearly
- as PC1 is composed of 0.86 Petal.Length, it is the length of the petal that best helps differentiate the species

More information about PCA
Follow these links:
http://www.sthda.com/english/wiki/print.php?id=206
https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

4. Exercise and homework

Exercise about PCA
PCA with Python:
- install the scikit-learn package and try to do the same as in R
- display information: data, correlation matrix
- perform the PCA and print the results
- try to plot the data

Homework about PCA
PCA with Weka:
- write a report explaining how to perform PCA using Weka
- give details of each step

End

University of Angers - Faculty of Sciences
UA - Angers, 2 Boulevard Lavoisier, 49045 Angers Cedex 01
Tel: (+33) (0)2-41-73-50-72