Chapter II Multiple Correspondence Analysis (MCA)


Master MMAS - University of Bordeaux
Marie Chavent

Introduction

How to get information from a categorical data table of individuals × variables?

Example: a categorical data table where 27 dogs are described by 6 variables.

# load("chiens.Rdata")
load("dogs.Rdata")
print(data[1:8,])
##                Size Weight Velocity Intelligence Affectivity Aggressivness
## Beauceron       S++     W+      V++           I+         Af+           Ag+
## BassetHound      S-     W-       V-           I-         Af-           Ag+
## GermanShepherd  S++     W+      V++          I++         Af+           Ag+
## Boxer            S+     W+       V+           I+         Af+           Ag+
## Bulldog          S-     W-       V-           I+         Af+           Ag-
## BullMastiff     S++    W++       V-          I++         Af-           Ag+
## Poodle           S-     W-       V+          I++         Af+           Ag-
## Chihuahua        S-     W-       V-           I-         Af+           Ag-

- Which individuals are similar?
- Which variables are linked?

By looking at the matrix of the distances between the individuals?

d <- dist(data)
as.matrix(d)[1:5,1:5]
##                Beauceron BassetHound GermanShepherd Boxer Bulldog
## Beauceron              0          NA             NA    NA      NA
## BassetHound           NA           0             NA    NA      NA
## GermanShepherd        NA          NA              0    NA      NA
## Boxer                 NA          NA             NA     0      NA
## Bulldog               NA          NA             NA    NA       0

dist computes a Euclidean distance, which is not defined for factors, hence the NAs. How do we measure the distance between two rows of categorical data?

By looking at the matrix of the χ² statistics of independence between the pairs of variables?

p <- ncol(data) ; chi2 <- matrix(NA,p,p) ; pval <- matrix(NA,p,p)
rownames(pval) <- colnames(pval) <- rownames(chi2) <- colnames(chi2) <- colnames(data)
for (j in 1:p)
  for (k in 1:p) {
    tab <- table(data[,j],data[,k])
    chi2[j,k] <- chisq.test(tab)$statistic
    pval[j,k] <- chisq.test(tab)$p.value
  }
print(chi2,digits=2) # value of the chi2 statistic
print(round(pval,digits=3),digits=2) # p-value of the test of independence

[6 × 6 matrices of the χ² statistics and of the p-values over the pairs of variables Size, Weight, Velocity, Intelligence, Affectivity, Aggressivness]

By applying a multivariate statistical method?

Multiple Correspondence Analysis (MCA) gives graphical representations of the distances between the individuals and of the links between the categorical variables and their levels.

library(FactoMineR)
res <- MCA(data,graph=FALSE)
plot(res,choix="ind",invisible="var", title="",cex=1.5)
plot(res,choix="ind",invisible="ind", title="",cex=1.5)
plot(res,choix="var",invisible="ind", title="",cex=1.5)

[Three factor maps on Dim 1 (28.90%) × Dim 2 (23.08%): the 27 individuals, the 16 levels, and the 6 variables]

MCA is also a method of dimension reduction: it gives a small number of new synthetic numerical variables summarizing the initial variables.

Categorical data: the 6 initial categorical variables
[table of the first 8 dogs on Size, Weight, Velocity, Intelligence, Affectivity, Aggressivness, as above]

Numerical data: 3 synthetic numerical variables
[table of the first 8 dogs on Dim 1, Dim 2, Dim 3]

MCA is then also a method to transform categorical data into numerical data.

Plan

1 Basic notions
2 The MCA algorithm
3 Different implementations of MCA
4 Interpretation of the results

1 Basic notions

Let us consider a data table where n individuals are described by p categorical variables. Let:

- X = (x_{ij})_{n×p} denote the original data matrix, with x_{ij} ∈ M_j, where M_j is the set of the levels of the j-th variable,
- m_j = card(M_j) denote the number of levels of the j-th variable,
- m = m_1 + ... + m_p denote the total number of levels.

Example: categorical data with n = 27 individuals, p = 6 variables and m = 16 levels.

print(data[1:8,])
[table of the first 8 dogs on the 6 variables, as in the Introduction]

Levels of the variables: S-, S+, S++ (Size), W-, W+, W++ (Weight), etc.

Two approaches for recoding the categorical data into numerical data:
- build the disjunctive table, where each level is coded as a binary variable,
- build the Burt table (Anglo-Saxon approach), which gathers the contingency tables of all the pairs of variables.

The disjunctive table K = (k_{is})_{n×m} describes the n individuals on the m levels. Each column s is the indicator vector of the level s, with:

k_{is} = 1 if individual i has level s,
k_{is} = 0 otherwise.

Let n_s denote the number of individuals having level s (the total of column s).

Disjunctive table of the m = 16 levels:

library(FactoMineR)
K <- tab.disjonctif(data)
print(K[1:4,])
[binary table of the first 4 dogs on the 16 levels S-, S+, S++, W-, W+, W++, V-, V+, V++, I-, I+, I++, Af-, Af+, Ag-, Ag+]

Frequencies n_s of the levels:

ns <- apply(K,2,sum)
print(ns)

Relative frequencies n_s/n of the levels:

n <- nrow(K)
print(ns/n)

Centered disjunctive table

- The n rows of the matrix K (the disjunctive table) define a cloud of n points in R^m.
- Each individual i is weighted by w_i, and usually w_i = 1/n.

Matrix K of the original recoded data: column s has mean n_s/n.

Matrix Z of the centered data: z_{is} = k_{is} - n_s/n, so column s has mean 0 and variance (n_s/n)(1 - n_s/n).

Verify that var(z_s) = (n_s/n)(1 - n_s/n), where z_s ∈ R^n denotes the s-th column of Z (a numerical check follows below).
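A minimal numerical check of this property in R, assuming K, ns and n from the chunk above (note that the variance here uses the weights 1/n, i.e. it divides by n, whereas R's var divides by n-1):

Z <- scale(K, center=TRUE, scale=FALSE)        # centered disjunctive table
popvar <- apply(Z, 2, function(z) mean(z^2))   # column variances with weights 1/n
all.equal(unname(popvar), unname(ns/n * (1 - ns/n)))  # TRUE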

Distance between two individuals

- A weight m_s is associated with each level s in order to give more importance to rare levels: m_s = n/n_s.
- The metric M = diag(n/n_s, s = 1, ..., m), the diagonal matrix of the weights of the columns, gives:

d_M^2(z_i, z_{i'}) = \sum_{s=1}^{m} \frac{n}{n_s} (z_{is} - z_{i's})^2 = \sum_{s=1}^{m} \frac{n}{n_s} (k_{is} - k_{i's})^2

Two individuals are different if they have different levels, with more weight in the distance for rare levels (n_s small).

Example:

[first 5 rows of the disjunctive table K and the relative frequencies n_s/n of the levels, as above]

Squared distance between the two first dogs (Beauceron and BassetHound):

d_M^2(z_1, z_2) = \frac{n}{n_{S-}}(0-1)^2 + \frac{n}{n_{S+}}(0-0)^2 + \dots + \frac{n}{n_{Ag+}}(1-1)^2
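A short sketch computing this weighted squared distance directly, assuming K and ns from above (since the row differences cancel the centering, K can be used in place of Z):

d2 <- function(i, ip, K, ns) {
  n <- nrow(K)
  ki  <- as.numeric(K[i,])
  kip <- as.numeric(K[ip,])
  sum((n/ns) * (ki - kip)^2)   # chi2-type weighted squared distance
}
d2(1, 2, K, ns)   # Beauceron vs BassetHound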

Inertia of the disjunctive table

We have seen in the slides about the basic notions for PCA that:
- centering the data doesn't change the distances between the individuals, and hence the inertia,
- the inertia of a data table is the (weighted) sum of the variances of its columns.

In the particular case of a disjunctive table K this gives I(K) = I(Z), where Z is the centered disjunctive table, and:

I(Z) = \sum_{s=1}^{m} m_s \, var(z_s),

where m_s is the weight of the column (the level) s.

- When the rows are weighted by 1/n and the columns are weighted by m_s = n/n_s, this gives:

I(Z) = \sum_{s=1}^{m} \left(1 - \frac{n_s}{n}\right)

In practice:
- The contribution of a level s to the inertia of Z is all the more important as the level is rare.
- Levels that are too rare are therefore avoided (by pre-processing, for instance).

- This also gives:

I(Z) = \sum_{j=1}^{p} (m_j - 1)

In practice:
- The contribution of a variable j to the inertia of Z is all the more important as its number of levels m_j is high.
- Variables with very different numbers of levels are therefore avoided (by pre-processing, for instance).

- This finally gives:

I(Z) = m - p

Example of the dogs:

# number of variables
ncol(data)
## [1] 6
# number of levels
ncol(K)
## [1] 16

I(Z) = 16 - 6 = 10
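A sketch checking this total inertia numerically, assuming Z, ns and n from the chunks above:

ms <- n/ns                                     # column weights m_s = n/n_s
popvar <- apply(Z, 2, function(z) mean(z^2))   # column variances with weights 1/n
sum(ms * popvar)                               # total inertia, equal to m - p = 10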

The correlation ratio

The link between a numerical variable y and a categorical variable x is often measured by:

\eta^2(y|x) = \frac{var(\bar{y}_x)}{var(y)} = \frac{\sum_{s=1}^{m} \frac{n_s}{n} (\bar{y}_s - \bar{y})^2}{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2}

where m is the number of levels of x and \bar{y}_s is the mean value of y computed over the individuals having the level s.

- This criterion is often named the correlation ratio.
- It takes its values in [0, 1].
- It measures the proportion of the variance of the numerical variable y explained by the categorical variable x.

In which situation is this criterion equal to 0? Equal to 1?

Example: the iris data

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
[first rows of the iris data: setosa, versicolor, virginica]

Correlation ratios between the variable Species and the 4 numerical variables:

eta2 <- function(x, gpe) {
  moyennes <- tapply(x, gpe, mean)      # group means
  effectifs <- tapply(x, gpe, length)   # group sizes
  varinter <- (sum(effectifs * (moyennes - mean(x))^2))  # between-group sum of squares
  vartot <- (var(x) * (length(x) - 1))                   # total sum of squares
  res <- varinter/vartot
  return(res)
}
apply(iris[,-5],2,function(x){eta2(x,iris$Species)})
## Sepal.Length Sepal.Width Petal.Length Petal.Width

The variable Species explains:
- 94% of the variance of "Petal.Length",
- 40% of the variance of "Sepal.Width".

[Boxplots of Petal.Length and of Sepal.Width by Species: setosa, versicolor, virginica]

Give an interpretation of the graphical outputs below:

res <- PCA(iris,quali.sup = 5,graph=FALSE)
plot(res,choix="ind",habillage=5, title="",label="none",invisible="quali")
plot(res,choix="var",title="",cex=1.5)

[Individuals map colored by Species, and correlation circle of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, on Dim 1 (72.96%) × Dim 2 (22.85%)]

How is this interpretation coherent with the results of the correlation ratios?

Plan

1 Basic notions
2 The MCA algorithm
3 Different implementations of MCA
4 Interpretation of the results

2 The MCA algorithm

Several algorithms exist to perform Multiple Correspondence Analysis (MCA), and MCA can be defined as:
- Correspondence Analysis (CA) applied to the Burt table (Anglo-Saxon approach) or to the disjunctive table (French approach),
- Principal Component Analysis (PCA) applied to the centered disjunctive table (the approach described in this chapter).

Because the CA method is not studied in this lecture, the MCA algorithm described hereafter is based on the general framework of PCA with metrics introduced in Section 4 of Chapter I.

The MCA algorithm

The data table to be analyzed by MCA comprises n individuals described by p categorical variables, and it is represented by the n × p categorical matrix X. Let m denote the total number of levels of the p categorical variables.

Step 1: the pre-processing step

1 Build the real matrix Z of dimension n × m as follows: each level is coded as a binary variable and the n × m disjunctive table K is constructed; Z is the centered version of K.
2 Build the diagonal matrix N of the weights of the rows of Z. The n rows are often weighted by 1/n, such that N = (1/n) I_n.
3 Build the diagonal matrix M of the weights of the columns of Z: the m columns (corresponding to the levels of the categorical variables) are weighted by n/n_s, where n_s, s = 1, ..., m, denotes the number of individuals that have the s-th level.
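A minimal sketch of this pre-processing step in R, assuming the dogs data and FactoMineR's tab.disjonctif:

library(FactoMineR)
K  <- as.matrix(tab.disjonctif(data))     # n x m disjunctive table
n  <- nrow(K) ; m <- ncol(K)
ns <- colSums(K)                          # level counts n_s
Z  <- scale(K, center=TRUE, scale=FALSE)  # centered disjunctive table
N  <- diag(1/n, n)                        # row weights: N = (1/n) I_n
M  <- diag(n/ns)                          # column weights (the metric)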

The metric

M = diag(n/n_1, ..., n/n_m)    (1)

indicates that the distance between two rows of Z is a weighted Euclidean distance, in the spirit of the χ² distance used in CA. This distance gives more importance to rare levels.

The total inertia of Z with this distance and the weights 1/n is equal to m - p.

Step 2: the factor coordinates processing step

1 The Generalized Singular Value Decomposition (GSVD) of Z with metrics N and M gives the decomposition:

Z = U Λ V^t    (2)

where:
- Λ = diag(√λ_1, ..., √λ_r) is the r × r diagonal matrix of the singular values of ZMZ^tN and Z^tNZM, and r denotes the rank of Z, which is here at most r = min(n-1, m-p);
- U is the n × r matrix of the first r eigenvectors of ZMZ^tN such that U^tNU = I_r, with I_r the identity matrix of size r;
- V is the m × r matrix of the first r eigenvectors of Z^tNZM such that V^tMV = I_r.
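One standard way to compute this GSVD (not necessarily the code used by the packages) is a plain SVD of the rescaled matrix N^{1/2} Z M^{1/2}, followed by a back-transformation; a sketch with the weights defined above:

Zs <- sqrt(1/n) * sweep(Z, 2, sqrt(n/ns), "*")  # N^{1/2} Z M^{1/2}
sv <- svd(Zs)
U  <- sqrt(n) * sv$u                   # U = N^{-1/2} u, so that U^t N U = I
V  <- sweep(sv$v, 1, sqrt(n/ns), "/")  # V = M^{-1/2} v, so that V^t M V = I
lambda <- sv$d^2                       # eigenvalues; they sum to m - p here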

2 The matrix F of dimension n × r of the factor coordinates of the individuals is defined by:

F = Z M V,    (3)

and we deduce from (2) that:

F = U Λ.    (4)

The columns f_α of F are the principal components; they have mean 0 and var(f_α) = λ_α. The columns u_α = f_α / √λ_α of U are the standardized principal components.
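Continuing the sketch, the factor coordinates can be computed and compared with FactoMineR's output; the comparison assumes the scaling discussed in the implementations section (a factor √p between the single-PCA coordinates and the CA-based ones) and allows for sign flips of the axes, since singular vectors are defined up to sign:

Fc <- Z %*% M %*% V                 # F = ZMV = U Lambda
res <- MCA(data, graph=FALSE)
head(Fc[,1:2] / sqrt(ncol(data)))   # rescaled single-PCA coordinates
head(res$ind$coord[,1:2])           # FactoMineR's coordinates (signs may differ)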

The matrix F

res <- MCA(data,graph=FALSE)
F <- res$ind$coord
F[,1:2]
[coordinates of the 27 dogs on Dim 1 and Dim 2]

Individuals plotted according to the two first PCs:

plot(res,choix="ind",invisible="var", cex=1.5,title="")
res$eig[1:2,1]

[Factor map of the 27 individuals on Dim 1 (28.90%) × Dim 2 (23.08%)]

3 The matrix A* of dimension m × r of the factor coordinates of the levels is defined by:

A* = M Z^t N U = M A,    (5)

and we deduce from (2) that:

A* = M V Λ.    (6)

Each coordinate a*_{sα} (element of A*) is the mean value of the standardized factor coordinates of the individuals that belong to level s:

a*_{sα} = \frac{1}{n_s} \sum_{i: k_{is}=1} \frac{f_{iα}}{\sqrt{λ_α}}

This relation is called the barycentric property. This property is fundamental for the interpretation of the graphical outputs of MCA.

The matrix A*

A <- res$var$coord
A[,1:2]
[coordinates of the 16 levels on Dim 1 and Dim 2]

Plot of the levels according to their factor coordinates on dimensions 1-2:

plot(res,choix="ind",invisible="ind", cex=1.5,title="")

[Factor map of the 16 levels on Dim 1 (28.90%) × Dim 2 (23.08%)]

The coordinates of the level W++ are the mean of the standardized coordinates of the dogs that belong to W++:

rownames(data)[which(data$Weight=="W++")]
## [1] "BullMastiff" "GermanMastiff" "Mastiff" "SaintBernard" "Newfoundland"
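A sketch checking this barycentric property numerically for the level W++, assuming res from the MCA call above:

idx <- which(data$Weight == "W++")
colMeans(res$ind$coord[idx,1:2]) / sqrt(res$eig[1:2,1])  # mean standardized coordinates
res$var$coord["W++",1:2]                                 # coordinates of the level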

Is it possible to plot both individuals and levels on the same map?

It is possible to plot the levels at the barycenter of the individuals by using the barycentric property:

a*_{sα} = \frac{1}{n_s} \sum_{i: k_{is}=1} \frac{f_{iα}}{\sqrt{λ_α}}

In that case two dimensions are chosen and:
- the individuals are plotted according to their standardized principal components f_α/√λ_α,
- the levels are plotted according to their factor coordinate vectors a*_α.

Example of the dogs data:

[Map "Levels at the barycenter of the individuals": the 27 dogs and the 16 levels plotted on the first and second standardized PCs]

For instance, the level W++ is plotted at the barycenter of the dogs that belong to W++:

rownames(data)[which(data$Weight=="W++")]
## [1] "BullMastiff" "GermanMastiff" "Mastiff" "SaintBernard" "Newfoundland"

However, this simultaneous representation of the levels at the barycenter of the individuals is not the standard output of the software implementing MCA, where the so-called quasi-barycentric property is usually used. The quasi-barycentric property is simply the barycentric property written as follows:

a*_{sα} = \frac{1}{\sqrt{λ_α}} \left( \frac{1}{n_s} \sum_{i: k_{is}=1} f_{iα} \right)

This reads: each coordinate a*_{sα} is the mean value of the factor coordinates of the individuals that belong to level s, up to the multiplicative coefficient 1/√λ_α.

It is then possible to plot the levels at the quasi-barycenter of the individuals:
- the individuals are plotted according to their principal components f_α,
- the levels are plotted according to their factor coordinate vectors a*_α.

The representation of the levels at the quasi-barycenter of the individuals:
- is the simultaneous representation usually implemented in the software,
- must be interpreted as follows: the cloud of the levels is the dilation (by 1/√λ_α in each dimension) of the cloud of the gravity centers of the individuals.

Example of the dogs data:

[Map "Levels at the quasi-barycenter of the individuals": the 27 dogs and the 16 levels plotted on the first and second PCs]

For instance, the level W++ is plotted at the barycenter of the dogs that belong to W++, dilated by 1/√λ_1 on the first dimension:

res$eig[1:2,1]
apply(F[which(data$Weight=="W++"),1:2],2,mean)/sqrt(res$eig[1:2,1])

Step 3: the squared loadings processing step

The contribution c_{jα} of the variable x_j (j-th column of X) to the variance of the principal component f_α is defined by:

c_{jα} = \sum_{s ∈ M_j} \frac{n_s}{n} a*^2_{sα}    (7)

The matrix C = (c_{jα}) of dimension p × r is called the squared loadings matrix, to draw an analogy with the squared loadings in PCA.

Each element c_{jα} is equal to the correlation ratio between x_j and f_α:

c_{jα} = η²(f_α | x_j)
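A sketch checking this equality on the dogs data, assuming res from the MCA call above and the eta2 function defined in the iris example (FactoMineR stores these quantities in res$var$eta2):

f1 <- res$ind$coord[,1]
sapply(data, function(v) eta2(f1, v))  # correlation ratios with the first PC
res$var$eta2[,1]                       # the same values, as stored by FactoMineR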

The matrix C

C <- res$var$eta2
C[,1:2]
[squared loadings of the 6 variables on Dim 1 and Dim 2]

Variables plotted according to their squared loadings:

plot(res,choix="var", cex=1.5,title="")

[Map of the 6 variables (Size, Weight, Velocity, Intelligence, Affectivity, Aggressivness) on Dim 1 (28.90%) × Dim 2 (23.08%)]

Plan

1 Basic notions
2 The MCA algorithm
3 Different implementations of MCA
4 Interpretation of the results

3 Different implementations of MCA

1 Implement MCA as a CA of the Burt table: the Anglo-Saxon approach

- CA is called simple Correspondence Analysis. In French, CA is AFC (Analyse Factorielle des Correspondances).
- CA analyzes a simple contingency table obtained by crossing two categorical variables.
- CA is a two-step procedure: first a PCA of the matrix of the row-profiles of the contingency table, and then a PCA of the matrix of the column-profiles. These PCAs use specific weights on the rows and columns, and hence specific metrics.
- Applying CA to the Burt table amounts to applying a single PCA (with specific metrics) to the matrix of the row-profiles of the Burt table. Indeed, the column-profiles are identical to the row-profiles in the Burt table.

Drawback: this algorithm gives the results (factor coordinates) for the levels but not for the individuals.

Implemented in the procedure CORRESP of the SAS software.

The Burt table is a symmetric table of size m × m which gathers the contingency tables of all the pairs of variables:

B = K^t K = (b_{ss'})_{m×m}

where:
- b_{ss'} = \sum_{i=1}^{n} k_{is} k_{is'} is the number of individuals having both levels s and s',
- b_{ss} = n_s is the number of individuals having level s.

Example: Burt table of the m = 16 levels

K <- tab.disjonctif(data)
B <- t(K) %*% K
print(B)
[16 × 16 Burt table of the levels S-, S+, S++, W-, W+, W++, V-, V+, V++, I-, I+, I++, Af-, Af+, Ag-, Ag+]

2 Implement MCA as a CA of the disjunctive table: the standard approach

- The disjunctive table is used as a contingency table.
- Applying CA to the disjunctive table is then a two-step procedure: first a PCA of the matrix of the row-profiles (the individuals), and then a PCA of the matrix of the column-profiles (the levels).

Advantage: this algorithm gives directly the results (factor coordinates) for the levels and for the individuals.

Implemented in the function MCA of the R package FactoMineR.

3 Perform a PCA of the disjunctive table: the single PCA approach

- This PCA uses specific weights for the columns (the levels), and hence a specific distance between two rows (individuals).
- Compared to the standard approach:
  - the factor coordinates of the levels are the same,
  - the factor coordinates of the individuals are multiplied by √p,
  - the total inertia is multiplied by p and is equal to m - p.

Advantage: it is not necessary to know the CA method to understand this algorithm.

Implemented in the function PCAmix of the R package PCAmixdata.

Plan

1 Basic notions
2 The MCA algorithm
3 Different implementations of MCA
4 Interpretation of the results

4 Interpretation of the results

Quality of the dimension reduction

The quality of the q first principal components is measured by the proportion of the inertia that they explain.

Inertia of the data: I(Z) = I(F) = λ_1 + ... + λ_r = m - p.

Proportion of inertia explained by the α-th principal component:

\frac{λ_α}{λ_1 + ... + λ_r}

In MCA, the percentages of inertia explained by the axes are "small" by construction. Some authors have proposed corrections of the eigenvalues in MCA (Greenacre, 1993).
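A sketch retrieving these quantities, assuming res from the MCA call above (res$eig holds, for each dimension, the eigenvalue, the percentage of explained inertia and the cumulative percentage; recall that the eigenvalue scale depends on the implementation, as discussed in the previous section):

res$eig                                         # eigenvalue / percentage / cumulative
round(100 * res$eig[,1] / sum(res$eig[,1]), 2)  # proportions recomputed by hand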

Original data (p = 6 and m = 16):
[first 5 dogs on the 6 categorical variables]

Reduction to the 3 first PCs:
[first 5 dogs on Dim 1, Dim 2, Dim 3]

What is the quality of this reduction?
[table of the eigenvalues: Eigenvalue, Proportion, Cumulative for dim 1 to dim 10]

- r = 10 non-null eigenvalues, because r = min(n-1, m-p) = 10,
- the sum of the eigenvalues is m - p = 10 (total inertia),
- the cumulative column gives the percentage of the inertia explained by the 3 first PCs.

Contributions of the individuals and of the levels

- The relative contribution of an individual i to the variance of an axis α is:

\frac{1}{n} \frac{f^2_{iα}}{λ_α}

The individuals far from the center of the factor map are those who contribute the most. They can be a source of instability and can be removed or used as illustrative individuals.

- The relative contribution of a level s to the variance of an axis α is:

\frac{n_s}{n} \frac{a*^2_{sα}}{λ_α}

The levels far from the center of the factor map are not necessarily those who contribute the most.
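FactoMineR stores these contributions directly (a sketch assuming res from above; the values are given in percent):

res$ind$contrib[,1:2]                            # contributions of the individuals to axes 1-2
res$var$contrib[,1:2]                            # contributions of the levels to axes 1-2
sort(res$var$contrib[,1], decreasing=TRUE)[1:5]  # the 5 levels contributing most to axis 1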

The 5 individuals which contribute the most; the 5 levels which contribute the most:

[Two maps on Dim 1 (28.90%) × Dim 2 (23.08%): left, the 5 most contributing individuals (Mastiff, Pekingese, Chihuahua, Dalmatien, Labrador); right, the 5 most contributing levels (W+, V-, S+, S-, W-)]

Contribution of the variables

The absolute contribution of a categorical variable j to the variance of an axis α is the sum of the contributions of its levels:

\sum_{s ∈ M_j} \frac{n_s}{n} a*^2_{sα} = η²(f_α | x_j)

The correlation ratios are signless measures of link, used to plot the categorical variables on a map.

[Two maps on Dim 1 (28.90%) × Dim 2 (23.08%): left, the 6 variables plotted according to their squared loadings; right, the simultaneous map of the individuals and the levels]

Quality of the projection of the individuals and of the levels

The quality of the projection of the individuals or of the levels is measured, as in PCA, by the so-called squared cosine.

- If two individuals are well projected, their distance on the factor map is not far from their true distance, knowing that in MCA the distance between two individuals is small if they have the same levels.
- If two levels are well projected, their distance on the factor map can be interpreted using the barycentric property:
  - two levels of two different variables are close if they are owned by the same individuals,
  - two levels of a same variable are close if the two associated groups of individuals are close.
- Take care of the dispersion of the individuals associated with each level before interpreting the proximity between two levels.
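A sketch retrieving the squared cosines from FactoMineR, assuming res from above:

res$ind$cos2[,1:2]   # squared cosines of the individuals on axes 1-2
res$var$cos2[,1:2]   # squared cosines of the levels on axes 1-2
sort(rowSums(res$ind$cos2[,1:2]), decreasing=TRUE)[1:10]  # 10 best projected individuals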

The 10 individuals best projected; the levels having a cos² > 0.5:

[Two maps on Dim 1 (28.90%) × Dim 2 (23.08%): left, the 10 best projected individuals; right, the levels with squared cosine greater than 0.5]
