Statistics 202: Statistical Aspects of Data Mining


1 Statistics 202: Statistical Aspects of Data Mining
Professor Rajan Patel
Lecture 11 = Chapter 8
Agenda:
1) Reminder about final exam
2) Finish Chapter 5
3) Chapter 8

2 Class Project
The class project is due on August 15th at 11:59 PM. If you turn it in early, I will try to grade it within the next 48 hours so you have an idea of whether you should take the final. Please submit your relevance predictions on the test set as well.

3 Introduction to Data Mining by Tan, Steinbach, Kumar
Chapter 5: Classification: Alternative Techniques

4 Naive Bayes Classifier (Section 5.3.3, page 231)
The naive Bayes classifier assumes that the x attributes are conditionally independent given the class attribute y.
Thus,
$$P(Y \mid X) = \frac{P(Y)\,P(X \mid Y)}{P(X)} = \frac{P(Y)\,P(X_1 \mid Y) \cdots P(X_d \mid Y)}{P(X)}$$
Then for any x you choose the class y that gives you the largest numerator.
You estimate the $P(X_i \mid Y)$ values based on the data (see next slide).

5 How to Estimate the $P(X_i \mid Y)$
For categorical x's, just use counts (although some people modify this to fix problems with zero or small counts, see page 236).
For continuous x's, fit some distribution function. The normal distribution using the observed sample mean and observed sample standard deviation is popular.
The normal probability density function is given by
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left[(x-\mu)/\sigma\right]^2}$$
where μ is the mean and σ is the standard deviation.

6 Example of the Naive Bayes Classifier
For this data, use naive Bayes to classify an observation with X = (Refund = No, Married, Income = 120K)

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

7 Example of the Naive Bayes Classifier
For this data, use naive Bayes to classify an observation with X = (Refund = No, Married, Income = 120K)

Given Evade = yes:
P(refund=yes | yes) = 0/3
P(refund=no | yes) = 3/3
P(ms=single | yes) = 2/3
P(ms=divorced | yes) = 1/3
P(ms=married | yes) = 0/3
income has mean=90 and sd=5
P(120 | yes) = dnorm(120,90,5) = 1.2*10^-9

Given Evade = no:
P(refund=yes | no) = 3/7
P(refund=no | no) = 4/7
P(ms=single | no) = 2/7
P(ms=divorced | no) = 1/7
P(ms=married | no) = 4/7
income has mean=110 and sd=54.5
P(120 | no) = dnorm(120,110,54.5) = .0072

P(yes | X) = 3/10 * 1/P(X) * 3/3 * 0/3 * 1.2*10^-9 < P(no | X) = 7/10 * 1/P(X) * 4/7 * 4/7 * .0072
So we classify this X as NO.
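The hand calculation on this slide can be reproduced in R. This is a minimal sketch, not the course's own solution: the data frame `train` retypes the ten rows of the example (income in thousands), and the helper name `nb_numerator` is made up for illustration. It uses raw counts with no zero-count correction and fits a normal density to income with the sample mean and sample standard deviation.

```r
# Training data retyped from the example table (income in thousands)
train <- data.frame(
  refund = c("Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"),
  ms     = c("Single", "Married", "Single", "Married", "Divorced",
             "Married", "Divorced", "Single", "Married", "Single"),
  income = c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90),
  evade  = c("No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: numerator P(Y) * P(X1|Y) * P(X2|Y) * P(X3|Y)
# for the query X = (Refund = No, Married, Income = 120K)
nb_numerator <- function(cls) {
  d <- train[train$evade == cls, ]
  prior    <- nrow(d) / nrow(train)          # P(Y)
  p_refund <- mean(d$refund == "No")         # P(refund = no | Y), by counts
  p_ms     <- mean(d$ms == "Married")        # P(ms = married | Y), by counts
  p_income <- dnorm(120, mean(d$income), sd(d$income))  # normal density at 120
  prior * p_refund * p_ms * p_income
}

num_yes <- nb_numerator("Yes")
num_no  <- nb_numerator("No")
c(yes = num_yes, no = num_no)
```

The common factor 1/P(X) is omitted because it does not affect which class has the larger numerator.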

8 Introduction to Data Mining by Tan, Steinbach, Kumar
Chapter 8: Cluster Analysis

9 What is Cluster Analysis?
Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both (page 487).
It is similar to classification, only now we don't know the answer (we don't have the labels).
For this reason, clustering is often called unsupervised learning while classification is often called supervised learning (page 491, but the book says "classification" instead of "learning").
Note that there also exists semi-supervised learning, which is a combination of both and is a hot research area right now.

10 What is Cluster Analysis?
Because there is no right answer, your book characterizes clustering as an exercise in descriptive statistics rather than prediction.
Cluster analysis groups data objects based only on information found in the data that describes the objects and their similarities (page 490).
The goal is that objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups (page 490).

11 Examples of Clustering (P. 488)
Information Retrieval: search engine documents = cluster documents about the same person
Climate: Clusters = regions of similar climate
Psychology and Medicine: patterns in spatial or temporal distribution of a disease
Business: Segment customers into groups for marketing activities

12 Two Reasons for Clustering (P. 488)
Clustering for Understanding (see examples from previous slide)
Clustering for Utility
-Summarizing: different algorithms can run faster on a data set summarized by clustering
-Compression: storing cluster information is more efficient than storing the entire data (example: quantization)
-Finding Nearest Neighbors

13-16 How Many Clusters is Tricky/Subjective
[Figure: How many clusters? Panels: Two Clusters, Four Clusters, Six Clusters]

17 K-Means Clustering
K-means clustering is one of the most common/popular techniques.
Each cluster is associated with a centroid (center point):
-this is often the mean
-it is the cluster prototype
Each point is assigned to the cluster with the closest centroid.
The number of clusters, K, must be specified ahead of time.

18 K-Means Clustering
The most common version of k-means minimizes the sum of the squared distances of each point from its cluster center (page 500):
$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2$$
For a given set of cluster centers, (obviously) each point should be matched to the nearest center.
For a given cluster, the best center is the mean.
The basic algorithm is to iterate over these two relationships.
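To make the SSE criterion concrete, the sketch below computes it directly from a fitted clustering and compares it with the `tot.withinss` value that kmeans() reports. The six 2-D points are hypothetical toy data, and fixed starting centers are passed only so the result is reproducible.

```r
# Hypothetical toy data: two loose groups of points in 2-D
x <- rbind(c(1, 1), c(1.5, 2), c(1, 1.5),
           c(8, 8), c(9, 8.5), c(8.5, 9))

# Start from rows 1 and 4 instead of random rows, for reproducibility
fit <- kmeans(x, centers = x[c(1, 4), ])

# SSE = sum over clusters of squared Euclidean distances to the cluster mean
sse <- 0
for (i in 1:2) {
  members <- x[fit$cluster == i, , drop = FALSE]
  center  <- colMeans(members)                    # the best center is the mean
  sse     <- sse + sum(sweep(members, 2, center)^2)
}

sse
fit$tot.withinss   # kmeans() reports the same quantity
```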

19 K-Means Clustering Algorithms
The basic K-means algorithm is Algorithm 8.1 on page 497 of your text.
Other algorithms also exist.
In R, the function kmeans() does k-means clustering; no special package or library is needed.

20-23 In class exercise #41:
Use kmeans() in R with all the default values to find the k=2 solution for the 2-dimensional data at [URL missing]. Plot the data. Also plot the fitted cluster centers using a different color. Color the points according to their cluster membership.

Solution:
    x<-read.csv("cluster.csv",header=FALSE)
    plot(x,pch=19,xlab=expression(x[1]),ylab=expression(x[2]))
    fit<-kmeans(x,2)
    points(fit$centers,pch=19,col="blue",cex=2)
    points(x,col=fit$cluster,pch=19)

24-28 In class exercise #42:
Use kmeans() in R with all the default values to find the k=2 solution for the first two columns of the sonar training data at [URL missing]. Plot these two columns. Also plot the fitted cluster centers using a different color. Color the points according to their cluster membership.

Solution:
    data<-read.csv("sonar_train.csv",header=FALSE)
    x<-data[,1:2]
    plot(x,pch=19,xlab=expression(x[1]),ylab=expression(x[2]))
    fit<-kmeans(x,2)
    points(fit$centers,pch=19,col="blue",cex=2)
    points(x,col=fit$cluster,pch=19)

29-30 In class exercise #43:
Graphically compare the cluster memberships from the previous problem to the actual labels in the training data.

Solution:
    plot(x,pch=19,xlab=expression(x[1]),ylab=expression(x[2]))
    y<-data[,61]
    points(x,col=2+2*y,pch=19)

31-32 In class exercise #44:
For the previous exercise compute the misclassification error that would result if you used your clustering rule to classify the data.

Solution:
    agree<-sum(fit$cluster*2-3==y)/length(y)   # fraction of points whose relabeled cluster matches y
    1-agree                                    # misclassification error under this labeling
(the *2-3 part just forces both vectors to use the same labels; because k-means numbers its clusters arbitrarily, the error under the opposite labeling is agree, and the smaller of the two is the one to report)

33-34 In class exercise #45:
Repeat the previous exercise using all 60 columns.

Solution:
    x<-data[,1:60]
    fit<-kmeans(x,2)
    1-sum(fit$cluster*2-3==y)/length(y)
(cluster numbering is arbitrary, so report the smaller of this value and one minus it)

35 In class exercise #46:
Consider the one-dimensional data set given by x<-c(1,2,3,5,6,7,8) (I left out 4 on purpose). Starting with initial cluster center values of 1 and 2, carry out Algorithm 8.1 until convergence by hand for k=2 clusters.

36-37 In class exercise #47:
Repeat the previous exercise by writing a loop in R and verify that the final answer is the same.

Solution:
    x<-c(1,2,3,5,6,7,8)
    center1<-1
    center2<-2
    for (k in 2:10){
      cluster1<-x[abs(x-center1[k-1])<=abs(x-center2[k-1])]
      cluster2<-x[abs(x-center1[k-1])>abs(x-center2[k-1])]
      center1[k]<-mean(cluster1)
      center2[k]<-mean(cluster2)
    }
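The solution loop runs a fixed number of iterations. As a variation, the sketch below stops as soon as the centers stop moving, which is the convergence condition in the by-hand exercise. The function name `kmeans_1d` and the `max_iter` argument are made up for illustration, and the code assumes neither cluster ever becomes empty.

```r
# One-dimensional k-means for k = 2: alternately assign each point to the
# nearer center, then recompute each center as the mean of its points.
# Hypothetical helper; assumes neither cluster ever becomes empty.
kmeans_1d <- function(x, center1, center2, max_iter = 100) {
  for (iter in seq_len(max_iter)) {
    in1  <- abs(x - center1) <= abs(x - center2)  # ties go to cluster 1
    new1 <- mean(x[in1])
    new2 <- mean(x[!in1])
    if (new1 == center1 && new2 == center2) break  # centers stopped moving
    center1 <- new1
    center2 <- new2
  }
  list(centers = c(center1, center2), iterations = iter)
}

res <- kmeans_1d(c(1, 2, 3, 5, 6, 7, 8), center1 = 1, center2 = 2)
res$centers      # final cluster centers
res$iterations   # passes needed, counting the one that detects no change
```

Starting from centers 1 and 2, the first pass puts only the point 1 in the first cluster; the second pass moves 2 and 3 over, giving clusters {1,2,3} and {5,6,7,8}; the third pass changes nothing.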

38-39 In class exercise #48:
Verify that the kmeans function in R gives the same solution for the previous exercise when you use all of the default values.

Solution:
    kmeans(x,2)
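With the default values, kmeans() picks its starting centers at random and uses the Hartigan-Wong algorithm rather than the basic iteration. To mirror the hand calculation more closely, one option (a sketch, not part of the original solution) is to pass the starting centers 1 and 2 explicitly and request `algorithm = "Lloyd"`:

```r
x <- c(1, 2, 3, 5, 6, 7, 8)

# Explicit starting centers (1 and 2) and the basic assign/update iteration
fit <- kmeans(x, centers = c(1, 2), algorithm = "Lloyd")

fit$centers   # one row per cluster
fit$cluster   # cluster membership of each point
```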

40 Measuring Distance
Many of the techniques for clustering and classification rely on some notion of distance.
Section 2.4 in the book discusses different ways of measuring distance (dissimilarity).
For numeric variables, the distance you are used to is called Euclidean distance, but other methods exist.
For categorical variables or mixtures of categorical and numeric variables, it is tricky to compute distance.
Remember, scaling is important if scales differ.
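The point about scaling can be illustrated with a small sketch. The three rows below are hypothetical (age in years, income in dollars); `scale()` standardizes each column to mean 0 and standard deviation 1 before distances are computed.

```r
# Hypothetical data: age in years, income in dollars
d <- data.frame(age    = c(25, 30, 60),
                income = c(50000, 51000, 50500))

raw    <- dist(d)          # Euclidean distances on the original scales
scaled <- dist(scale(d))   # Euclidean distances after standardizing columns

raw      # dominated by income: rows 1 and 2 are farthest apart
scaled   # age now counts too: rows 1 and 3 are farthest apart
```

On the raw scale the 35-year age gap between rows 1 and 3 barely registers next to a 1000-dollar income gap; after standardizing, both variables contribute comparably.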

41 Euclidean Distance (P.69)
Euclidean distance is the usual method of computing distance that you are used to.
In 1 dimension it is the absolute value of the difference.
In 2 dimensions it is the Pythagorean Theorem.
In more than 2 dimensions it is just a generalization of the Pythagorean Theorem:
$$\mathrm{dist} = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
In R, the function dist() computes distances.

42-43 In class exercise #49:
Compute the distance between the points c(2,2) and c(5,7) by hand and verify that the function dist in R gives the same value.

Solution:
    x1<-c(2,2)
    x2<-c(5,7)
    data<-matrix(c(x1,x2),nrow=2,byrow=TRUE)
    dist(data)

44-45 In class exercise #50:
Compute the distance between the points c(2,2,3) and c(5,7,10) by hand and verify that the function dist in R gives the same value.

Solution:
    x1<-c(2,2,3)
    x2<-c(5,7,10)
    data<-matrix(c(x1,x2),nrow=2,byrow=TRUE)
    dist(data)
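The "by hand" half of these two exercises amounts to applying the Euclidean distance formula directly. A minimal sketch, with `euclid` as a made-up helper name, that also checks the result against dist():

```r
# Hypothetical helper: Euclidean distance straight from the formula
euclid <- function(p, q) sqrt(sum((p - q)^2))

d2 <- euclid(c(2, 2), c(5, 7))         # sqrt(3^2 + 5^2) = sqrt(34)
d3 <- euclid(c(2, 2, 3), c(5, 7, 10))  # sqrt(3^2 + 5^2 + 7^2) = sqrt(83)

# dist() on a two-row matrix returns the same single distance
d2_check <- as.numeric(dist(rbind(c(2, 2), c(5, 7))))
d3_check <- as.numeric(dist(rbind(c(2, 2, 3), c(5, 7, 10))))

c(d2, d2_check)
c(d3, d3_check)
```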

46 Cosine Similarity (P.75)
This is a common measure for computing the similarity between two documents by using their word counts for every possible word.
In practice, some normalization is used to account for different forms of the same word, differing document lengths, and different word frequencies.
The cosine similarity is defined as
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$$
where $\cdot$ indicates vector dot product and $\lVert d \rVert$ is the norm of vector d.
This can be interpreted geometrically as the cosine of the angle between the vectors.

47 Cosine Similarity Examples
a=(2,5)                    b=(2,5)                    Cosine=1
a=(0,1)                    b=(0,7)                    Cosine=1
a=(0,1)                    b=(1,0)                    Cosine=0
a=(0,1)                    b=(1,1)                    Cosine=.71
a=(3,2,0,5,0,0,0,2,0,0)    b=(1,0,0,0,0,0,0,1,0,2)    Cosine=.31
a=(0,0,0,0,0,1,1,1,1,1)    b=(1,1,1,1,1,1,1,1,1,1)    Cosine=.71
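These example values can be checked with a few lines of R. The function name `cosine` is made up for this sketch; it is a direct transcription of the definition (dot product divided by the product of the norms), not a library function.

```r
# Hypothetical helper: cosine similarity from the definition
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cos_same  <- cosine(c(2, 5), c(2, 5))   # same vector
cos_scale <- cosine(c(0, 1), c(0, 7))   # same direction, different length
cos_orth  <- cosine(c(0, 1), c(1, 0))   # perpendicular vectors
cos_45    <- cosine(c(0, 1), c(1, 1))   # 45-degree angle
cos_docs  <- cosine(c(3, 2, 0, 5, 0, 0, 0, 2, 0, 0),
                    c(1, 0, 0, 0, 0, 0, 0, 1, 0, 2))
cos_half  <- cosine(c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1), rep(1, 10))

round(c(cos_same, cos_scale, cos_orth, cos_45, cos_docs, cos_half), 2)
```

Because the norms are divided out, multiplying a vector by a positive constant leaves the cosine unchanged, which is why the second example gives 1.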
