Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions


Thomas Giraud, Simon Chabot
October 12, 2013

Contents

1 Discriminant analysis
    Main idea
    The Bayes rule
    Maximum likelihood estimation
    Example
    Hypothesis
        Linear vs quadratic discriminant analysis
2 Clustering methods
    Goal
    Kmeans
        Algorithm
        Example
        Iris classification
    Minibatch-kmeans
        Description
        Algorithm
    Hierarchical clustering
    Other methods
        K-nearest neighbors

Introduction

This report introduces some concepts of data classification. Classification algorithms fall mainly into two families: supervised classification and unsupervised classification. Both have the same goal, which is to sort data into different classes according to their features, but each family takes a different approach.

The supervised family needs to be trained: an algorithm of this family must first be trained on the same kind of data it will later have to classify. For instance, if such an algorithm has to classify crabs into two classes (males and females), we have to give it many male and female crabs and, for each crab, tell the algorithm whether the one it is looking at is male or female. Once trained, the algorithm is able to perform the classification on its own.

The unsupervised family, on the other hand, performs the classification directly, with no training at all. The algorithm is given a dataset and tries to recognize which subject belongs to which class by gathering the subjects that share similar properties. Usually, the only information the algorithm is given is the number of classes it has to find.

This report introduces some well-known methods of each family.

Chapter 1 Discriminant analysis

1.1 Main idea

In this chapter, we introduce the basics of decision theory and object classification through the Bayes rule. The goal of discriminant analysis is to classify a given population into different known classes. To achieve this goal, the classifier needs to be trained; let $L$ be the train set. The population of $L$ is made of $n$ subjects, where each subject is described by $p$ features and by the class it belongs to. The whole population can therefore be seen as the following matrix:

\[
\begin{pmatrix}
x_{11} & \cdots & x_{1p} & y_1 \\
x_{21} & \cdots & x_{2p} & y_2 \\
\vdots &        & \vdots & \vdots \\
x_{n1} & \cdots & x_{np} & y_n
\end{pmatrix}
\tag{1.1}
\]

Mathematically, we have to build from $L$ a function $g$ such that $x \mapsto g(x) = y$. Now, let $T$ be the test set: $T$ looks like $L$, but the classes (i.e. the last column) are unknown and have to be determined. The assumption is that if $L$ is representative of $T$, then applying $g$ to $T$ should associate each subject with its correct class. Let's see how such a function can be defined.

1.2 The Bayes rule

Considering $L$, we have $n$ subjects divided up into $c$ classes. We assume that each class $k$ follows a normal distribution $\mathcal{N}(\mu_k, \Sigma_k)$ and has an a priori probability $\pi_k$. Thus, we have:

\[
f_k(x) = \frac{1}{(2\pi)^{p/2} \sqrt{\det \Sigma_k}} \exp\left( -\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right)
\tag{1.2}
\]

where $p$ is the dimension of the vector $x$. The function $g$ we are going to build has to maximize the a posteriori probability that $x$ belongs to class $k$. That is to say, we are looking for:

\[
k^* = \arg\max_{k} \; \pi_k f_k(x)
\tag{1.3}
\]

Thus, we define $g$ as:

\[
g : X \to Y = \{w_1, \ldots, w_c\}, \qquad x \mapsto w_{k^*}, \quad \text{where } k^* \text{ is defined by (1.3)}
\tag{1.4}
\]

Equation (1.4) is known as the Bayes rule.

1.3 Maximum likelihood estimation

Given $c$ normal distributions, we are now able to classify a subject $x$ into the class that maximizes the a posteriori probability. The problem is that we do not know those $c$ distributions. We assume they are normal, but we do not know their means, their covariances or their a priori probabilities. We therefore have to estimate these parameters, using maximum likelihood estimation (often abbreviated MLE). With this estimator and the train set $L$, we can get close to the true theoretical values. We have:

\[
\hat{\pi}_k = \frac{n_k}{n}
\tag{1.5a}
\]

\[
\hat{\mu}_k = \frac{1}{n_k} \sum_{i=1}^{n} t_{ik} \, x_i
\tag{1.5b}
\]

\[
\hat{\Sigma}_k = \frac{1}{n_k} \sum_{i=1}^{n} t_{ik} \, (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top
\tag{1.5c}
\]

where $n$ is the number of subjects in the whole population, $n_k$ is the number of subjects belonging to the class $w_k$, and $t_{ik}$ equals 1 if $y_i = w_k$ and 0 otherwise, for $k \in \{1, \ldots, c\}$.
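
The estimators (1.5a) to (1.5c) and the Bayes rule (1.4) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code used for this report: the function names and the toy data are made up, it assumes Python 3, and it is restricted to $p = 2$ features so that the determinant and the inverse of $\hat{\Sigma}_k$ can be written in closed form. It works with the logarithm of $\pi_k f_k(x)$, which has the same maximizer.

```python
import math

def estimate_parameters(X, y):
    """Return {class: (pi_k, mu_k, sigma_k)}, estimated as in (1.5a)-(1.5c)."""
    n, p = len(X), len(X[0])
    params = {}
    for k in sorted(set(y)):
        # Rows with t_ik = 1, i.e. the subjects of class k.
        rows = [x for x, label in zip(X, y) if label == k]
        n_k = len(rows)
        pi_k = n_k / n                                              # (1.5a)
        mu_k = [sum(r[j] for r in rows) / n_k for j in range(p)]    # (1.5b)
        sigma_k = [[sum((r[a] - mu_k[a]) * (r[b] - mu_k[b]) for r in rows) / n_k
                    for b in range(p)] for a in range(p)]           # (1.5c)
        params[k] = (pi_k, mu_k, sigma_k)
    return params

def log_posterior(x, pi_k, mu_k, sigma_k):
    """log(pi_k * f_k(x)) with f_k from (1.2), for p = 2 only (assumption)."""
    (a, b), (c, d) = sigma_k
    det = a * d - b * c
    dx, dy = x[0] - mu_k[0], x[1] - mu_k[1]
    # (x - mu)^T Sigma^{-1} (x - mu), using the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.log(pi_k) - math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def classify(x, params):
    """Bayes rule (1.4): pick the class maximizing pi_k * f_k(x)."""
    return max(params, key=lambda k: log_posterior(x, *params[k]))

# Toy train set (made-up values): two features, two classes.
X = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.8), (2.2, 2.4),
     (6.0, 7.0), (7.0, 6.5), (6.5, 8.0), (7.5, 7.2)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
params = estimate_parameters(X, y)
predicted = [classify(x, params) for x in [(1.8, 1.9), (6.8, 7.1)]]
```

For more than two features, the same structure applies, but the determinant and the inverse would have to come from a linear algebra routine.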

1.4 Example

Before continuing with discriminant analysis, let's look at a small example. We want to write a programme that uses discriminant analysis to determine the origin of different wines according to their chemical components. We found in a database a record of 178 different wines, described by 12 features and divided up into three classes. The idea is to divide this data into two sets:

- one train set L, used to train the classifier;
- one test set T, used to check how good (or bad) our classifier is.

For each subject of T, we will ask our classifier to pick the best class and compare it to the original value. By doing this for the whole set, we will be able to determine a percentage of success. For each subject, we have the following features: Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, Proline, and the class.

We will show, step by step, how the classifier is trained and then used to classify. First of all, we need to create the train and the test sets.

    >>> import data                       # used to load the data
    >>> import pylab as p                 # used to perform some mathematical stuff
    >>> w = data.load('wine')             # load the raw wine data
    >>> print(w)                          # here, we find our population matrix
    [[ , ]
     [ , ]
     [ , ]
     ...,
     [ , ]
     [ , ]
     [ , ]]
    >>> p.shuffle(w)                      # we shuffle the rows
    >>> train = w[:len(w) // 2]           # the first half is used as train set
    >>> test = w[len(w) // 2:, :-1]       # the second half is used as test set; the last column is removed
    >>> target = w[len(w) // 2:, -1]      # keep the real classes in memory

So, at this stage, we have train, a matrix representing L; test, representing T; and target, the real classes of the subjects of T, used to evaluate our classifier. We wrote a function, estimates_parameters, which estimates, for each class, the parameters of the normal distribution using the MLE on the train set. (Unfortunately, the output cannot be shown in this report, because it consists of three matrices and three 1 × 13 vectors.) The estimated parameters are stored in a variable params.

Now, let's use our classifier on the test set. We wrote a function called postprobalitiy: given one distribution (i.e. $\hat{\mu}_k$, $\hat{\Sigma}_k$ and $\hat{\pi}_k$), this function gives the a posteriori probability that x belongs to class k. (Actually, it gives the logarithm of this probability, to handle small values in practice. This is equivalent, since the logarithm does not change which class obtains the maximum.) Let's classify one subject using this function:

    >>> params = estimates_parameters(train)    # we train our classifier
    >>> subject = test[0, :]                    # pick the first wine to classify
    >>> postprobalitiy(subject, params[0])      # belonging to the class described by params[0]
    >>> postprobalitiy(subject, params[1])      # belonging to the class described by params[1]
    >>> postprobalitiy(subject, params[2])      # belonging to the class described by params[2]

Using the Bayes rule (1.4), the subject should be associated with class 1. Let's check whether this is right, using the target array (which contains the actual classes).

    >>> target[0]                               # the real class of subject
    1

Yes, it is! To perform this operation on the whole test set, we wrote a function called predict, which returns an array listing all the assigned classes.

    >>> predictions = predict(test, params)
    >>> print(predictions)                      # what has been predicted?
    array([1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 0, 2, 2,
           0, 0, 1, 0, 0, 0, 1, 2, 2, 0, 2, 2, 2, 0, 0,
           0, 1, 1, 1, 0, 1, 0, 2, 1, 0, 2, 1, 1, 2, 1,
           2, 0, 1, 1, 2, 2, 2, 1, 1, 1, 1, 0, 1, 1, 1,
           1, 0, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1,
           0, 0, 2, 0, 0, 1, 1, 1, 1, 0, 2, 0, 1, 2])
    >>> print(target)                           # what is wanted?
    array([1, 0, 0, 1, 2, 1, 2, 1, 1, 1, 2, 2, 0, 2, 2,
           0, 0, 1, 0, 0, 0, 1, 2, 2, 0, 2, 2, 2, 0, 0,
           0, 1, 1, 2, 0, 1, 0, 2, 1, 0, 2, 1, 1, 2, 1,
           2, 0, 1, 1, 2, 2, 2, 1, 2, 1, 1, 0, 1, 1, 0,
           1, 0, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 0, 0, 1,
           0, 0, 2, 0, 0, 1, 1, 1, 1, 0, 2, 0, 1, 2])
    >>> print(predictions == target)            # whether each prediction is correct or not
    array([ True, False,  True,  True, False,  True, False,
            True,  True,  True, False,  True,  True,  True,
            True,  True,  True,  True,  True,  True,  True,
            True,  True,  True,  True,  True,  True,  True,
            True,  True,  True,  True,  True, False,  True,
            True,  True,  True,  True,  True,  True,  True,
            True,  True,  True,  True,  True,  True,  True,
            True,  True,  True,  True, False,  True,  True,
            True,  True,  True, False,  True,  True,  True,
            True,  True,  True,  True,  True,  True,  True,
            True,  True, False, False,  True,  True,  True,
            True,  True,  True,  True,  True,  True,  True,
            True,  True,  True,  True,  True], dtype=bool)

Using the target and the predictions arrays, we can rate our classifier by counting the number of correct predictions. In this particular case, the classifier is 89.89% correct (80 correct out of 89).

1.5 Hypothesis

There are several common variants of discriminant analysis. The one we have introduced so far is called quadratic discriminant analysis. As a matter of fact, it can be shown quite easily that its decision borders are quadric surfaces, such as hyperspheres, hyperboloids, etc. Depending on the hypotheses we make about the data, discriminant analysis comes in different variants.

Linear discriminant analysis. In this variant, we suppose that all the classes have the same covariance matrix, i.e. $\forall k,\ \Sigma_k = \Sigma$. It can be shown that the decision borders of such a variant are linear manifolds. Up to a correction for the a priori probabilities, this variant assigns a subject x to the nearest class (represented by its mean $\mu_k$) according to the Mahalanobis distance. The Mahalanobis distance is defined as $d(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)}$ and, unlike the Euclidean one, it takes into account the dispersion of the distribution.

Naive discriminant analysis. If we suppose that the p features describing the dataset are statistically independent, we are performing a naive discriminant analysis. In practice, this means that the covariance matrices are diagonal: using the MLE, only the diagonal of $\hat{\Sigma}_k$ is kept.

Euclidean discriminant analysis. This is the simplest variant. We suppose that all the covariance matrices are equal to the same scalar multiple of the identity and that the c classes have the same a priori probability, that is:

\[
\Sigma_k = \sigma^2 I_p \quad \text{and} \quad \pi_k = \frac{1}{c}, \qquad \forall k \in \{1, \ldots, c\}
\]

In this particular case, the classes are separated by hyperplanes.
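
The three hypotheses above mostly change which covariance matrix is fed to the Bayes rule. The sketch below illustrates this on covariance matrices stored as nested lists. The function names are hypothetical, and weighting the class covariances by their sizes to obtain the shared matrix of the linear variant is one common choice, assumed here rather than taken from the report. For the Euclidean variant, equal priors and $\Sigma_k = \sigma^2 I_p$ reduce the rule (1.4) to choosing the nearest mean.

```python
def naive_covariance(sigma_k):
    """Naive variant: keep only the diagonal of the estimated covariance."""
    p = len(sigma_k)
    return [[sigma_k[a][b] if a == b else 0.0 for b in range(p)]
            for a in range(p)]

def pooled_covariance(sigmas, counts):
    """Linear variant: one shared covariance, here the n_k-weighted average
    of the class covariances (an assumed, common choice)."""
    n = sum(counts)
    p = len(sigmas[0])
    return [[sum(n_k * s[a][b] for s, n_k in zip(sigmas, counts)) / n
             for b in range(p)] for a in range(p)]

def euclidean_classify(x, means):
    """Euclidean variant: assign x to the class whose mean is nearest."""
    return min(means,
               key=lambda k: sum((xi - mi) ** 2 for xi, mi in zip(x, means[k])))

# Made-up values, for illustration only.
sigma_a = [[2.0, 1.0], [1.0, 3.0]]
sigma_b = [[4.0, 0.0], [0.0, 1.0]]
diagonal = naive_covariance(sigma_a)
shared = pooled_covariance([sigma_a, sigma_b], counts=[10, 30])
label = euclidean_classify((0.5, 0.2), {"w1": (0.0, 0.0), "w2": (5.0, 5.0)})
```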

Linear vs quadratic discriminant analysis

Let's finish this chapter with a small example showing the difference between linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). We have a dataset of 200 crabs, made of 100 males and 100 females, and we want to discriminate on the sex of the animal. Each crab is described by five features: the frontal lobe size, the rear width, the carapace length, the carapace width and the body depth. Because we want to visualize the discrimination on a plane, we first perform a principal component analysis in order to reduce the dimension while keeping as much information as possible. Once this is done, we perform an LDA and a QDA; the results are shown in Figure 1.1. Each color represents a sex. The big dots are the crabs assigned to the right class, and the small ones are those assigned to the wrong one. The big black dots are the means, the ellipsoids are the confidence regions and the black lines are the decision borders. We can see why the two methods are called linear and quadratic discriminant analysis.

Figure 1.1: Linear vs. Quadratic discriminant analysis
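
To see where the two names come from, one can take the logarithm of $\pi_k f_k(x)$ from (1.3) and drop the terms that do not depend on $k$. The short derivation below is a sketch in the notation of (1.2); the symbol $\delta_k$ for the resulting score is introduced here for convenience and does not appear elsewhere in this report.

```latex
\begin{align*}
\delta_k^{\mathrm{QDA}}(x)
  &= \log \pi_k - \tfrac{1}{2} \log \det \Sigma_k
     - \tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k), \\
\delta_k^{\mathrm{LDA}}(x)
  &= \log \pi_k + x^\top \Sigma^{-1} \mu_k
     - \tfrac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k
  \qquad (\Sigma_k = \Sigma \text{ for all } k).
\end{align*}
```

In the first line, the constant $-\tfrac{p}{2} \log 2\pi$ has been dropped. In the second line, the terms $-\tfrac{1}{2} \log \det \Sigma$ and $-\tfrac{1}{2} x^\top \Sigma^{-1} x$ are the same for every class and have been dropped as well. The border between two classes $k$ and $l$ is the set of points where $\delta_k(x) = \delta_l(x)$: this equation is quadratic in $x$ in the first case and linear in $x$ in the second.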

Chapter 2 Clustering methods

2.1 Goal

A cluster is a set of similar objects; what counts as similar may vary with the data. The goal is to partition a given data set into a certain number of clusters.

2.2 Kmeans

Algorithm

K-means clustering is a simple unsupervised algorithm that solves the clustering problem. Its main characteristic is that it is told in advance how many distinct clusters to generate. The main idea is to determine the size of the clusters from the structure of the data.

The algorithm begins with k distinct centroids, where each centroid stands for a cluster. Every point of the set is assigned to its nearest centroid, according to the Euclidean distance between them. Then, each centroid is moved to the average location of all the points assigned to it. A second round then begins: the distance between each point and the updated centroids is recalculated, and a point is reassigned only if its nearest centroid is no longer the one it currently belongs to. When such a switch occurs, the centroids have to be recalculated. This procedure is repeated until the assignments stop changing, in other words until the clusters no longer move.

The procedure always terminates, since k-means always converges (to a local optimum, which is not necessarily the best possible clustering). The number of iterations needed to converge is highly dependent on the initialization of the centroids.

Moreover, the main problem with this algorithm is its complexity. Suppose we have a dataset of records that we want to divide into 300 clusters. The complexity of the k-means algorithm is $O(n \cdot k \cdot i \cdot f)$, where n is the number of records, k is the number of clusters, i is the number of iterations and f is the number of features of a record. Clearly, clustering such data will take a long time.

Finally, the goal of the k-means algorithm is to minimize an objective function, which is a squared error function:

\[
J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2
\tag{2.1}
\]

where $\| x_i^{(j)} - c_j \|^2$ is the squared Euclidean distance between the data point $x_i^{(j)}$ and the centroid $c_j$.

The following steps describe the algorithm:

1. Place k different centroids randomly.
2. Assign each point to the nearest centroid.
3. Move each centroid to the average location of the points that were assigned to it.
4. Repeat steps 2 and 3 until the centroids no longer move.

Example

As a simple illustration of the k-means algorithm, suppose that we have a data set of 200 points and we know that they can be grouped into 3 clusters. Figures 2.1 to 2.3 show the k-means process in action for this example.

Iris classification

In this section, we explain our programme, which uses the k-means algorithm to determine the clusters. It only works with two-dimensional data. In this example, we use Fisher's Iris dataset, a multivariate data set introduced by Sir Ronald Fisher (1936) as an example of discriminant analysis. It contains 150 irises divided into 3 species: Iris setosa, Iris virginica and Iris versicolor. Each flower is described by 4 features, so we need to reduce the dataset to 2 dimensions using a principal component analysis.

Figure 2.1: Steps 1 to 2

Figure 2.2: Steps 3 to 4

Figure 2.3: Steps 5 to 6

The purpose is to retrieve these three different groups of irises. We therefore choose three different centroids randomly among the points of the iris data set. This way, we do not have to determine the extent of the plane, and we are sure that the centroids start close to the data.

    from math import sqrt       # used to perform the square root
    from random import sample   # used to perform random sampling without replacement
    import pylab as p           # used to perform some mathematical stuff

    # points is the list of (x, y) tuples produced by the PCA (defined elsewhere)

    def dist_points(a, b):
        # assumed helper, not shown in the original listing: Euclidean distance
        return sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

    centroids = sample(points, 3)   # we choose 3 centroids randomly among points

    for i in range(6):              # we repeat the algorithm 6 times
        dict_assign = dict()        # dictionary of assignments (centroid -> set of points)
        for s in centroids:
            dict_assign[s] = set()

        for pt in points:
            # assign each point to its closest centroid
            closest_centroid = min([(dist_points(pt, c), c) for c in centroids])[1]
            dict_assign[closest_centroid].add(pt)

        centroids = set()
        for pts in dict_assign.values():
            if not pts:
                continue
            xm = sum(x for (x, y) in pts) / len(pts)   # average abscissa of the points assigned to the centroid
            ym = sum(y for (x, y) in pts) / len(pts)   # average ordinate of the points assigned to the centroid
            centroids.add((xm, ym))                    # move the centroid to the new average coordinates

The results are presented in Figure 2.4. We can see that the k-means clustering corresponds to the real classification, except for some points where the algorithm has difficulty telling its clusters apart. The reason is that the actual classes overlap in our PCA representation.
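
The listing above repeats the two steps a fixed number of times. As a complement, the sketch below stops as soon as the centroids no longer move (step 4 of the algorithm) and evaluates the objective function (2.1). It is an illustrative variant, not the programme used for Figure 2.4: the function names and the toy points are made up, and Python 3 is assumed. The starting centroids can be passed explicitly, which makes a run reproducible and makes it easy to observe how much the result depends on the initialization.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(points, centroids):
    """Step 2: group the points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for pt in points:
        nearest = min(range(len(centroids)),
                      key=lambda j: squared_distance(pt, centroids[j]))
        clusters[nearest].append(pt)
    return clusters

def kmeans(points, k, initial=None, max_iter=100):
    # Step 1: start from k data points, chosen at random unless given.
    centroids = list(initial) if initial is not None else random.sample(points, k)
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # Step 3: move each centroid to the mean of its points;
        # an empty cluster keeps its previous centroid.
        moved = [tuple(sum(c) / len(members) for c in zip(*members)) if members else old
                 for old, members in zip(centroids, clusters)]
        if moved == centroids:      # step 4: nothing moved, so we stop
            break
        centroids = moved
    return centroids, assign(points, centroids)

def objective(centroids, clusters):
    """J from (2.1): sum of squared distances of points to their centroid."""
    return sum(squared_distance(pt, c)
               for c, members in zip(centroids, clusters) for pt in members)

# Toy data (made-up): two well-separated groups of four points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, clusters = kmeans(points, k=2, initial=[(0, 0), (10, 10)])
J = objective(centroids, clusters)
```

Comparing the value of J obtained from several different starting centroids is a simple way to detect a run that has stopped in a poor local optimum.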

2.3 Minibatch-kmeans

Principle

The idea of this algorithm, derived from k-means, is to work on small random subsets of the data (mini-batches) instead of the whole dataset. For instance, each iteration trains on only a sample of the records of the dataset. It therefore takes less time than the original algorithm. Even though it uses mini-batches, the algorithm aims to produce clusters that represent the whole dataset well, and its results are only slightly worse than those of the previous algorithm.

All algorithms tend to have the issue of parameter selection. Mini-batch k-means still needs the number of clusters; in addition, we have to choose how many iterations to perform and the size of the mini-batches.

Algorithm

In the first step, the algorithm takes S samples, randomly chosen from the dataset, which form a mini-batch. Each sample is then assigned to its nearest centroid. The algorithm then updates each centroid by taking the average of the new samples and of the previous samples assigned to that centroid. This update, which acts like a gradient descent step, decreases the rate of change of a centroid over time, and it is significantly faster than a normal k-means update. These steps are repeated until convergence or until the maximum number of iterations is reached.

2.4 Hierarchical clustering

An alternative method of clustering is hierarchical clustering. This type of algorithm gives a tree as a result: the principle is to build a hierarchy of clusters. There are two different approaches:

- Agglomerative hierarchical clustering is a bottom-up approach, which starts with every object in its own cluster. In each successive iteration, it merges the closest pair of clusters (according to some common criterion) until all of the data is in one cluster.
- Divisive clustering is a top-down approach. This variant starts at the top with all objects in one cluster, and this cluster is split using a flat clustering algorithm (one in which objects within a cluster should be as similar as possible, and objects in one cluster should be as dissimilar as possible from objects in other clusters). These splits are performed until each object is in its own cluster.

The similarity between every pair of clusters must be recalculated in each iteration. That is why this algorithm runs slowly on large datasets.
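
The agglomerative approach can be sketched as follows. This is a minimal illustration rather than an efficient implementation: the function names and the toy points are made up, and the merging criterion is assumed to be single linkage (the distance between two clusters is the distance between their two closest points), which is only one of the common criteria.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def single_linkage(cluster_a, cluster_b):
    """Distance between two clusters: that of their closest pair of points."""
    return min(squared_distance(p, q) for p in cluster_a for q in cluster_b)

def agglomerative(points, n_clusters=1):
    clusters = [[pt] for pt in points]   # every object starts in its own cluster
    merges = []                          # merge history: the tree, bottom-up
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Every pairwise cluster distance is recomputed at each iteration,
        # which is what makes the method slow on large datasets.
        i, j = min(pairs,
                   key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges

# Toy data (made-up): two close pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerative(points, n_clusters=2)
```

Calling the function with n_clusters=1 performs all the merges, and the list merges then describes the whole tree from the leaves up to the root.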

Figure 2.4: Iris classification

Conclusion
