University of Ghana
Department of Computer Engineering
School of Engineering Sciences
College of Basic and Applied Sciences

CPEN 405: Artificial Intelligence
Lab 7
November 15, 2017

Unsupervised Learning

Contents
1 Introduction
2 Background
    2.1 Machine Learning
    2.2 Supervised Learning
    2.3 Unsupervised Learning: Clustering
    2.4 Reinforcement Learning
3 K-Means Clustering
    3.1 Introduction
    3.2 The algorithm
    3.3 Implementation
        3.3.1 The Data class
        3.3.2 The KMeans class
        3.3.3 The cluster function
4 Challenge: Identifying the Black Pod disease
    4.1 Requirements
5 Tasks

1 Introduction

We would now like to see Artificial Intelligence in a different dimension: one where an agent program can recognize patterns and learn. Machine Learning is now at the forefront of Artificial Intelligence, to the extent that it is applied in Natural Language Processing and Automated Reasoning. In this lab we shall see how to develop a program that categorizes data into groups on its own using the K-Means Clustering algorithm. We shall then apply this to the detection of diseased plants in agriculture.

2 Background

2.1 Machine Learning

Learning is the ability of an agent to improve its behavior based on observations made about the world. This could mean the following:
- The range of behaviors is expanded; the agent can do more.
- The accuracy on tasks is improved; the agent can do things better.
- The speed is improved; the agent can do things faster.

There are three main types of learning, described below.

2.2 Supervised Learning

In supervised learning, the agent is presented with example input-output pairs and learns a function that maps an input to an output, so that when it is presented with new inputs it can automatically determine the corresponding outputs.

An abstract definition of supervised learning is as follows. Assume the learner is given the following data:
- a set of input features, X_1, ..., X_n;
- a set of target features, Y_1, ..., Y_k;
- a set of training examples, where the values for the input features and the target features are given for each example; and
- a set of test examples, where only the values for the input features are given.

The aim is to predict the values of the target features for the test examples and for as-yet-unseen examples. Typically, learning is the creation of a representation that can make predictions based on descriptions of the input features of new examples.

As an example, take a spam filter.
As you mark some emails as spam and let others pass as not-spam, the filter learns a model that maps the input (an email) to the output (spam or not-spam), which it then uses to classify incoming email.

2.3 Unsupervised Learning: Clustering

In unsupervised learning, the agent is simply presented with raw inputs and learns patterns in them. Most commonly the agent groups the input into bins, a task known as clustering.[?] Thus in clustering, or unsupervised learning, the target features are not given in the training examples. The aim is to construct a natural classification that can be used to cluster the data.
For example, given several images of people's faces, an unsupervised classifier should be able to identify two groups, one male and the other female; or it should be able to identify three groups, one being toddlers, another being youth and the third being the elderly.

In hard clustering, each example is placed definitively in a class. The class is then used to predict the feature values of the example. The alternative to hard clustering is soft clustering, in which each example has a probability distribution over its class. The prediction of the values for the features of an example is the weighted average of the predictions of the classes the example is in, weighted by the probability of the example being in the class.[?]

2.4 Reinforcement Learning

In reinforcement learning the agent learns from rewards and punishments. For example, in learning how to ride a bike, you take several actions. Some of them keep your balance and enable you to move forward, while others make you fall. You learn to ride by avoiding the actions that make you fall and doing more of those that keep your balance. Again, imagine a robot that can act in a world, receiving rewards and punishments and determining from these what it should do. This is the problem of reinforcement learning.[?]
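
Returning briefly to the hard and soft clustering described in Section 2.3, the weighted-average prediction can be made concrete with a small sketch. The function names and the numbers below are made up for illustration; they do not come from the lab resources.

```python
def soft_prediction(class_probs, class_predictions):
    """Soft clustering: average the per-class predictions,
    weighting each by the probability of belonging to that class."""
    if abs(sum(class_probs) - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1")
    return sum(p * v for p, v in zip(class_probs, class_predictions))


def hard_prediction(class_probs, class_predictions):
    """Hard clustering: commit to the single most probable class
    and use that class's prediction unchanged."""
    best = max(range(len(class_probs)), key=lambda c: class_probs[c])
    return class_predictions[best]


# Hypothetical example: two classes that predict 10.0 and 20.0 for some
# feature, and an example that belongs to them with probability 0.75 / 0.25.
probs = [0.75, 0.25]
preds = [10.0, 20.0]
soft = soft_prediction(probs, preds)  # 0.75 * 10.0 + 0.25 * 20.0
hard = hard_prediction(probs, preds)  # prediction of the first class only
```

The soft prediction blends both classes, while the hard prediction ignores the less likely class entirely.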

3 K-Means Clustering

3.1 Introduction

In clustering a dataset we are concerned with putting the data into groups such that similar items are in the same group and dissimilar items are in different groups. The k-means algorithm is one of the most common clustering algorithms. Take a look at the image below.

Figure 1: Raw data in scatter diagram

Can you see 4 clusters in there? That was easy, right? What about the dataset below?

(0.7, 5.1)  (1.5, 6)    (2.1, 4.5)  (2.4, 5.5)  (3, 4.4)
(3.5, 5)    (4.5, 1.5)  (5.2, 0.7)  (5.3, 1.8)  (6.2, 1.7)
(6.7, 2.5)  (8.5, 9.2)  (9.1, 9.7)  (9.5, 8.5)

Now things are getting tougher. You may plot this data (provided in SmallRandData.csv in Lab 6 Resources) and identify the three clusters. A MATLAB script, clusterexample.m, is provided to help you with this.

Let us take a look at the clusters we obtain from Figure 1.

Figure 2: Raw data partitioned into 4 clusters

The red circles appear to be in the center of each cluster. These are known as the centroids of the clusters. Mathematically, the centroid of a cluster is simply the arithmetic mean of the data in the cluster. You should notice that any point in a cluster is closer to the centroid of that cluster than to the centroid of any other cluster. You may try openexample('stats/partitiondataintotwoclustersexample') if you have MATLAB installed.

3.2 The algorithm

Now that we have a clear picture of the result of the algorithm, let us see how it works.

The K-Means algorithm
The k-means algorithm aims to group n observations into k clusters such that each observation belongs to the cluster with the nearest mean. The value of k has to be specified before the algorithm starts.

The algorithm assumes that each feature is given on a numerical scale, and it tries to find classes that minimize the sum-of-squares error when the predicted values for each example are derived from the class to which it belongs.

A brute-force approach to clustering numeric data would be to examine every possible grouping of the source data set and then determine which of those groupings is best. Even for a dataset of 50 observations to be clustered into 3 groups, the number of possible groupings is 119,649,664,052,358,811,373,730. If you are daring enough to proceed and you can examine a billion groupings per second, it will take over 3 million years to analyze them all.

The algorithm proceeds as follows:
1. Randomly assign data items to clusters.
2. Compute the mean/centroid of each cluster.
3. Reassign each data item to the cluster of the closest centroid, i.e. the cluster that minimizes the data point-to-cluster-centroid distance.
4. If there were no reassignments, a stable assignment has been found and hence clustering is complete.
5. Else go back to step 2.

The procedure is illustrated in the diagrams below.[?]

Figure 3: k-means Problem and Cluster Initialization
Figure 4: Compute centroids and reassign clusters
Figure 5: Update centroids and update clustering until there is no change
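
The brute-force figure quoted in Section 3.2 can be checked with a few lines of code. The sketch below assumes that "possible groupings" means the number of ways to split n distinct observations into k non-empty, unlabeled groups (a Stirling number of the second kind); the function name count_groupings is chosen here for illustration.

```python
from math import comb, factorial


def count_groupings(n, k):
    """Ways to partition n distinct items into k non-empty unlabeled groups,
    using the inclusion-exclusion formula for Stirling numbers of the
    second kind: S(n, k) = (1/k!) * sum_j (-1)^j * C(k, j) * (k - j)^n."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)


groupings = count_groupings(50, 3)

# Time needed at one billion groupings per second, in years
# (taking a year as 365.25 days).
seconds = groupings / 1e9
years = seconds / (365.25 * 24 * 3600)

print(f"{groupings:,} groupings, about {years / 1e6:.1f} million years")
```

Under that assumption the count matches the number given in the text, which is why an iterative heuristic such as k-means is used instead of exhaustive search.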

3.3 Implementation

1. Begin by creating a console application.
2. Each observation of the data has two values, so create a class with two floating-point attributes.
3. Read the Longitude and Latitude from the data provided in HEALTH FACILITIES IN GHANA.csv [?] in Lab 6 resources into an array of your class's type. As a test, first use the data we saw earlier (SmallRandData.csv) with k = 3 to check that your clustering is working correctly.
4. Create a function to display some of the data that has been read.
5. Create a class KMeans to handle the clustering. Create a KMeans object, set its data to the data you read from the file and set the number of clusters, k, to a desired value.
6. Create a cluster function in your KMeans class to cluster the data and assign each data item a cluster. Invoke this function on the KMeans object to cluster the data.
7. Create another function to display the clustered data, showing the clusters clearly.
8. You may write the clusters to files and then import them into MATLAB. On clustering with k = 2 and plotting the imported data, you should see that the data has been clustered into the northern and southern halves of the country and that the southern part has more health facilities.

3.3.1 The Data class

Data
- x : Double
- y : Double
- cluster : Integer
+ getters and setters, etc.

3.3.2 The KMeans class

KMeans
- rawdata : List of Data
- centroids : List of Data
- k : Integer
+ cluster() : void
+ getters and setters, etc.

3.3.3 The cluster function

Function cluster()   /* k-means clustering algorithm */
    Data: rawdata is an array of Data from file; k is the number of clusters
    Result: rawdata with stable cluster assignments
    1. Randomly assign data items to clusters;
    while stable assignment has not been found do
        2. Compute centroid for each cluster;
        3. Reassign each data item to the cluster of the closest centroid, i.e. the cluster that minimizes the data point-to-cluster-centroid distance;
Algorithm 1: The summary of k-means clustering algorithm
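
As a starting point for steps 2 and 3 of the implementation, the Data class from Section 3.3.1 and a simple reader could be sketched as below. This is only a sketch under assumptions: the layout of the CSV files is not described in this handout, so the reader assumes the two values sit in the first two columns, and the helper name read_points and the use of -1 for "not yet assigned" are choices made here. Python's dataclass stands in for the getters and setters.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Data:
    x: float
    y: float
    cluster: int = -1  # -1 marks "not yet assigned" (a convention chosen here)


def read_points(lines):
    """Parse rows of text into Data objects.

    Assumes (hypothetically) that the first two columns hold the two values.
    Rows that do not start with two numbers, such as a header or a blank
    line, are skipped.
    """
    points = []
    for row in csv.reader(lines):
        try:
            x, y = float(row[0]), float(row[1])
        except (ValueError, IndexError):
            continue  # header, blank, or malformed row
        points.append(Data(x, y))
    return points


# In-memory stand-in for a file; with a real file, pass the open file object.
sample = io.StringIO("x,y\n0.7,5.1\n1.5,6\n\n2.1,4.5\n")
points = read_points(sample)
for p in points:
    print(p)
```

For a real file the same function can be called as read_points(open(path, newline="")), with the column indices adjusted once the actual layout is known.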

Function cluster()   /* k-means clustering algorithm */
    Data: rawdata is an array of Data from file; k is the number of clusters
    Result: rawdata with stable cluster assignments

    /* 1. a. Randomly assign data items to clusters */
    for i = 0 to rawdata.length - 1 do
        rawdata[i].cluster = i mod k;

    /* 1. b. Refine randomization with the Fisher-Yates shuffle */
    for i = 0 to rawdata.length - 1 do
        r = generate random number in range [i, rawdata.length - 1];
        swapclusters(rawdata[i], rawdata[r]);

    /* Until a stable assignment has been found */
    stableassignment = false;
    while !stableassignment do

        /* 2. Compute centroid for each cluster */
        centroids.clear();
        for i = 0 to k - 1 do
            cluster_i = rawdata.where(cluster == i);
            centroid_i = new Data;
            centroid_i.x = cluster_i.sumx() / cluster_i.length;
            centroid_i.y = cluster_i.sumy() / cluster_i.length;
            centroid_i.cluster = i;
            centroids.add(centroid_i);

        /* For each point */
        stableassignment = true;
        for i = 0 to rawdata.length - 1 do

            /* 3. a. Compute and minimize the point-to-centroid distance */
            mincluster = 0;
            mindistance = distance(rawdata[i], centroids[0]);
            for j = 1 to k - 1 do
                dist = distance(rawdata[i], centroids[j]);
                if dist < mindistance then
                    mincluster = j;
                    mindistance = dist;

            /* 3. b. Update cluster assignment if necessary */
            if mincluster ≠ rawdata[i].cluster then
                stableassignment = false;
                rawdata[i].cluster = mincluster;

For the point-to-cluster-centroid distance, the Euclidean distance may be used:

$$\text{dist} = \sqrt{(\text{rawdata}[i].x - \text{centroids}[j].x)^2 + (\text{rawdata}[i].y - \text{centroids}[j].y)^2}$$

Algorithm 2: The full k-means clustering algorithm
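
Algorithm 2 can be turned into running code fairly directly. The sketch below is one possible rendering, not a reference solution. It departs from the pseudocode in three labeled ways: points are plain (x, y) tuples with a parallel list of cluster labels instead of Data objects, an optional seed parameter is added so runs can be repeated, and a cluster that loses all its points keeps its previous centroid (Algorithm 2 as written would divide by zero in that case).

```python
import math
import random


def kmeans(points, k, seed=None):
    """Cluster 2-D points along the lines of Algorithm 2.

    Returns (clusters, centroids): clusters[i] is the label of points[i].
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)

    # 1a. Deal labels round-robin so every cluster starts non-empty.
    clusters = [i % k for i in range(len(points))]
    # 1b. Fisher-Yates shuffle of the labels.
    for i in range(len(clusters) - 1):
        r = rng.randint(i, len(clusters) - 1)  # both ends inclusive
        clusters[i], clusters[r] = clusters[r], clusters[i]

    centroids = [None] * k
    stable = False
    while not stable:
        # 2. Centroid = arithmetic mean of each cluster's members.
        for c in range(k):
            members = [p for p, label in zip(points, clusters) if label == c]
            if members:  # assumption: an emptied cluster keeps its old centroid
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
        # 3. Reassign each point to the nearest centroid (Euclidean distance).
        stable = True
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            if nearest != clusters[i]:
                clusters[i] = nearest
                stable = False
    return clusters, centroids


# The small dataset from Section 3.1, clustered with k = 3.
small = [(0.7, 5.1), (1.5, 6), (2.1, 4.5), (2.4, 5.5), (3, 4.4), (3.5, 5),
         (4.5, 1.5), (5.2, 0.7), (5.3, 1.8), (6.2, 1.7), (6.7, 2.5),
         (8.5, 9.2), (9.1, 9.7), (9.5, 8.5)]
labels, centers = kmeans(small, 3, seed=1)
for point, label in zip(small, labels):
    print(label, point)
```

Because the starting assignment is random, different seeds may settle on different stable assignments; running the function a few times and comparing the results is a common way to see this. Comparing squared distances instead of taking the square root would select the same nearest centroid.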

4 Challenge: Identifying the Black Pod disease

Black pod disease, also known as Phytophthora pod rot, is a destructive disease of cocoa pods that reduces the yield of cocoa. The symptoms are:
1. Translucent spots on the pod surface, which develop into small, dark, hard spots.
2. The entire pod becomes black and necrotic within 14 days of the initial symptoms.
3. White to yellow downy growth on the black areas.
4. Internal tissues become dry and shriveled, resulting in mummified pods.

To reduce the spread, mummified pods should be removed and destroyed.[?]

Figure 6: TOP: Healthy cocoa pods. BOTTOM: Cocoa suffering black pod. Small dark ones are mummified

Assume that, on a cocoa farm, we have a robot equipped with a computer vision system that is able to isolate images of cocoa pods from the input images. We wish to employ k-means clustering to identify the diseased ones.

4.1 Requirements

1. The program should have a user interface that allows one to choose an image of a pod to be analyzed.
2. For the selected image, read the pixel values into an RGB matrix and apply the k-means clustering algorithm to it.
3. Plot a histogram of the cluster counts versus the centroids. For each of the centroids, draw a vertical bar that is colored with the color the centroid value represents and whose height corresponds to the number of points in that cluster.
4. One may then analyze these histograms and make a judgment as to whether the pod is diseased or not. Automation of this will constitute supervised learning. You may use a library like JFreeChart or the inbuilt charts in JavaFX.
5. Extra credit: Modify your program so that, after clustering, it displays the image with the darkest regions highlighted in purple.

5 Tasks

1. Identify 3 strengths and 3 weaknesses of the k-means clustering algorithm.
2. Would you say k-means is a hard clustering algorithm or a soft clustering algorithm?
3. What is the difference between the k-means, the k-medoids and the k-medians algorithms?
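
As a hint for Requirements 3 and 5 in Section 4.1, the data behind the histogram can be prepared before any charting library is involved. The sketch below assumes that clustering has already produced one label per pixel and one RGB centroid per cluster; the function names, the sample values, and the use of R + G + B as a crude darkness measure are all assumptions made here for illustration.

```python
from collections import Counter


def cluster_histogram(labels, centroids):
    """Return (centroid_rgb, pixel_count) pairs, one per cluster,
    ordered from the largest cluster to the smallest."""
    counts = Counter(labels)
    bars = [(centroids[c], counts.get(c, 0)) for c in range(len(centroids))]
    return sorted(bars, key=lambda bar: bar[1], reverse=True)


def darkest_cluster(centroids):
    """Index of the centroid with the smallest R + G + B sum
    (a crude brightness measure, assumed here for simplicity)."""
    return min(range(len(centroids)), key=lambda c: sum(centroids[c]))


# Hypothetical result of clustering six pixels into two colors.
labels = [0, 1, 1, 0, 1, 1]
centroids = [(200.0, 180.0, 40.0), (30.0, 20.0, 10.0)]

bars = cluster_histogram(labels, centroids)
dark = darkest_cluster(centroids)
for rgb, count in bars:
    print(rgb, count)
print("darkest cluster:", dark)
```

Each pair gives the bar color and bar height for the histogram, and the pixels labeled with the darkest cluster are the ones to highlight for the extra-credit requirement.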