
University of Ghana
Department of Computer Engineering
School of Engineering Sciences
College of Basic and Applied Sciences

CPEN 405: Artificial Intelligence
Lab 7, November 15, 2017

Unsupervised Learning

Contents

1 Introduction
2 Background
  2.1 Machine Learning
  2.2 Supervised Learning
  2.3 Unsupervised Learning (Clustering)
  2.4 Reinforcement Learning
3 K-Means Clustering
  3.1 Introduction
  3.2 The algorithm
  3.3 Implementation
    3.3.1 The Data class
    3.3.2 The KMeans class
    3.3.3 The cluster function
4 Challenge: Identifying the Black Pod disease
  4.1 Requirements
5 Tasks

1 Introduction

Now we would like to see Artificial Intelligence in a different dimension: one in which an agent program can recognize patterns and learn. In the field of Artificial Intelligence, Machine Learning is now at the forefront, to such an extent that it is applied in Natural Language Processing and Automated Reasoning. In this lab we shall see how to develop a program that is capable of categorizing data into groups on its own using the K-Means Clustering algorithm. We shall then apply this to the detection of diseased plants in agriculture.

2 Background

2.1 Machine Learning

Learning is the ability of an agent to improve its behavior based on observations made about the world. This could mean any of the following:

- The range of behaviors is expanded; the agent can do more.
- The accuracy on tasks is improved; the agent can do things better.
- The speed is improved; the agent can do things faster.

There are three main types of learning, described in the sections that follow.

2.2 Supervised Learning

In supervised learning, the agent is presented with example input-output pairs and learns a function that maps an input to an output, so that when it is presented with new inputs, it can automatically determine the corresponding outputs. An abstract definition of supervised learning is as follows. Assume the learner is given:

- a set of input features, X_1, ..., X_n;
- a set of target features, Y_1, ..., Y_k;
- a set of training examples, where the values of both the input features and the target features are given for each example; and
- a set of test examples, where only the values of the input features are given.

The aim is to predict the values of the target features for the test examples and for as-yet-unseen examples. Typically, learning is the creation of a representation that can make predictions based on descriptions of the input features of new examples. As an example, take a spam filter. As you identify and mark some emails as spam and allow others to pass as not-spam, the filter learns a model from the input (emails) to the output (spam or not-spam) with which it can classify each incoming email.

2.3 Unsupervised Learning (Clustering)

In unsupervised learning, the agent is presented with raw inputs only and learns patterns in them. Most commonly, the agent classifies the inputs into bins; this is clustering. Thus in clustering, or unsupervised learning, the target features are not given in the training examples. The aim is to construct a natural classification that can be used to cluster the data.

Given several images of the faces of people, an unsupervised classifier should be able to identify two groups, one being male and the other female; or it should be able to identify three groups, one being toddlers, another the youth, and the third the elderly.

In hard clustering, each example is placed definitively in a class. The class is then used to predict the feature values of the example. The alternative to hard clustering is soft clustering, in which each example has a probability distribution over its classes. The prediction of the feature values of an example is then the weighted average of the predictions of the classes the example is in, weighted by the probability of the example being in each class.

2.4 Reinforcement Learning

In reinforcement learning the agent learns from rewards and punishments. For example, in learning how to ride a bike, you take several actions: some keep your balance and move you forward, while others make you fall. You learn to ride by avoiding the actions that make you fall and doing more of those that keep your balance. Again, imagine a robot that can act in a world, receiving rewards and punishments and determining from these what it should do. This is the problem of reinforcement learning.

3 K-Means Clustering

3.1 Introduction

In clustering a dataset we are concerned with putting the data into groups such that similar items are in the same group and, of course, dissimilar items are in different groups. The k-means algorithm is one of the most common clustering algorithms. Take a look at the image below.

Figure 1: Raw data in a scatter diagram

Can you see 4 clusters in there? That was easy, right? What about the dataset below?

(0.7, 5.1), (1.5, 6.0), (2.1, 4.5), (2.4, 5.5), (3.0, 4.4), (3.5, 5.0), (4.5, 1.5), (5.2, 0.7), (5.3, 1.8), (6.2, 1.7), (6.7, 2.5), (8.5, 9.2), (9.1, 9.7), (9.5, 8.5)

Now things are getting tougher. You may plot this data (provided in SmallRandData.csv in the Lab 6 Resources) and identify the three clusters. A MATLAB script, clusterexample.m, is provided to help you with this. The same points are given as a Java array below for use later in this lab.
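If you want the same points in code, here they are as a Java array (Java is assumed for the implementation in Section 3.3; the variable name is only a suggestion, not part of the lab files):

    // The fourteen points from SmallRandData.csv as {x, y} pairs.
    double[][] smallRandData = {
        {0.7, 5.1}, {1.5, 6.0}, {2.1, 4.5}, {2.4, 5.5}, {3.0, 4.4}, {3.5, 5.0},
        {4.5, 1.5}, {5.2, 0.7}, {5.3, 1.8}, {6.2, 1.7}, {6.7, 2.5},
        {8.5, 9.2}, {9.1, 9.7}, {9.5, 8.5}
    };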

Let us take a look at the clusters we obtain from Figure 1.

Figure 2: Raw data partitioned into 4 clusters

The red circles appear at the center of each cluster. These are known as the centroids of the clusters. Mathematically, the centroid of a cluster is simply the arithmetic mean of the data in the cluster. You should notice that any point in a cluster is closer to the centroid of that cluster than to the centroid of any other cluster. You may try openexample('stats/partitiondataintotwoclustersexample') if you have MATLAB installed.

3.2 The algorithm

Now that we have a clear picture of the result of the algorithm, let us see how it works. The k-means algorithm aims to group n observations into k clusters such that each observation belongs to the cluster with the nearest mean. The value of k has to be specified before the algorithm starts. The algorithm assumes that each feature is given on a numerical scale, and it tries to find clusters that minimize the sum-of-squares error when the predicted value for each example is the centroid of the cluster to which it belongs.

A brute-force approach to clustering numeric data would be to examine all possible groupings of the source dataset and then determine which of those groupings is best. Even for a dataset of only 50 observations to be clustered into 3 groups, the number of possible groupings is 119,649,664,052,358,811,373,730. If you are daring enough to proceed and can examine a billion groupings per second, it will take over 3 million years to analyze all of them. Where that count comes from is sketched below.
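The count above is the Stirling number of the second kind, S(n, k): the number of ways to partition n items into k non-empty groups. It is given by the standard counting identity

    S(n, k) = \frac{1}{k!} \sum_{j=0}^{k} (-1)^j \binom{k}{j} (k - j)^n

so that

    S(50, 3) = \frac{3^{50} - 3 \cdot 2^{50} + 3}{6} = 119{,}649{,}664{,}052{,}358{,}811{,}373{,}730 \approx 1.2 \times 10^{23}.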

The algorithm proceeds as follows:

1. Randomly assign the data items to clusters.
2. Compute the mean (centroid) of each cluster.
3. Reassign each data item to the cluster of the closest centroid, i.e. the cluster that minimizes the point-to-cluster-centroid distance.
4. If there were no reassignments, a stable assignment has been found and clustering is complete.
5. Otherwise, go back to step 2.

The procedure is illustrated in the diagrams below.

Figure 3: k-means problem and cluster initialization

Figure 4: Compute centroids and reassign clusters

Figure 5: Update centroids and update the clustering until there is no change

3.3 Implementation

1. Begin by creating a console application.
2. Each observation of the data has two values, so create a class with two floating-point attributes.
3. Read the Longitude and Latitude from the data provided in HEALTH FACILITIES IN GHANA.csv in the Lab 6 resources into an array of your class's type. As a test, first use the data we saw earlier (SmallRandData.csv) with k = 3 to check that your clustering is working correctly.
4. Create a function to display some of the data that has been read.
5. Create a class KMeans to handle the clustering. Create a KMeans object, set its data to the data you read from the file, and set the number of clusters, k, to a desired value.
6. Create a cluster function in your KMeans class to cluster the data and assign each data item a cluster. Invoke this function on the KMeans object to cluster the data.
7. Create another function to display the clustered data, showing the clusters clearly.
8. You may write the clusters to files and then import them into MATLAB. On clustering with k = 2 and plotting the imported data, you should see that the data has been clustered into the northern and southern halves of the country, and that the southern half has more health facilities.

3.3.1 The Data class

Data
- x : Double
- y : Double
- cluster : Integer
+ getters and setters, etc.

3.3.2 The KMeans class

KMeans
- rawData : List of Data
- centroids : List of Data
- k : Integer
+ cluster() : void
+ getters and setters, etc.

A Java sketch of these two classes follows the summary below.

3.3.3 The cluster function

Function cluster()                      /* k-means clustering algorithm */
Data:   rawData, an array of Data from file; k, the number of clusters
Result: rawData with stable cluster assignments

1. Randomly assign the data items to clusters;
while a stable assignment has not been found do
    2. Compute the centroid of each cluster;
    3. Reassign each data item to the cluster of the closest centroid, i.e.
       the cluster that minimizes the point-to-cluster-centroid distance;

Algorithm 1: Summary of the k-means clustering algorithm
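Before expanding the pseudocode, here is a minimal Java sketch of the two classes from the UML diagrams above. Java is assumed because the lab later suggests JFreeChart and JavaFX; the constructor and the exact accessor set beyond the diagrams are my own choices, not part of the lab specification.

    import java.util.ArrayList;
    import java.util.List;

    class Data {
        private double x;       // e.g. longitude
        private double y;       // e.g. latitude
        private int cluster;    // index of the cluster this point belongs to

        Data(double x, double y) { this.x = x; this.y = y; }

        double getX() { return x; }
        double getY() { return y; }
        int getCluster() { return cluster; }
        void setCluster(int cluster) { this.cluster = cluster; }
    }

    class KMeans {
        private List<Data> rawData = new ArrayList<>();
        private List<Data> centroids = new ArrayList<>();
        private int k;

        void setRawData(List<Data> rawData) { this.rawData = rawData; }
        void setK(int k) { this.k = k; }
        List<Data> getRawData() { return rawData; }
        List<Data> getCentroids() { return centroids; }

        void cluster() {
            // Filled in after Algorithm 2 below.
        }
    }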

Function cluster()                      /* k-means clustering algorithm */
Data:   rawData, an array of Data from file; k, the number of clusters
Result: rawData with stable cluster assignments

/* 1a. Randomly assign the data items to clusters */
for i = 0 to rawData.LENGTH - 1 do
    rawData[i].CLUSTER = i mod k;

/* 1b. Refine the randomization with the Fisher-Yates shuffle */
for i = 0 to rawData.LENGTH - 1 do
    r = a random number in the range [i, rawData.LENGTH - 1];
    swapClusters(rawData[i], rawData[r]);

/* Until a stable assignment has been found */
stableAssignment = false;
while !stableAssignment do
    /* 2. Compute the centroid of each cluster */
    centroids.clear();
    for i = 0 to k - 1 do
        clusterI = rawData.where(CLUSTER == i);
        centroidI = new Data;
        centroidI.X = clusterI.sumX() / clusterI.LENGTH;
        centroidI.Y = clusterI.sumY() / clusterI.LENGTH;
        centroidI.CLUSTER = i;
        centroids.add(centroidI);

    /* For each point */
    stableAssignment = true;
    for i = 0 to rawData.LENGTH - 1 do
        /* 3a. Compute and minimize the point-to-centroid distance */
        minCluster = 0;
        minDistance = distance(rawData[i], centroids[0]);
        for j = 1 to k - 1 do
            dist = distance(rawData[i], centroids[j]);
            if dist < minDistance then
                minCluster = j;
                minDistance = dist;
        /* 3b. Update the cluster assignment if necessary */
        if minCluster != rawData[i].CLUSTER then
            stableAssignment = false;
            rawData[i].CLUSTER = minCluster;

Algorithm 2: The full k-means clustering algorithm

For the point-to-cluster-centroid distance, the Euclidean distance may be used:

    dist = \sqrt{(rawData[i].X - centroids[j].X)^2 + (rawData[i].Y - centroids[j].Y)^2}
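One possible translation of Algorithm 2 into Java, filling in the cluster() body of the KMeans sketch above. This is an illustrative sketch, not the lab's required implementation; note that, like the pseudocode, it does not guard against a cluster becoming empty during reassignment.

    import java.util.Random;

    // Inside class KMeans (see the sketch above):
    void cluster() {
        Random rng = new Random();

        // 1a. Round-robin initial assignment so every cluster starts non-empty.
        for (int i = 0; i < rawData.size(); i++) {
            rawData.get(i).setCluster(i % k);
        }
        // 1b. Refine the randomization with the Fisher-Yates shuffle.
        for (int i = 0; i < rawData.size(); i++) {
            int r = i + rng.nextInt(rawData.size() - i);  // r in [i, length - 1]
            int tmp = rawData.get(i).getCluster();
            rawData.get(i).setCluster(rawData.get(r).getCluster());
            rawData.get(r).setCluster(tmp);
        }

        boolean stableAssignment = false;
        while (!stableAssignment) {
            // 2. Compute the centroid (arithmetic mean) of each cluster.
            centroids.clear();
            for (int c = 0; c < k; c++) {
                double sumX = 0, sumY = 0;
                int count = 0;
                for (Data d : rawData) {
                    if (d.getCluster() == c) { sumX += d.getX(); sumY += d.getY(); count++; }
                }
                Data centroid = new Data(sumX / count, sumY / count);  // assumes count > 0
                centroid.setCluster(c);
                centroids.add(centroid);
            }

            // 3. Reassign each point to the cluster of the closest centroid.
            stableAssignment = true;
            for (Data d : rawData) {
                int minCluster = 0;
                double minDistance = distance(d, centroids.get(0));
                for (int c = 1; c < k; c++) {
                    double dist = distance(d, centroids.get(c));
                    if (dist < minDistance) { minCluster = c; minDistance = dist; }
                }
                if (minCluster != d.getCluster()) {
                    stableAssignment = false;   // a point moved: iterate again
                    d.setCluster(minCluster);
                }
            }
        }
    }

    // Euclidean point-to-centroid distance.
    private static double distance(Data a, Data b) {
        double dx = a.getX() - b.getX();
        double dy = a.getY() - b.getY();
        return Math.sqrt(dx * dx + dy * dy);
    }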

4 Challenge: Identifying the Black Pod disease

Black pod disease, also known as Phytophthora pod rot, is a destructive disease of cocoa pods that reduces the yield of cocoa. The symptoms are:

1. Translucent spots on the pod surface, which develop into small, dark, hard spots
2. The entire pod becomes black and necrotic within 14 days of the initial symptoms
3. White to yellow downy growth on the black areas
4. Internal tissues become dry and shrivelled, resulting in mummified pods

Mummified pods should be removed and destroyed to reduce the spread of the disease.

Figure 6: TOP: Healthy cocoa pods. BOTTOM: Cocoa suffering from black pod; the small dark ones are mummified

Assume that, on a cocoa farm, we have a robot equipped with a computer vision system that is able to isolate images of cocoa pods from its input images. We wish to employ k-means clustering to identify the diseased pods.
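For requirement 2 in the next section, here is a hedged Java sketch of reading an image's pixels into RGB observations using the standard-library BufferedImage and ImageIO classes (the lab does not mandate these). Note that the Data class above is two-dimensional, whereas a pixel is a three-dimensional (R, G, B) observation, so for this challenge you would generalize Data to three components or to a vector.

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    // Read an image into an n-by-3 matrix of RGB values in [0, 255],
    // one row per pixel, ready for k-means clustering.
    static double[][] readPixels(File imageFile) throws IOException {
        BufferedImage img = ImageIO.read(imageFile);
        int w = img.getWidth(), h = img.getHeight();
        double[][] rgb = new double[w * h][3];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int p = img.getRGB(x, y);      // packed as 0xAARRGGBB
                int i = y * w + x;
                rgb[i][0] = (p >> 16) & 0xFF;  // red
                rgb[i][1] = (p >> 8) & 0xFF;   // green
                rgb[i][2] = p & 0xFF;          // blue
            }
        }
        return rgb;
    }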

4.1 Requirements

1. The program should have a user interface that allows one to choose an image of a pod to be analyzed.
2. For the selected image, read the pixel values into an RGB matrix and apply the k-means clustering algorithm to it.
3. Plot a histogram of the cluster counts versus the centroids: for each centroid, draw a vertical bar colored with the color the centroid value represents, with height corresponding to the number of points in that cluster. You may use a library like JFreeChart or the built-in charts in JavaFX. (A sketch of computing these counts appears after the Tasks section.)
4. One may then analyze these histograms and make a judgment as to whether the pod is diseased or not. Automating this judgment would constitute supervised learning.
5. Extra credit: modify your program so that, after clustering, it displays the image with the darkest regions highlighted in purple.

5 Tasks

1. Identify three strengths and weaknesses of the k-means clustering algorithm.
2. Would you say k-means is a hard clustering algorithm or a soft clustering algorithm?
3. What is the difference between the k-means, the k-medoids and the k-medians algorithms?
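For requirement 3 above, a minimal sketch of the data behind the histogram: counting how many pixels fall into each cluster. The counts, paired with each centroid's color, are what a JFreeChart or JavaFX bar chart would display; the assignments array is assumed to hold each pixel's cluster index after clustering.

    // counts[c] = number of pixels assigned to cluster c; each bar is drawn
    // in the color of centroid c with height counts[c].
    static int[] clusterCounts(int[] assignments, int k) {
        int[] counts = new int[k];
        for (int a : assignments) {
            counts[a]++;
        }
        return counts;
    }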