Semi-supervised Learning
COMP 790-90 Seminar, Spring 2011
The University of North Carolina at Chapel Hill

Overview
- Semi-supervised learning
- Semi-supervised classification
- Semi-supervised clustering
  - Search-based methods: COP K-Means, Seeded K-Means, Constrained K-Means
  - Similarity-based methods
Supervised Classification Example

Unsupervised Clustering Example
Semi-Supervised Learning
Combines labeled and unlabeled data during training to improve performance:
- Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier
- Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data
Semi-Supervised Classification Example
Semi-Supervised Classification
Algorithms:
- Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]
- Co-training [Blum:COLT98]
- Transductive SVMs [Vapnik:98, Joachims:ICML99]
Assumptions:
- Known, fixed set of categories given in the labeled data
- Goal is to improve classification of examples into these known categories

Semi-Supervised Clustering Example
Second Semi-Supervised Clustering Example
Semi-Supervised Clustering
- Can group data using the categories in the initial labeled data
- Can also extend and modify the existing set of categories as needed to reflect other regularities in the data
- Can cluster a disjoint set of unlabeled data using the labeled data as a guide to the type of clusters desired
Problem Definition
Input:
- A set of unlabeled objects
- Some domain knowledge
Output:
- A partitioning of the objects into clusters
Objective:
- Maximum intra-cluster similarity
- Minimum inter-cluster similarity
- High consistency between the partitioning and the domain knowledge

What is Domain Knowledge?
- Must-link and cannot-link constraints
- Class labels
- Ontology
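The pairwise-constraint form of domain knowledge can be represented as plain data plus a violation counter. This is a minimal sketch; the names (`must_link`, `cannot_link`, `violates`) are illustrative, not from the slides:

```python
# Pairwise constraints are just index pairs over the unlabeled objects.
must_link = {(0, 1), (2, 3)}   # pairs that must share a cluster
cannot_link = {(0, 4)}         # pairs that must be separated

def violates(assignment, must_link, cannot_link):
    """Count constraint violations for a full assignment (list of cluster ids)."""
    bad = 0
    for i, j in must_link:
        if assignment[i] != assignment[j]:
            bad += 1
    for i, j in cannot_link:
        if assignment[i] == assignment[j]:
            bad += 1
    return bad

# Objects 0..4 assigned to two clusters; all constraints satisfied.
violates([0, 0, 1, 1, 1], must_link, cannot_link)  # -> 0
```

The "high consistency" objective above can then be read as: prefer partitionings whose violation count is zero (hard constraints) or small (soft constraints).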
Why Semi-Supervised Clustering?
- Why not clustering? It cannot incorporate prior knowledge into the clustering process
- Why not classification? Sometimes there is insufficient labeled data
Potential applications:
- Bioinformatics (gene and protein clustering)
- Document hierarchy construction
- News/email categorization
- Image categorization

Semi-Supervised Clustering Approaches
- Search-based: alter the clustering algorithm using the constraints
- Similarity-based: alter the similarity measure based on the constraints
- Combination of both
Search-Based Semi-Supervised Clustering
Alter the clustering algorithm that searches for a good partitioning by:
- Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz:ANNIE99]
- Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]
- Using the labeled data to initialize clusters in an iterative refinement algorithm (K-Means, EM) [Basu:ICML02]

Unsupervised K-Means Clustering
K-Means iteratively partitions a dataset into K clusters.
Algorithm:
- Initialize the K cluster centers {mu_l}, l = 1..K, randomly
- Repeat until convergence:
  - Cluster assignment step: assign each data point x to the cluster X_l such that the L2 distance of x from mu_l (the center of X_l) is minimum
  - Center re-estimation step: re-estimate each cluster center mu_l as the mean of the points in that cluster
K-Means Objective Function
Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:

  J = sum_{l=1}^{K} sum_{x_i in X_l} ||x_i - mu_l||^2

Initialization of the K cluster centers:
- Totally random
- Random perturbation from the global mean
- Heuristics to ensure well-separated centers, etc.

K-Means Example
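The two alternating steps and the objective J above can be sketched in plain Python. This is an illustrative implementation, not the code behind the cited experiments; `dist2` and `mean` are helper names introduced here:

```python
import random

def dist2(p, q):
    """Squared L2 distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))

def kmeans(points, k, iters=100, seed=0):
    """Unsupervised K-Means: random initialization, then alternate the
    assignment and center re-estimation steps until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # totally random initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Cluster assignment step: each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            l = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[l].append(p)
        # Center re-estimation step: each center becomes its cluster mean.
        new_centers = [mean(cl) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:           # converged
            break
        centers = new_centers
    return centers, clusters

def objective(centers, clusters):
    """J = sum over clusters l of sum over x in X_l of ||x - mu_l||^2."""
    return sum(dist2(p, centers[l])
               for l, cl in enumerate(clusters) for p in cl)
```

On two well-separated blobs, the alternation settles quickly and J is the within-cluster scatter; the "locally minimizes" caveat shows up when a bad random initialization traps both centers near one blob.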
K-Means Example: Randomly Initialize Means
K-Means Example: Assign Points to Clusters
K-Means Example: Re-estimate Means
K-Means Example: Re-assign Points to Clusters
K-Means Example: Re-estimate Means
K-Means Example: Re-assign Points to Clusters
K-Means Example: Re-estimate Means and Converge

Semi-Supervised K-Means
- Constraints (must-link, cannot-link) are given: COP K-Means
- Partial label information is given: Seeded K-Means, Constrained K-Means (Basu et al., ICML'02)
COP K-Means
COP K-Means is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
Initialization: cluster centers are chosen randomly, but so that no must-link constraints are violated.
Algorithm: during the cluster assignment step, a point is assigned to its nearest cluster that does not violate any of its constraints. If no such assignment exists, abort.
Based on Wagstaff et al., ICML'01

COP K-Means Algorithm
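The constrained assignment step can be sketched as follows. This is an illustrative reading of the Wagstaff et al. procedure, not their code; each point tries the centers in order of distance, takes the first cluster that breaks no constraint against already-assigned points, and the whole run aborts if some point has no legal cluster:

```python
def dist2(p, q):
    """Squared L2 distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def _partner(i, pair):
    """Return the other endpoint if i is in the pair, else None."""
    a, b = pair
    if a == i:
        return b
    if b == i:
        return a
    return None

def breaks_constraint(i, c, assignment, must_link, cannot_link):
    """True if putting point i into cluster c violates a constraint
    against a point that has already been assigned."""
    for pair in must_link:
        j = _partner(i, pair)
        if j is not None and j in assignment and assignment[j] != c:
            return True
    for pair in cannot_link:
        j = _partner(i, pair)
        if j is not None and j in assignment and assignment[j] == c:
            return True
    return False

def cop_assign(points, centers, must_link, cannot_link):
    """COP K-Means assignment step: nearest legal cluster for each point,
    or None (abort) if some point has no legal cluster."""
    assignment = {}
    for i, p in enumerate(points):
        order = sorted(range(len(centers)), key=lambda c: dist2(p, centers[c]))
        for c in order:
            if not breaks_constraint(i, c, assignment, must_link, cannot_link):
                assignment[i] = c
                break
        else:
            return None  # no legal assignment exists: COP K-Means aborts
    return [assignment[i] for i in range(len(points))]
```

Note the greedy, order-dependent nature of this step: it is exactly what makes the failure case on the later "Illustration" slide possible even when a legal partitioning exists.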
Illustration: Determine its label
- Must-link: assign to the red class

Illustration: Determine its label
- Cannot-link: assign to the red class
Illustration: Determine its label
- Must-link and cannot-link together: the clustering algorithm fails

Evaluation
Rand index: measures the agreement between two partitions, P1 and P2, of the same data set D.
Each partition is viewed as a collection of n(n-1)/2 pairwise decisions, where n is the size of D:
- a is the number of pairs that P1 and P2 both place in the same cluster
- b is the number of pairs that both partitions place in different clusters
Total agreement is then Rand(P1, P2) = (a + b) / (n(n-1)/2)
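The Rand index formula translates directly into a short function. A minimal sketch over lists of cluster labels (the function name is introduced here):

```python
from itertools import combinations

def rand_index(p1, p2):
    """Rand(P1, P2) = (a + b) / (n(n-1)/2) for two labelings of n objects:
    a = pairs grouped together in both partitions,
    b = pairs separated in both partitions."""
    n = len(p1)
    a = b = 0
    for i, j in combinations(range(n), 2):
        same1 = p1[i] == p1[j]
        same2 = p2[i] == p2[j]
        if same1 and same2:
            a += 1
        elif not same1 and not same2:
            b += 1
    return (a + b) / (n * (n - 1) / 2)
```

Because only pairwise decisions are compared, the index is insensitive to how clusters are numbered: relabeling a partition leaves its score unchanged.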
Semi-Supervised K-Means
Seeded K-Means:
- Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i
- Seed points are used only for initialization, not in subsequent steps
Constrained K-Means:
- Labeled data provided by the user are used to initialize the K-Means algorithm
- Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated
Based on Basu et al., ICML'02
Seeded K-Means
Use the labeled data to find the initial centroids and then run K-Means. The labels of seeded points may change.

Seeded K-Means Example
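Seeded initialization can be sketched as below (illustrative names; `seeds` maps a point index to its user-given label 0..K-1). After this step, Seeded K-Means simply runs ordinary K-Means, which is exactly why the labels of seed points may later change:

```python
def seeded_init(points, seeds):
    """Seeded K-Means initialization: the initial center for cluster label i
    is the mean of the seed points carrying label i."""
    k = max(seeds.values()) + 1
    groups = [[] for _ in range(k)]
    for i, label in seeds.items():
        groups[label].append(points[i])
    # Coordinate-wise mean of each seed group becomes that cluster's center.
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]
```

Compared with random initialization, the seeds place the starting centers inside the regions the user cares about, so the subsequent K-Means search tends to converge to a partitioning consistent with the labels, but nothing enforces that consistency.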
Seeded K-Means Example: Initialize Means Using Labeled Data
Seeded K-Means Example: Assign Points to Clusters
Seeded K-Means Example: Re-estimate Means
Seeded K-Means Example: Assign Points to Clusters and Converge (the label of a seed point is changed)
Constrained K-Means
Use the labeled data to find the initial centroids and then run K-Means. The labels of seeded points will not change.

Constrained K-Means Example
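The clamped assignment step that distinguishes Constrained K-Means from Seeded K-Means can be sketched as follows (illustrative names; `dist2` is a helper defined here, and `seeds` maps a point index to its fixed label):

```python
def dist2(p, q):
    """Squared L2 distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def constrained_assign(points, centers, seeds):
    """Constrained K-Means assignment step: seed points keep their
    user-given label; only non-seed points go to the nearest center."""
    labels = []
    for i, p in enumerate(points):
        if i in seeds:
            labels.append(seeds[i])  # clamped: never re-estimated
        else:
            labels.append(min(range(len(centers)),
                              key=lambda c: dist2(p, centers[c])))
    return labels
```

A seed point keeps its label even when another center is nearer, so the seeds keep pulling their clusters' means toward the user-specified grouping on every re-estimation step.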
Constrained K-Means Example: Initialize Means Using Labeled Data
Constrained K-Means Example: Assign Points to Clusters
Constrained K-Means Example: Re-estimate Means and Converge

Datasets
Data sets:
- UCI Iris (3 classes; 150 instances)
- CMU 20 Newsgroups (20 classes; 20,000 instances)
- Yahoo! News (20 classes; 2,340 instances)
Data subsets created for experiments:
- Small-20 newsgroup: random sample of 100 documents from each newsgroup, created to study the effect of data size on the algorithms
- Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study the effect of data separability on the algorithms
- Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x)
Evaluation
- Objective function
- Mutual information

Results: MI and Seeding
Zero noise in seeds [Small-20 NewsGroup]: semi-supervised K-Means is substantially better than unsupervised K-Means.
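Mutual information between a clustering and a reference labeling can be computed directly from joint label counts. A minimal sketch in nats (the function name is introduced here; reported results typically use a normalized variant):

```python
from math import log
from collections import Counter

def mutual_information(labels_a, labels_b):
    """MI between two labelings of the same n items:
    MI = sum over label pairs (a, b) of p_ab * log(p_ab / (p_a * p_b)),
    estimated from the empirical counts."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), nab in count_ab.items():
        # p_ab / (p_a * p_b) = n * n_ab / (n_a * n_b)
        mi += (nab / n) * log(n * nab / (count_a[a] * count_b[b]))
    return mi
```

Like the Rand index, MI is invariant to relabeling, which makes it suitable for scoring a clustering against ground-truth classes; it is zero when the two labelings are independent.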
Results: Objective Function and Seeding
User labeling consistent with K-Means assumptions [Small-20 NewsGroup]: the objective function of the data partition increases exponentially with seed fraction.
User labeling inconsistent with K-Means assumptions [Yahoo! News]: the objective function of the constrained algorithms decreases with seeding.