Semi-supervised learning


Semi-supervised Learning. COMP 790-90 Seminar, Spring 2011. The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL.

Overview: Semi-supervised learning: semi-supervised classification and semi-supervised clustering. Semi-supervised clustering: search-based methods (COP K-Means, Seeded K-Means, Constrained K-Means) and similarity-based methods.

Supervised Classification Example (figure slides).

Unsupervised Clustering Example (figure slides).

Semi-Supervised Learning: Combines labeled and unlabeled data during training to improve performance. Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier. Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

Semi-Supervised Classification Example (figure slides).

Semi-Supervised Classification. Algorithms: Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00], Co-training [Blum:COLT98], Transductive SVMs [Vapnik:98, Joachims:ICML99]. Assumptions: a known, fixed set of categories is given in the labeled data; the goal is to improve classification of examples into these known categories.

Semi-Supervised Clustering Example (figure slide).

Semi-Supervised Clustering Example and Second Semi-Supervised Clustering Example (figure slides).

Semi-Supervised Clustering: Can group data using the categories in the initial labeled data. Can also extend and modify the existing set of categories as needed to reflect other regularities in the data. Can cluster a disjoint set of unlabeled data, using the labeled data as a guide to the type of clusters desired.

Problem definition. Input: a set of unlabeled objects and some domain knowledge. Output: a partitioning of the objects into clusters. Objective: maximum intra-cluster similarity, minimum inter-cluster similarity, and high consistency between the partitioning and the domain knowledge.

What is Domain Knowledge? Must-link and cannot-link constraints, class labels, or an ontology.

Why semi-supervised clustering? Why not plain clustering? It cannot incorporate prior knowledge into the clustering process. Why not classification? Sometimes there is insufficient labeled data. Potential applications: bioinformatics (gene and protein clustering), document hierarchy construction, news/email categorization, image categorization.

Semi-Supervised Clustering Approaches. Search-based semi-supervised clustering: alter the clustering algorithm using the constraints. Similarity-based semi-supervised clustering: alter the similarity measure based on the constraints. Or a combination of both.

Search-Based Semi-Supervised Clustering: Alter the clustering algorithm that searches for a good partitioning by: modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]; enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]; using the labeled data to initialize clusters in an iterative refinement algorithm (k-means, EM) [Basu:ICML02].

Unsupervised K-Means Clustering: K-Means iteratively partitions a dataset into K clusters. Algorithm: Initialize the K cluster centers {mu_l}, l = 1..K, randomly. Repeat until convergence: Cluster assignment step: assign each data point x to the cluster X_l such that the L2 distance of x from mu_l (the center of X_l) is minimum. Center re-estimation step: re-estimate each cluster center mu_l as the mean of the points in that cluster.
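The two alternating steps above can be sketched in Python. This is a minimal illustration under my own naming (not code from the lecture): points are coordinate tuples, and the loop alternates assignment and re-estimation until the centers stop moving.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-Means: random initial centers, then alternate the
    cluster assignment step and the center re-estimation step."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        # Cluster assignment step: each point joins its nearest center (L2 distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            l = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[l].append(p)
        # Center re-estimation step: each center becomes the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters
```

On two well-separated point clouds this recovers the obvious two-cluster partition regardless of which points are drawn as initial centers.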

K-Means Objective Function: Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers: sum over l = 1..K of sum over x_i in X_l of ||x_i - mu_l||^2. Initialization of the K cluster centers: totally random; random perturbation from the global mean; a heuristic to ensure well-separated centers; etc.

K-Means Example (figure slide).

K-Means Example: Randomly Initialize Means (figure). K-Means Example: Assign Points to Clusters (figure).

K-Means Example: Re-estimate Means (figure). K-Means Example: Re-assign Points to Clusters (figure).

K-Means Example: Re-estimate Means (figure). K-Means Example: Re-assign Points to Clusters (figure).

K-Means Example: Re-estimate Means and Converge (figure).

Semi-Supervised K-Means: With constraints (must-link, cannot-link): COP K-Means. With partial label information (Basu, ICML '02): Seeded K-Means and Constrained K-Means.

COP K-Means: COP K-Means is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points. Initialization: cluster centers are chosen randomly, but so that no must-link constraints are violated. Algorithm: during the cluster assignment step, a point is assigned to its nearest cluster that does not violate any of its constraints. If no such assignment exists, abort. Based on Wagstaff et al., ICML01.

COP K-Means Algorithm (figure slide).
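The constrained assignment step can be sketched as follows. This is a hedged illustration, not the paper's code: `violates` and `cop_assign` are hypothetical helper names, constraints are pairs of point indices, and `assignment` holds the clusters of points assigned so far.

```python
def violates(point_idx, cluster_id, assignment, must_link, cannot_link):
    """True if putting point_idx into cluster_id breaks any constraint
    against an already-assigned point."""
    for a, b in must_link:
        other = b if a == point_idx else a if b == point_idx else None
        if other is not None and other in assignment and assignment[other] != cluster_id:
            return True  # must-link partner sits in a different cluster
    for a, b in cannot_link:
        other = b if a == point_idx else a if b == point_idx else None
        if other is not None and other in assignment and assignment[other] == cluster_id:
            return True  # cannot-link partner sits in this cluster
    return False

def cop_assign(point_idx, ranked_clusters, assignment, must_link, cannot_link):
    """Assign to the nearest legal cluster; ranked_clusters is ordered
    nearest-first. Returning None corresponds to COP K-Means aborting."""
    for c in ranked_clusters:
        if not violates(point_idx, c, assignment, must_link, cannot_link):
            return c
    return None
```

For example, with point 0 already in cluster 0, a must-link (0, 1) forces point 1 into cluster 0, a cannot-link (0, 1) pushes it to the next-nearest cluster, and if no cluster is legal the step returns None (abort).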

Illustration: determine the point's label. Must-link: assign to the red class (figure). Illustration: determine the point's label. Cannot-link: assign to the red class (figure).

Illustration: determine the point's label. With both a must-link and a cannot-link constraint in conflict, the clustering algorithm fails (figure).

Evaluation: the Rand index measures the agreement between two partitions, P1 and P2, of the same data set D. Each partition is viewed as a collection of n(n-1)/2 pairwise decisions, where n is the size of D. a is the number of decisions where P1 and P2 put a pair of objects into the same cluster; b is the number of decisions where the two instances are placed in different clusters in both partitions. Total agreement can then be calculated as Rand(P1, P2) = (a + b) / (n(n-1)/2).
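The Rand index as defined above can be computed directly from two label lists over the same n objects (a minimal sketch; the function name is my own):

```python
from itertools import combinations

def rand_index(p1, p2):
    """Rand(P1, P2) = (a + b) / (n(n-1)/2), where a counts pairs grouped
    together in both partitions and b counts pairs separated in both."""
    n = len(p1)
    a = b = 0
    for i, j in combinations(range(n), 2):
        same1 = p1[i] == p1[j]
        same2 = p2[i] == p2[j]
        if same1 and same2:
            a += 1
        elif not same1 and not same2:
            b += 1
    return (a + b) / (n * (n - 1) / 2)
```

Note the index only looks at pairwise decisions, so it is invariant to renaming the cluster labels: [0, 0, 1, 1] and [1, 1, 0, 0] agree perfectly.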

Evaluation (results figure).

Semi-Supervised K-Means. Seeded K-Means: labeled data provided by the user are used for initialization; the initial center for cluster i is the mean of the seed points having label i. Seed points are only used for initialization, not in subsequent steps. Constrained K-Means: labeled data provided by the user are used to initialize the K-Means algorithm; the cluster labels of the seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated. Based on Basu et al., ICML '02.
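The initialization shared by Seeded and Constrained K-Means, seeding each center with the mean of the points carrying that label, can be sketched as follows (a minimal illustration; `seeded_centers` is an assumed helper name, and seeds map point indices to class labels):

```python
def seeded_centers(points, seeds):
    """Initial center for cluster i = mean of the seed points labeled i.
    points: list of coordinate tuples; seeds: dict point index -> label."""
    by_label = {}
    for idx, label in seeds.items():
        by_label.setdefault(label, []).append(points[idx])
    return {
        label: tuple(sum(c) / len(pts) for c in zip(*pts))
        for label, pts in by_label.items()
    }
```

Seeded K-Means then runs ordinary K-Means from these centers, so the seed points' labels may drift; Constrained K-Means runs the same iterations but clamps the seed points to their given clusters during the assignment step.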

Seeded K-Means: use the labeled data to find the initial centroids and then run K-Means. The labels of the seeded points may change.

Seeded K-Means Example (figure slide).

Seeded K-Means Example: Initialize Means Using Labeled Data (figure). Seeded K-Means Example: Assign Points to Clusters (figure).

Seeded K-Means Example: Re-estimate Means (figure). Seeded K-Means Example: Assign Points to Clusters and Converge; note that the label of a seed point has changed (figure).

Constrained K-Means: use the labeled data to find the initial centroids and then run K-Means. The labels of the seeded points will not change.

Constrained K-Means Example (figure slide).

Constrained K-Means Example: Initialize Means Using Labeled Data (figure). Constrained K-Means Example: Assign Points to Clusters (figure).

Constrained K-Means Example: Re-estimate Means and Converge (figure).

Datasets. Data sets: UCI Iris (3 classes; 150 instances), CMU 20 Newsgroups (20 classes; 20,000 instances), Yahoo! News (20 classes; 2,340 instances). Data subsets created for experiments: Small-20 newsgroup: a random sample of 100 documents from each newsgroup, created to study the effect of data size on the algorithms. Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study the effect of data separability on the algorithms. Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows).

Evaluation measures: the clustering objective function and mutual information (figure slide).

Results: MI and Seeding. With zero noise in the seeds [Small-20 NewsGroup], semi-supervised K-Means is substantially better than unsupervised K-Means (figure).
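Mutual information between a clustering and the reference labels can be computed from the joint counts of the two labelings. This is the standard formula, not code from the lecture, using natural logarithms:

```python
from collections import Counter
from math import log

def mutual_information(labels1, labels2):
    """MI between two labelings of the same n objects, in nats:
    MI = sum over (u, v) of p(u, v) * log( p(u, v) / (p(u) p(v)) )."""
    n = len(labels1)
    c1, c2 = Counter(labels1), Counter(labels2)
    joint = Counter(zip(labels1, labels2))
    return sum(
        (n_uv / n) * log(n * n_uv / (c1[u] * c2[v]))
        for (u, v), n_uv in joint.items()
    )
```

Identical partitions of two equal-size clusters score log 2 (the full entropy), while an independent labeling scores 0; higher MI means the clustering recovers more of the label structure.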

Results: Objective Function and Seeding. When the user labeling is consistent with the K-Means assumptions [Small-20 NewsGroup], the objective function of the data partition increases exponentially with the seed fraction (figure). When the user labeling is inconsistent with the K-Means assumptions [Yahoo! News], the objective function of the constrained algorithms decreases with seeding (figure).