Machine Learning. Semi-Supervised Learning. Manfred Huber

Machine Learning
Semi-Supervised Learning
Manfred Huber 2015 1

Semi-Supervised Learning
Semi-supervised learning refers to learning from data where one part contains the desired output information and the other part does not.
It is mostly applied to supervised learning problems (classification/regression) with partially labeled data.
Sometimes partially labeled data is also used to solve unsupervised learning problems (semi-unsupervised learning); here the labeled data provides constraints.
Generally, unlabeled data is easier to obtain and more abundant than labeled data.
Manfred Huber 2015 2

Semi-Supervised Learning
The goal of most semi-supervised learning is to learn a better classifier/regression function than can be learned from the labeled data alone.
The training data can be divided into two sets:
$D_s = \{(x^{(i)}, y^{(i)}) : i \in [1..n_s]\}$, $D_u = \{x^{(i)} : i \in [n_s+1..n]\}$, $D = D_s \cup D_u$
Learn $h_D(x)$ that has a higher expected performance on test data than $h_{D_s}(x)$.
Sometimes only a labeling of the unlabeled data is required (transductive learning).
Manfred Huber 2015 3
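As a concrete illustration (not from the original slides), the sketch below sets up such a split in Python, using scikit-learn's common convention of marking unlabeled points with -1. The toy data and the names X, y, X_s, y_s, X_u, n_s are hypothetical and are reused by the later sketches.

    import numpy as np

    # Hypothetical toy setup: the first n_s points carry labels, the rest do not.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))            # all inputs, D = D_s u D_u
    y = np.full(200, -1)                      # -1 marks "no label" (scikit-learn convention)
    n_s = 20
    y[:n_s] = (X[:n_s, 0] > 0).astype(int)    # only the first n_s points are labeled

    X_s, y_s = X[:n_s], y[:n_s]               # labeled set D_s
    X_u = X[n_s:]                             # unlabeled set D_u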

Semi-Supervised Learning
The main benefit of the unlabeled data in semi-supervised learning is that it provides information about the structure of the overall data.
Assumption: labeled and unlabeled data come from the same distribution.
Manfred Huber 2015 4

Cluster and Label
Cluster and Label is the simplest form of semi-supervised learning:
First apply clustering to all of the data.
Then apply supervised learning to the labeled instances in each cluster.
Manfred Huber 2015 5
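A minimal sketch of Cluster and Label, assuming k-means as the clustering step and a simple majority vote over the labeled members of each cluster as the supervised step. The function name cluster_and_label and its parameters are illustrative, not from the slides; clusters without any labeled member stay unlabeled (-1).

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_and_label(X, y, n_clusters=5):
        """y uses -1 for unlabeled points; returns a label for every point."""
        clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
        y_pred = np.full(len(X), -1)
        for c in range(n_clusters):
            members = clusters == c
            known = y[members & (y != -1)]                     # labeled instances in this cluster
            if len(known) > 0:
                y_pred[members] = np.bincount(known).argmax()  # majority label of the cluster
        return y_pred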

Cluster and Label
But: this works only if the cluster structure matches the underlying class structure.
We can also use unsupervised feature learning to derive new features for supervised learning (see the sketch below).
This requires that the feature criteria match the class structure.
Manfred Huber 2015 6
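As a hedged illustration of the feature-learning variant, the sketch below fits PCA on all of the data (labeled and unlabeled) and then trains a supervised classifier on the labeled subset only. PCA and logistic regression are arbitrary stand-ins, and X, X_s, y_s, X_u are the hypothetical arrays from the earlier sketch.

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    # Learn the feature transform on all data (labeled + unlabeled), then train on D_s only.
    pca = PCA(n_components=2).fit(X)                     # unsupervised feature learning on D
    clf = LogisticRegression().fit(pca.transform(X_s), y_s)
    y_u_pred = clf.predict(pca.transform(X_u))           # predictions for the unlabeled points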

Self-Training
Self-training can be applied with any supervised learning algorithm that provides a measure of confidence for a label assignment.
E.g. MLE, MAP, or maximum-margin approaches.
The basic idea is to incrementally augment the labeled set with self-labeled instances:
Apply the supervised learner to the labeled data.
Apply the learned hypothesis to the unlabeled data.
Move the k highest-confidence samples to the labeled set.
Repeat, retraining on the augmented labeled set, until all of the data is labeled.
Manfred Huber 2015 7
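A sketch of this loop, assuming a probabilistic classifier whose predict_proba output serves as the confidence measure; the function self_train and the choice of logistic regression are illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_train(X_s, y_s, X_u, k=10):
        """Repeatedly move the k most confident self-labeled points into the labeled set."""
        X_lab, y_lab, X_unl = X_s.copy(), y_s.copy(), X_u.copy()
        clf = LogisticRegression()
        while len(X_unl) > 0:
            clf.fit(X_lab, y_lab)                         # train on the current labeled set
            proba = clf.predict_proba(X_unl)              # label the unlabeled data
            conf = proba.max(axis=1)                      # confidence of each predicted label
            top = np.argsort(conf)[-k:]                   # k most confident instances
            X_lab = np.vstack([X_lab, X_unl[top]])
            y_lab = np.concatenate([y_lab, clf.classes_[proba[top].argmax(axis=1)]])
            X_unl = np.delete(X_unl, top, axis=0)
        return clf.fit(X_lab, y_lab)

Recent versions of scikit-learn also ship a ready-made variant of this idea as sklearn.semi_supervised.SelfTrainingClassifier.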

Self-Training
Self-training basically generates its own training data.
But: self-training is prone to reinforcing its own mistakes.
Manfred Huber 2015 8

Co-Training
Self-training is a simple algorithm that often works well, but it can produce results that are worse (even on the labeled data) than training only on the labeled data.
Co-training follows a similar idea but tries to reduce the self-reinforcement problem.
It requires two views of the same data:
Each data point exists in two representations, $x^{(i)} = (x^{(i)}_{(1)}, x^{(i)}_{(2)})$.
Each view contains all of the data, but in a different representation (ideally the views are conditionally independent).
Manfred Huber 2015 9

Co-Training
The operation of co-training is similar to self-training (see the sketch below):
In each view, run the supervised learner on the labeled data.
In each view, label the unlabeled data using the learned hypothesis.
In each view, identify the k most confident unlabeled instances, remove them from the unlabeled set of this view, and add them to the labeled data set of the other view.
Repeat until all unlabeled data has been labeled.
Form the final classifier/regression function by combining the learned classifiers; averaging or voting are most commonly used.
Manfred Huber 2015 10
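A sketch of this procedure, assuming Gaussian naive Bayes learners and, as a simplification, removing each round's picks from the unlabeled pools of both views so the two views stay aligned. The function co_train and the view arrays X1_*, X2_* are illustrative names.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def co_train(X1_s, X2_s, y_s, X1_u, X2_u, k=5, rounds=20):
        """X1_* and X2_* are two views (representations) of the same data points."""
        L1_X, L1_y = X1_s.copy(), y_s.copy()      # labeled set used by the view-1 learner
        L2_X, L2_y = X2_s.copy(), y_s.copy()      # labeled set used by the view-2 learner
        U1, U2 = X1_u.copy(), X2_u.copy()
        clf1, clf2 = GaussianNB(), GaussianNB()
        for _ in range(rounds):
            if len(U1) == 0:
                break
            clf1.fit(L1_X, L1_y)
            clf2.fit(L2_X, L2_y)
            p1, p2 = clf1.predict_proba(U1), clf2.predict_proba(U2)
            top1 = np.argsort(p1.max(axis=1))[-k:]    # view 1's most confident picks
            top2 = np.argsort(p2.max(axis=1))[-k:]    # view 2's most confident picks
            # each view's confident self-labels are handed to the *other* view's labeled set
            L2_X = np.vstack([L2_X, U2[top1]])
            L2_y = np.concatenate([L2_y, clf1.classes_[p1[top1].argmax(axis=1)]])
            L1_X = np.vstack([L1_X, U1[top2]])
            L1_y = np.concatenate([L1_y, clf2.classes_[p2[top2].argmax(axis=1)]])
            keep = np.setdiff1d(np.arange(len(U1)), np.union1d(top1, top2))
            U1, U2 = U1[keep], U2[keep]               # simplification: drop picks from both views
        return clf1.fit(L1_X, L1_y), clf2.fit(L2_X, L2_y)

A final prediction can then combine the two learners, e.g. by averaging clf1.predict_proba and clf2.predict_proba, matching the averaging/voting step above.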

Co-Training and Multi-View
Co-training has fewer problems with reinforcing errors than self-training:
There is a lower probability of mistakenly picking the same item in both views.
Multi-view learning generalizes this concept to learning multiple classifiers with different views:
Views are obtained either through different data representations or through the use of different supervised learners.
Using an assigned label on unlabeled data requires that multiple classifiers agree.
Manfred Huber 2015 11

Expectation Maximization
Another way to limit self-reinforcing errors is to allow them to be undone later.
Expectation maximization (EM) can be used for this with any probabilistic supervised learner:
$\hat{\theta} = \operatorname{argmax}_{\theta} P(D_s, D_u \mid \theta) = \operatorname{argmax}_{\theta} \sum_{Y_u} P(D_s, D_u, Y_u \mid \theta)$
1. Apply the supervised learner to the labeled data.
2. Use the current hypothesis to compute the expectation (probability) of the label for each unlabeled data point.
3. Re-train the supervised learner using the labeled data and the unlabeled data with the assigned expected labels.
4. Repeat from step 2 until convergence.
Manfred Huber 2015 13
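A sketch of this EM loop for a generative classifier with one Gaussian per class (the setting of the next slides): labeled points keep hard labels, unlabeled points receive soft expected labels, and the model is refit on both. The function semi_supervised_em and its details (full covariances, the small ridge term for numerical stability) are illustrative assumptions.

    import numpy as np
    from scipy.stats import multivariate_normal

    def semi_supervised_em(X_s, y_s, X_u, n_iter=50):
        classes = np.unique(y_s)
        K, d = len(classes), X_s.shape[1]
        X = np.vstack([X_s, X_u])
        R_s = (y_s[:, None] == classes[None, :]).astype(float)   # hard labels for D_s

        def m_step(Xw, R):
            # weighted MLE of class priors, means and covariances
            Nk = R.sum(axis=0) + 1e-12
            priors = Nk / Nk.sum()
            means = (R.T @ Xw) / Nk[:, None]
            covs = [((R[:, k][:, None] * (Xw - means[k])).T @ (Xw - means[k])) / Nk[k]
                    + 1e-6 * np.eye(d) for k in range(K)]
            return priors, means, covs

        # step 1: supervised fit on the labeled data only
        priors, means, covs = m_step(X_s, R_s)
        for _ in range(n_iter):
            # E-step: expected (soft) labels for the unlabeled data under the current model
            like = np.column_stack([priors[k] * multivariate_normal.pdf(X_u, means[k], covs[k])
                                    for k in range(K)])
            R_u = like / (like.sum(axis=1, keepdims=True) + 1e-300)
            # M-step: refit on labeled (hard) plus unlabeled (soft) data
            priors, means, covs = m_step(X, np.vstack([R_s, R_u]))
        return priors, means, covs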

Example: Gaussian Mixture
The hypothesis space for the generative classifier is a Gaussian for each class.
This corresponds to Gaussian mixture model clustering if no labeled data is available.
Manfred Huber 2015 14

Example: Gaussian Mixture
This requires that the underlying assumption of the hypothesis space (i.e., in this case, of the generative model) is approximately correct.
If it is not, the final result can be worse.
Can use a mixture within each class.
Can change the weight of the unlabeled data in the likelihood function.
Manfred Huber 2015 15

Expectation Maximization
Expectation maximization will generally work well with probabilistic algorithms if the hypothesis space is appropriate.
It locally optimizes the marginal data probability, starting from the supervised-data solution.
A slightly different approach is to view the unlabeled data largely in terms of regularization:
The unlabeled data is not part of the main performance function but of the regularization term.
Manfred Huber 2015 16

Graph Regularization
Use label smoothness as a regularization term.
Label smoothness implies that similar data items should have similar labels.
Graph regularization uses a weighted graph (see the construction sketch below):
The nodes of the graph are all of the data points.
Weighted edges connect nodes and convey the similarity between two nodes.
A fully connected graph with weights representing similarity.
A k-nearest-neighbor graph to reduce the number of edges.
A weighted ε-distance graph (only nodes closer than a threshold ε are connected).
Similarity propagates along the graph structure.
Manfred Huber 2015 17
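A sketch of the three graph constructions, assuming an RBF (Gaussian) similarity and reusing the hypothetical array X from the earlier sketches; the bandwidth gamma, the neighbor count, and the threshold eps are arbitrary choices.

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel, euclidean_distances
    from sklearn.neighbors import kneighbors_graph

    # Fully connected graph: RBF (Gaussian) similarity between every pair of points.
    W_full = rbf_kernel(X, gamma=1.0)

    # k-nearest-neighbor graph: keep only edges to each point's k closest neighbors.
    conn = kneighbors_graph(X, n_neighbors=10, include_self=False).toarray()
    conn = np.maximum(conn, conn.T)                  # symmetrize the adjacency
    W_knn = W_full * conn

    # epsilon-distance graph: connect only pairs of points closer than a threshold eps.
    eps = 3.0
    W_eps = W_full * (euclidean_distances(X) < eps)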

Graph Regularization
Graph regularization uses the data graph structure to define and propagate similarity.
Similarity in terms of labels can be established and propagated along a sequence of similar nodes.
Items that are not similar in feature space can still be close in the graph.
Label smoothness over the graph propagates based on the data items, not based on pure feature similarity.
Manfred Huber 2015 18

Graph Regularization
Graph regularization represents label smoothness over the graph in the form of a regularization term:
$\hat{\theta} = \operatorname{argmin}_{\theta} \Big( \sum_{(x^{(i)}, y^{(i)}) \in D_s} \big(y^{(i)} - h_\theta(x^{(i)})\big)^2 + \lambda \sum_{x^{(i)}, x^{(j)} \in D} w_{i,j} \big(h_\theta(x^{(i)}) - h_\theta(x^{(j)})\big)^2 \Big)$
Squared error over the labeled data with a regularization term enforcing label smoothness.
Various algorithms can be used to solve this optimization.
Manfred Huber 2015 20
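One concrete way to solve this optimization, shown as a sketch: if the hypothesis values f_i = h_theta(x^(i)) at the data points are optimized directly, the objective is quadratic and the minimizer is the solution of a linear system involving the graph Laplacian. The function below and its small ridge term are illustrative assumptions; scikit-learn's LabelPropagation and LabelSpreading implement closely related graph-based methods.

    import numpy as np

    def graph_regularized_labels(W, y, labeled_mask, lam=1.0):
        """Minimize sum_labeled (y_i - f_i)^2 + lam * sum_{i,j} w_ij (f_i - f_j)^2 over f,
        where f_i plays the role of h_theta(x^(i)) at each data point."""
        L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian  L = D - W
        M = np.diag(labeled_mask.astype(float))        # selects the labeled nodes
        y0 = np.where(labeled_mask, y, 0.0)
        # setting the gradient to zero gives the linear system (M + 2*lam*L) f = M y
        f = np.linalg.solve(M + 2 * lam * L + 1e-9 * np.eye(len(W)), M @ y0)
        return f

    # usage with the k-NN graph and toy 0/1 labels from the earlier sketches:
    # scores = graph_regularized_labels(W_knn, y.astype(float), y != -1)
    # hard labels for the unlabeled points: (scores > 0.5).astype(int)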

Semi-Supervised SVMs
The semi-supervised SVM (S3VM), or transductive SVM (TSVM), uses the intrusion of unlabeled data into the margin as a regularizer:
$\min_{\xi, \hat{\xi}, w, b} \; \frac{1}{2}\|w\|^2 + C \Big( \sum_{i=1}^{n_s} \xi^{(i)} + \lambda \sum_{i=n_s+1}^{n} \hat{\xi}^{(i)} \Big)$
$\text{s.t.} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1 - \xi^{(i)}, \quad i = 1, \dots, n_s$
$\qquad \;\; |w^T x^{(i)} + b| \ge 1 - \hat{\xi}^{(i)}, \quad i = n_s+1, \dots, n$
$\qquad \;\; \xi^{(i)}, \hat{\xi}^{(i)} \ge 0$
Keep the unlabeled data outside the margin as much as possible; this avoids putting the discriminant through dense regions of the data.
The optimization is harder to solve due to the total amount of data.
Manfred Huber 2015 21-22
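A heavily simplified sketch of the S3VM idea, assuming a linear SVM and a simple annealing loop that alternates between guessing labels for the unlabeled points and retraining with a growing weight on them; real TSVM solvers additionally swap label assignments and enforce class-balance constraints, which this sketch omits. The function simple_tsvm and its parameters are illustrative, reusing X_s, y_s, X_u from the earlier sketches.

    import numpy as np
    from sklearn.svm import LinearSVC

    def simple_tsvm(X_s, y_s, X_u, C=1.0, unlabeled_weights=(0.01, 0.1, 0.5, 1.0)):
        """Alternate between guessing labels for the unlabeled points and retraining,
        while gradually increasing the weight of the unlabeled data (annealing)."""
        clf = LinearSVC(C=C).fit(X_s, y_s)            # start from the supervised solution
        for c_u in unlabeled_weights:
            y_u = clf.predict(X_u)                    # current label guesses for D_u
            X_all = np.vstack([X_s, X_u])
            y_all = np.concatenate([y_s, y_u])
            w = np.concatenate([np.ones(len(X_s)), np.full(len(X_u), c_u)])
            clf = LinearSVC(C=C).fit(X_all, y_all, sample_weight=w)
        return clf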

Semi-Supervised Learning
Semi-supervised learning is aimed at improving learning performance through the use of a mix of labeled and unlabeled data.
It takes advantage of the fact that unlabeled data is often easier to obtain and much more numerous.
Many techniques exist:
Learning/training regimens that can be used with existing algorithms, e.g. self-training, EM, and multi-view learning.
Modifications of learning through regularization, e.g. graph regularization and S3VM.
Manfred Huber 2015 23