Semi-supervised learning (SSL) on graphs

Semi-supervised learning (SSL) on graphs 1

Announcement No office hour for William after class today! 2

Semi-supervised learning Given: a pool of labeled examples L, and a (usually larger) pool of unlabeled examples U. Option 1 for using L and U: ignore U and use supervised learning on L. Option 2: ignore the labels in L+U and use k-means, etc. to find clusters; then label each cluster using L. Question: can you use both L and U to do better? 3

SSL is Somewhere Between Clustering and Supervised Learning 4

SSL is Between Clustering and SL 5

What is a natural grouping among these objects? (slides: Bhavana Dalvi) 6

SSL is Between Clustering and SL Clustering is unconstrained and may not give you what you want; maybe this clustering is as good as the other. 7

SSL is Between Clustering and SL 8

SSL is Between Clustering and SL 9

SSL is Between Clustering and SL Supervised learning with few labels is also unconstrained and may not give you what you want. 10

SSL is Between Clustering and SL 11

SSL is Between Clustering and SL This clustering isn't consistent with the labels. 12

SSL is Between Clustering and SL 13

SSL in Action: The NELL System 14

Types of SSL Margin-based: transductive SVM, logistic regression with entropic regularization. Generative: seeded k-means. Nearest-neighbor-like: graph-based SSL. 15

Harmonic Fields aka CoEM aka wvRN 16

Harmonic fields (Ghahramani, Lafferty and Zhu) Idea: construct a graph connecting the most similar examples (a k-NN graph). Intuition: nearby points should have similar labels, so labels should propagate through the graph. Formalization: try to minimize an energy defined as $E(y) = \sum_{i,j} w_{ij}(y_i - y_j)^2$, with the seed nodes clamped to their observed labels. In this example y is a length-10 vector. 17

Harmonic fields (Ghahramani, Lafferty and Zhu) Result 1: at the minimal energy state, each unlabeled node's value is the weighted average of its neighbors' values, $y_i = \frac{1}{\sum_j w_{ij}} \sum_j w_{ij}\, y_j$, while the seed nodes keep their observed labels. 18

Harmonic field LP algorithm Result 2: you can reach the minimal energy state with a simple iterative algorithm. Step 1: for each seed example $(x_i, y_i)$, let $V_0(i,c) = [\, y_i = c \,]$. Step 2: for $t = 1, \dots, T$ (T is about 5), let $V_{t+1}(i,c)$ be the weighted average of $V_t(j,c)$ over all $j$ linked to $i$, renormalized so that $\sum_c V_{t+1}(i,c) = 1$: $V_{t+1}(i,c) = \frac{1}{Z} \sum_j w_{i,j}\, V_t(j,c)$. For seeds, reset $V_{t+1}(i,c) = [\, y_i = c \,]$. 19
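A minimal sketch of this propagation loop in Python, assuming a dense symmetric weight matrix W (e.g. from a k-NN graph) and one-hot seed labels; the function and variable names are illustrative, not from the lecture.

```python
import numpy as np

def label_propagation(W, seed_labels, num_classes, T=5):
    """Harmonic-field / label-propagation sketch.

    W           : (n, n) symmetric non-negative weight matrix (e.g. a k-NN graph)
    seed_labels : dict {node_index: class_index} for the labeled seeds
    num_classes : number of classes
    T           : number of propagation passes (the slides use about 5)
    """
    n = W.shape[0]
    V = np.zeros((n, num_classes))
    for i, c in seed_labels.items():          # Step 1: clamp seeds to one-hot labels
        V[i, c] = 1.0
    for _ in range(T):                        # Step 2: iterate weighted averaging
        V = W @ V                             # each node averages its neighbors' label vectors
        row_sums = V.sum(axis=1, keepdims=True)
        V = np.divide(V, row_sums, out=np.zeros_like(V), where=row_sums > 0)
        for i, c in seed_labels.items():      # reset seeds after every pass
            V[i, :] = 0.0
            V[i, c] = 1.0
    return V                                  # V[i, c] ~ score of class c for node i
```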

Harmonic fields (Ghahramani, Lafferty and Zhu) This family of techniques is called label propagation. 20

Harmonic fields (Ghahramani, Lafferty and Zhu) This experiment points out some of the issues with LP: 1. What distance metric do you use? 2. What energy function do you minimize? 3. What is the right value for K in your K-NN graph? Is a K-NN graph right? 4. If you have lots of data, how expensive is it to build the graph? This family of techniques is called label propagation. 21

NELL: uses Co-EM ~= HF. Task: extract cities. Examples (noun phrases): Paris, Pittsburgh, Seattle, Cupertino, San Francisco, Austin, Berlin, denial, anxiety, selfishness. Features (contexts): "mayor of arg1", "live in arg1", "arg1 is home of", "traits such as arg1". 22

Semi-Supervised Bootstrapped Learning via Label Propagation (bipartite graph of noun phrases such as Paris, Pittsburgh, San Francisco, Austin, Seattle, anxiety, denial, selfishness, and contexts such as "mayor of arg1", "arg1 is home of", "live in arg1", "traits such as arg1"). 23

Semi-Supervised Bootstrapped Learning via Label Propagation Information from other categories tells you how far to propagate (when to stop). The figure contrasts nodes near the seeds (e.g. Paris, Pittsburgh, San Francisco, Austin, Seattle) with nodes far from the seeds (e.g. anxiety, denial, arrogance, selfishness). 24

Difference: graph construction is not instance-to-instance but instance-to-feature. Important reformulation: the k-NN graph is expensive to build, but the instance-feature graph may not be. 25

Some other general issues with SSL How much unlabeled data do you want? Suppose you're optimizing $J = J_L(L) + J_U(U)$. If $|U| \gg |L|$, does $J_U$ dominate $J$? If so, you're basically just clustering; often we need to balance $J_L$ and $J_U$. Besides L, what other information about the task is useful (or necessary)? A common choice: the relative frequency of classes. There are various ways of incorporating this into the optimization problem. 26
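One common way to keep the unlabeled term from dominating, shown here only as an illustration (the weight $\lambda$ and class frequencies $q_c$ are not from the slides), is to down-weight $J_U$ and tie the predicted class frequencies on U to those estimated from L:

```latex
% Illustrative balanced SSL objective (not from the slides):
\min_{\theta}\; J_L(L;\theta) \;+\; \lambda\, J_U(U;\theta)
\qquad \text{s.t.} \qquad
\frac{1}{|U|} \sum_{x \in U} \Pr(y = c \mid x;\theta) \;\approx\; q_c
\quad \text{for each class } c
```

where $q_c$ is the relative frequency of class $c$ estimated from L.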

ASONAM-2010 (Advances in Social Networks Analysis and Mining) 27

Network Datasets with Known Classes UBMCBlog AGBlog MSPBlog Cora Citeseer 28

RWR (random walk with restart), aka personalized PageRank: the fixpoint of $r = (1-d)\,u + d\,W r$, where $u$ is the restart distribution over the seed nodes and $W$ is the transition matrix of the graph. Seed selection: 1. order nodes by PageRank, degree, or randomly; 2. go down the list until you have at least k examples per class. 29
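A minimal power-iteration sketch of this RWR score, assuming a column-stochastic transition matrix P; MultiRankWalk (used later in these slides) runs one such walk per class, restarting at that class's seeds. The names and the damping value below are illustrative.

```python
import numpy as np

def rwr_scores(P, seed_nodes, d=0.85, iters=50):
    """Random walk with restart (personalized PageRank).

    P          : (n, n) column-stochastic transition matrix of the graph
    seed_nodes : indices of the seed examples for one class
    d          : damping factor (probability of following an edge)
    Returns scores approaching the fixpoint r = (1 - d) * u + d * P @ r.
    """
    n = P.shape[0]
    u = np.zeros(n)
    u[list(seed_nodes)] = 1.0 / len(seed_nodes)   # restart distribution over seeds
    r = u.copy()
    for _ in range(iters):
        r = (1.0 - d) * u + d * (P @ r)           # power iteration toward the fixpoint
    return r

# MultiRankWalk-style classification (illustrative): run one RWR per class
# and label each node with the class whose RWR score is highest.
```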

Results: HF method on blog data, with Random, Degree, and PageRank seed selection. 30

Results: more blog data (Random, Degree, PageRank seed selection). 31

Results: citation data (Random, Degree, PageRank seed selection). 32

Seeding MultiRankWalk 33

Seeding HF/wvRN 34

MultiRankWalk vs HF/wvRN/CoEM (seeds are marked S; panels show HF and MRW). 35

Back to Experiments: Network Datasets with Known Classes UBMCBlog AGBlog MSPBlog Cora Citeseer 36

MultiRankWalk vs wvRN/HF/CoEM 37

Harmonic Fields aka CoEM aka wvRN 38

CoEM/HF/wvRN One definition [Macskassy & Provost, JMLR 2007]: the simple (weighted-vote) relational classifier is the same as the harmonic field: the score of each node in the graph is the harmonic (linearly weighted) average of its neighbors' scores. 39

CoEM/wvRN/HF Another justification of the same algorithm goes back to 2003: start with co-training with a naïve Bayes learner. 40

CoEM/wvRN/HF One algorithm with several justifications. One is to start with co-training with a naïve Bayes learner and compare to an EM version of naïve Bayes. E-step: soft-classify the unlabeled examples with the NB classifier. M-step: re-train the classifier with the soft-labeled examples. 41

CoEM/wvRN/HF A second experiment: each + example concatenates features from two documents, one of class A+ and one of class B+; each - example concatenates features from two documents, one of class A- and one of class B-. Features are prefixed with A or B, so the two feature sets are disjoint. 42

CoEM/wvRN/HF A second experiment: each + example concatenates features from two documents, one of class A+ and one of class B+; each - example concatenates features from two documents, one of class A- and one of class B-. Features are prefixed with A or B, so the two feature sets are disjoint. Now co-training outperforms EM. 43

CoEM/wvRN/HF Co-training with a naïve Bayes learner vs an EM version of naïve Bayes. E-step: soft-classify the unlabeled examples with the NB classifier. M-step: re-train the classifier with the soft-labeled examples. Co-training makes incremental hard assignments; EM makes iterative soft assignments. 44
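A compact sketch of that EM loop, assuming dense bag-of-words count features and scikit-learn's MultinomialNB; encoding the soft labels as per-class sample weights is my own illustration, not the lecture's code.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_lab, y_lab, X_unlab, n_iters=10):
    """EM with naive Bayes: E-step soft-classifies U, M-step retrains on L plus
    soft-labeled U (soft labels become per-class sample weights)."""
    clf = MultinomialNB()
    clf.fit(X_lab, y_lab)                        # initialize from labeled data only
    classes = clf.classes_
    for _ in range(n_iters):
        post = clf.predict_proba(X_unlab)        # E-step: soft labels for U
        # M-step: one weighted copy of each unlabeled example per class
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] + [np.full(X_unlab.shape[0], c) for c in classes])
        w_all = np.concatenate([np.ones(len(y_lab))] + [post[:, k] for k in range(len(classes))])
        clf = MultinomialNB()
        clf.fit(X_all, y_all, sample_weight=w_all)
    return clf
```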

Co-EM for a rote learner: equivalent to HF on a bipartite graph of NPs (e.g. Pittsburgh) and contexts (e.g. "lives in _"), with +/- labels propagating between the two sides. 45

SSL AS OPTIMIZATION 46

SSL as optimization and Modified Adsorption (slides from Partha Talukdar) 47

Yet another name for HF/wvRN/CoEM 49

Objective (figure): a term to match the seeds, a smoothness term over graph edges, and a prior (regularization) term. 50
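For reference, the Modified Adsorption objective of Talukdar and Crammer (2009) has exactly these three terms; the notation below follows that paper and is not copied from the slide.

```latex
% Modified Adsorption (MAD) objective, per Talukdar & Crammer (2009):
% for each label l: match the seeds (S selects seed nodes), stay smooth
% over weighted edges, and regularize toward a prior R (e.g. a "dummy" label).
\min_{\hat{Y}} \sum_{l} \Big[
    \mu_1\, (Y_l - \hat{Y}_l)^\top S\, (Y_l - \hat{Y}_l)
  + \mu_2 \sum_{i,j} W_{ij}\, (\hat{Y}_{il} - \hat{Y}_{jl})^2
  + \mu_3\, \lVert \hat{Y}_l - R_l \rVert^2 \Big]
```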

Adsorption SSL algorithm 51

How to do this minimization? First, differentiate: the minimum is where the gradient is zero, which gives a linear system. Jacobi method: to solve $Ax = b$ for $x$, iterate $x_i^{(t+1)} = \frac{1}{a_{ii}}\big(b_i - \sum_{j \neq i} a_{ij}\, x_j^{(t)}\big)$, i.e. repeatedly update each coordinate using the other coordinates' current values. 54
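A minimal Jacobi-iteration sketch for a generic system Ax = b; applied to the gradient conditions of the objective above, each coordinate update becomes a weighted average of a node's neighbors plus its seed and prior terms. The function below is illustrative, not the lecture's code.

```python
import numpy as np

def jacobi(A, b, iters=100):
    """Solve A x = b with Jacobi iteration: each pass updates every coordinate
    from the other coordinates' previous values.
    Converges, e.g., when A is diagonally dominant."""
    D = np.diag(A)                 # diagonal entries a_ii
    R = A - np.diagflat(D)         # off-diagonal part of A
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        x = (b - R @ x) / D        # x_i <- (b_i - sum_{j != i} a_ij x_j) / a_ii
    return x
```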

Results: precision-recall break-even point (comparison including HF). 57

Data comes from HTML tables on the web that are used for data (not formatting), and from mining patterns like "musicians such as Bob Dylan". 60

MAD SKETCHES 63

Followup work (AISTATS 2014) Propagating labels usually requires a small number of optimization passes, basically like label-propagation passes; each is linear in the number of edges and the number of labels being propagated. Can you do better? Basic idea: store the labels in a count-min sketch, which is a compact approximation of an object-to-double mapping. 64

Count-min sketches split a real vector into k ranges, one for each hash function (initially all zeros). cm.inc("fred flintstone", 3): each of h1, h2, h3 hashes the key into its range and adds the value 3 at that location. cm.inc("barney rubble", 5): likewise adds 5; colliding keys simply add into the same cell (in the example, one cell ends up holding 3 + 5 = 8). 65

Count-min sketches: take the min when retrieving a value. cm.get("fred flintstone") returns the minimum over the h1, h2, h3 cells, which is 3 (the colliding cell holds 8, but an uncollided cell still holds 3); cm.get("barney rubble") returns 5. 66
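A small count-min sketch in the spirit of these slides, with increment, min-retrieval, and the linear-combination property used by the followup work below; the hashing details and sizes are illustrative.

```python
import numpy as np

class CountMinSketch:
    """Compact approximate object -> double map: k hash rows of width w.
    Retrieval takes the min across rows, so collisions only over-estimate."""

    def __init__(self, width=1000, depth=3, seed=0):
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width))
        self.seeds = np.random.RandomState(seed).randint(0, 2**31 - 1, size=depth)

    def _cols(self, key):
        # one hashed column per row; hash(key) salted with a per-row seed
        return [hash((int(s), key)) % self.width for s in self.seeds]

    def inc(self, key, value):
        for row, col in enumerate(self._cols(key)):
            self.table[row, col] += value        # add the value at every hashed cell

    def get(self, key):
        return min(self.table[row, col] for row, col in enumerate(self._cols(key)))

    def scaled_add(self, other, a=1.0, b=1.0):
        # Linearity: sketch(a*v + b*w) = a*sketch(v) + b*sketch(w),
        # provided both sketches share the same width, depth, and seeds.
        self.table = a * self.table + b * other.table
        return self

# cm = CountMinSketch(); cm.inc("fred flintstone", 3); cm.inc("barney rubble", 5)
# cm.get("fred flintstone")  -> approximately 3 (exact unless every row collides)
```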

Followup work (AISTATS 2014) Propagating labels usually requires a small number of optimization passes, basically like label-propagation passes; each is now linear in the number of edges and the sketch size (rather than the number of labels being propagated). Sketches can be combined linearly without unpacking them: sketch(av + bw) = a*sketch(v) + b*sketch(w). Sketches are good at storing skewed distributions. 67

Followup work (AISTATS 2014) Label distributions are often very skewed: initial labels are sparse, and because of community structure, labels from other subcommunities have small weight. 68

Followup work (AISTATS 2014) Self-injection: similarity computation (results on Freebase and Flickr-10k). 69

Followup work (AISTATS 2014): results on Freebase. 70

Followup work (AISTATS 2014) 100 GB available 71

Even more recent work (AISTATS 2016) 72

Differences: the objective function has terms for matching the seeds, smoothness, and staying close to a uniform label distribution, with normalized predictions. 73

Differences: scaling up. Updates are done in parallel with Pregel. The count-min sketch is replaced with a streaming approach: updates from neighbors arrive as a stream; break the stream into sections; maintain a list of (y, Prob(y), Δ); filter out labels at the end of each section if Prob(y)+Δ is small. 74
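A rough sketch of that streaming filter under my reading of the slide (not the paper's code): accumulate per-label mass over a section of the neighbor stream, track a slack Δ for mass that may have arrived before a label was first kept, and drop labels whose Prob(y)+Δ falls below a threshold. The names and threshold are illustrative.

```python
def prune_stream(neighbor_updates, section_size=1000, min_mass=1e-3):
    """neighbor_updates yields (label, weight) pairs from a node's neighbors."""
    kept = {}          # label -> [prob_mass, delta]
    dropped_mass = 0.0 # total mass discarded so far; bounds what a new label missed
    for i, (label, w) in enumerate(neighbor_updates, start=1):
        if label in kept:
            kept[label][0] += w
        else:
            kept[label] = [w, dropped_mass]   # new label: may have missed dropped mass
        if i % section_size == 0:             # end of a section: filter small labels
            for y in [y for y, (p, d) in kept.items() if p + d < min_mass]:
                dropped_mass += kept[y][0]
                del kept[y]
    return {y: p for y, (p, d) in kept.items()}
```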

Results with EXPANDER 75