Semi-Supervised Learning (SSL) on Graphs
Semi-supervised learning. Given: a pool of labeled examples L, and a (usually larger) pool of unlabeled examples U. Option 1 for using L and U: ignore U and use supervised learning on L. Option 2: ignore the labels in L+U and use k-means, etc., to find clusters; then label each cluster using L. Question: can you use both L and U to do better?
SSL Is Somewhere Between Clustering and Supervised Learning
What is a natural grouping among these objects? (slides: Bhavana Dalvi)
SSL is Between Clustering and SL: clustering is unconstrained and may not give you what you want; maybe this clustering is as good as the other.
SSL is Between Clustering and SL: supervised learning with few labels is also under-constrained and may not give you what you want.
SSL is Between Clustering and SL: this clustering isn't consistent with the labels.
SSL in Action: The NELL System
Types of SSL: margin-based (transductive SVM; logistic regression with entropic regularization); generative (seeded k-means); nearest-neighbor-like (graph-based SSL).
Harmonic Fields (aka CoEM, aka wvRN)
Harmonic fields (Zhu, Ghahramani & Lafferty). Idea: construct a graph connecting the most similar examples (a k-NN graph). Intuition: nearby points should have similar labels, and labels should propagate through the graph. Formalization: try to minimize the energy E(y) = Σ_{i,j} w_{ij} (y_i − y_j)^2, where y_i is fixed to the observed label for seed nodes. In this example y is a length-10 vector.
Harmonic fields (Zhu, Ghahramani & Lafferty). Result 1: at the minimal energy state, each node's value is the weighted average of its neighbors' values: y_i = (1/Σ_j w_{ij}) Σ_j w_{ij} y_j (seed nodes keep their observed labels).
Harmonic field LP algorithm. Result 2: you can reach the minimal energy state with a simple iterative algorithm. Step 1: for each seed example (x_i, y_i), let V_0(i,c) = [y_i = c]. Step 2: for t = 1, ..., T (T is about 5): let V_{t+1}(i,c) = (1/Z) Σ_j w_{i,j} V_t(j,c), the renormalized weighted average of V_t(j,c) over all j linked to i; for seeds, reset V_{t+1}(i,c) = [y_i = c].
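The two steps above can be sketched in a few lines of NumPy (a minimal illustration; the function name and the toy chain graph are mine, not from the slides):

```python
import numpy as np

def harmonic_lp(W, seeds, n_classes, T=5):
    """Harmonic-field label propagation, following the slide's two steps.

    W     : (n, n) symmetric nonnegative weight matrix of the k-NN graph
    seeds : dict mapping node index -> class index
    T     : number of averaging passes (the slide suggests T is about 5)
    """
    n = W.shape[0]
    V = np.zeros((n, n_classes))
    for i, c in seeds.items():          # Step 1: V0(i,c) = [y_i = c]
        V[i, c] = 1.0
    for _ in range(T):                  # Step 2: repeated weighted averaging
        V = W @ V                       # weighted sum of neighbors' scores
        Z = V.sum(axis=1, keepdims=True)
        V = np.divide(V, Z, out=np.zeros_like(V), where=Z > 0)  # renormalize
        for i, c in seeds.items():      # reset the seeds after every pass
            V[i] = 0.0
            V[i, c] = 1.0
    return V

# Tiny chain graph 0-1-2-3: node 0 seeded with class 0, node 3 with class 1;
# nodes 1 and 2 pick up labels from their nearest seed.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
V = harmonic_lp(W, seeds={0: 0, 3: 1}, n_classes=2)
print(V.argmax(axis=1))   # node 1 leans toward class 0, node 2 toward class 1
```

Note the reset step: without clamping the seeds back to their observed labels after each pass, the scores would drift toward a uniform fixed point.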
Harmonic fields (Zhu, Ghahramani & Lafferty). This family of techniques is called label propagation.
Harmonic fields (Zhu, Ghahramani & Lafferty). This experiment points out some of the issues with LP: 1. What distance metric do you use? 2. What energy function do you minimize? 3. What is the right value for k in your k-NN graph? Is a k-NN graph even right? 4. If you have lots of data, how expensive is it to build the graph?
NELL uses Co-EM, which is approximately HF. Task: extract cities. Examples: Paris, Pittsburgh, Seattle, Cupertino, San Francisco, Austin, denial, anxiety, selfishness, Berlin. Features: "mayor of arg1", "live in arg1", "arg1 is home of", "traits such as arg1".
Semi-Supervised Bootstrapped Learning via Label Propagation. [Bipartite graph: context features such as "mayor of arg1", "arg1 is home of", "live in arg1", and "traits such as arg1" linked to instances such as Paris, Pittsburgh, San Francisco, Austin, Seattle, anxiety, denial, and selfishness.]
Semi-Supervised Bootstrapped Learning via Label Propagation. Information from other categories tells you "how far" to propagate (when to stop): nodes near the seeds (Paris, Pittsburgh, San Francisco, Austin) vs. nodes far from the seeds (anxiety, denial, arrogance, selfishness).
Difference: graph construction is not instance-to-instance but instance-to-feature. Important reformulation: the k-NN graph is expensive to build; the instance-feature graph may not be.
Some other general issues with SSL. (1) How much unlabeled data do you want? Suppose you're optimizing J = J_L(L) + J_U(U). If |U| >> |L|, does J_U dominate J? If so, you're basically just clustering, so often we need to balance J_L and J_U. (2) Besides L, what other information about the task is useful (or necessary)? A common choice: the relative frequency of classes; there are various ways of incorporating this into the optimization problem.
ASONAM 2010 (Advances in Social Networks Analysis and Mining)
Network Datasets with Known Classes: UBMCBlog, AGBlog, MSPBlog, Cora, Citeseer
RWR (random walk with restart): the fixpoint of v = α·s + (1 − α)·W·v, aka Personalized PageRank, where s is the seed distribution and W a column-normalized transition matrix. Seed selection: 1. order nodes by PageRank, degree, or randomly; 2. go down the list until you have at least k examples per class.
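A minimal power-iteration sketch of the RWR fixpoint above, assuming a column-normalized transition matrix and restart probability α = 0.15 (the function name and toy graph are mine):

```python
import numpy as np

def rwr(W, seed_vec, alpha=0.15, iters=100):
    """Random walk with restart (personalized PageRank):
    iterate v <- alpha*s + (1-alpha)*Wn @ v until (approximate) fixpoint."""
    Wn = W / W.sum(axis=0, keepdims=True)   # column-normalize: transition matrix
    v = seed_vec / seed_vec.sum()           # start at the seed distribution
    s = v.copy()
    for _ in range(iters):
        v = alpha * s + (1 - alpha) * Wn @ v
    return v

# 4-node chain graph; restart at node 0.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
v = rwr(W, np.array([1.0, 0.0, 0.0, 0.0]))
print(v)  # a probability distribution; mass concentrates near the seed node
```

The resulting scores can then be used for seed selection or, as in MultiRankWalk, as per-class ranking scores.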
Results: HF method on blog data (seed selection: random, degree, PageRank).
Results: more blog data (seed selection: random, degree, PageRank).
Results: citation data (seed selection: random, degree, PageRank).
Seeding: MultiRankWalk.
Seeding: HF/wvRN.
MultiRankWalk vs HF/wvRN/CoEM: in the figures, seeds are marked "S"; HF and MRW predictions are shown side by side.
Back to Experiments: Network Datasets with Known Classes: UBMCBlog, AGBlog, MSPBlog, Cora, Citeseer
MultiRankWalk vs wvRN/HF/CoEM
Harmonic Fields (aka CoEM, aka wvRN)
CoEM/HF/wvRN. One definition [Macskassy & Provost, JMLR 2007]: the simple relational classifier is the same as the harmonic field; the score of each node in the graph is the harmonic (linearly weighted) average of its neighbors' scores.
CoEM/wvRN/HF: one algorithm with several justifications. Another justification goes back to 2003: start with co-training with a naïve Bayes learner, and compare it to an EM version of naïve Bayes. E step: soft-classify the unlabeled examples with the NB classifier. M step: re-train the classifier with the soft-labeled examples.
CoEM/wvRN/HF: a second experiment. Each + example: concatenate features from two documents, one of class A+ and one of class B+. Each − example: concatenate features from two documents, one of class A− and one of class B−. Features are prefixed with A or B, so the two views are disjoint. Now co-training outperforms EM.
CoEM/wvRN/HF. Co-training with a naïve Bayes learner vs. an EM version of naïve Bayes (E: soft-classify unlabeled examples with the NB classifier; M: re-train the classifier with the soft-labeled examples). Co-training makes incremental hard assignments; EM makes iterative soft assignments.
Co-EM for a rote learner: equivalent to HF on a bipartite graph, with NPs (e.g. "Pittsburgh") on one side and contexts (e.g. "lives in _") on the other, and +/− labels propagating across the links.
SSL as Optimization
SSL as Optimization and Modified Adsorption (slides from Partha Talukdar)
Yet another name for HF/wvRN/CoEM.
The objective combines two terms, matching the seeds plus a smoothness prior over the graph: J(ŷ) = Σ_{seeds i} (ŷ_i − y_i)^2 + μ Σ_{i,j} w_{ij} (ŷ_i − ŷ_j)^2.
Adsorption SSL algorithm
How to do this minimization? First, differentiate: the minimum is where the gradient vanishes, which gives a linear system Ax = b. Jacobi method: to solve Ax = b for x, iterate x_i^{(t+1)} = (1/a_{ii}) (b_i − Σ_{j≠i} a_{ij} x_j^{(t)}).
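The Jacobi update above can be sketched directly in NumPy (a small illustration on a toy diagonally dominant system; the names and example are mine):

```python
import numpy as np

def jacobi(A, b, iters=50):
    """Jacobi iteration for Ax = b:
    x_i <- (b_i - sum_{j != i} a_ij * x_j) / a_ii.
    Converges when A is diagonally dominant, as the harmonic-field system is."""
    D = np.diag(A)                      # diagonal entries a_ii
    R = A - np.diag(D)                  # off-diagonal part of A
    x = np.zeros_like(b)
    for _ in range(iters):
        x = (b - R @ x) / D             # all coordinates updated in parallel
    return x

# Toy system: 4x + y = 9, 2x + 5y = 12; exact solution (11/6, 5/3).
A = np.array([[4.0, 1.0], [2.0, 5.0]])
b = np.array([9.0, 12.0])
x = jacobi(A, b)
print(x)   # close to [1.8333..., 1.6666...]
```

Because every coordinate is updated from the previous iterate, each pass is one sparse matrix-vector product, which is what makes this practical on large graphs.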
Results (vs. HF/LP variants): precision-recall break-even point.
From HTML tables on the web that are used for data, not formatting; and from mining patterns like "musicians such as Bob Dylan".
MAD + Count-Min Sketches
Followup work (AISTATS 2014). Propagating labels usually requires a small number of optimization passes, basically like label propagation passes; each pass is linear in the number of edges and the number of labels being propagated. Can you do better? Basic idea: store the labels in a count-min sketch, which is a compact approximation of an object → double mapping.
Count-min sketches: split a real vector into k ranges, one for each hash function. cm.inc("fred flintstone", 3): hash the key with h1, h2, h3 and add the value at each hash location. cm.inc("barney rubble", 5): likewise.
cm.get(key): hash the key with h1, h2, h3 and take the min of the stored values when retrieving; here cm.get("fred flintstone") = 3 and cm.get("barney rubble") = 5.
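A toy count-min sketch matching the cm.inc/cm.get interface above (the class, hash construction, and parameters are illustrative; real implementations use cheaper hash functions than MD5):

```python
import hashlib

class CountMin:
    """Count-min sketch: k hash functions, each owning one range of width w.
    inc adds the value at every hash location; get takes the min over them,
    so collisions can only over-estimate, never under-estimate."""
    def __init__(self, k=3, w=1000):
        self.k, self.w = k, w
        self.t = [[0.0] * w for _ in range(k)]
    def _h(self, key, i):
        # i-th hash: salt the key with the table index, map into [0, w)
        d = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
        return int(d, 16) % self.w
    def inc(self, key, val):
        for i in range(self.k):
            self.t[i][self._h(key, i)] += val
    def get(self, key):
        return min(self.t[i][self._h(key, i)] for i in range(self.k))

cm = CountMin()
cm.inc("fred flintstone", 3)
cm.inc("barney rubble", 5)
print(cm.get("fred flintstone"), cm.get("barney rubble"))  # 3.0 5.0
```

Because inc is just addition into fixed arrays, two sketches built this way can be added cell-by-cell, which is the linearity property the followup work relies on.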
Followup work (AISTATS 2014). Each pass is now linear in the number of edges and the sketch size. Sketches can be combined linearly without unpacking them: sketch(av + bw) = a·sketch(v) + b·sketch(w). Sketches are good at storing skewed distributions.
Followup work (AISTATS 2014). Label distributions are often very skewed: initial labels are sparse, and because of community structure, labels from other subcommunities have small weight.
Followup work (AISTATS 2014). Results on "self-injection" and similarity computation: Freebase and Flickr-10k.
Followup work (AISTATS 2014). Results on Freebase.
Followup work (AISTATS 2014), with 100 GB available.
Even more recent work (AISTATS 2016)
Differences: the objective function has terms for matching the seeds, smoothness, and closeness to a uniform label distribution; the predictions are normalized.
Differences: scaling up. Updates are done in parallel with Pregel. Replace the count-min sketch with a streaming approach: updates from neighbors are a stream; break the stream into sections; maintain a list of (y, Prob(y), Δ); filter out a label at the end of a section if Prob(y) + Δ is small.
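A rough reconstruction of that streaming idea, not EXPANDER's actual code: process neighbor updates section by section, tracking Δ as an upper bound on the normalized mass of any label pruned so far, and drop labels whose Prob(y) + Δ falls below a threshold (function name, threshold, and toy stream are mine):

```python
def merge_streamed_labels(sections, keep_threshold=0.01):
    """Maintain a label -> accumulated-weight map over a stream of neighbor
    updates, pruning unlikely labels at each section boundary. delta bounds
    the normalized mass any single discarded label could still have had."""
    probs, delta = {}, 0.0
    for section in sections:
        for label, wt in section:            # accumulate this section's updates
            probs[label] = probs.get(label, 0.0) + wt
        z = sum(probs.values()) or 1.0
        kept = {}
        for label, p in probs.items():       # prune at the end of the section
            if p / z + delta >= keep_threshold:
                kept[label] = p
            else:
                delta = max(delta, p / z)    # remember the mass we discarded
        probs = kept
    z = sum(probs.values()) or 1.0
    return {label: p / z for label, p in probs.items()}

# Toy stream: two sections of (label, weight) updates from neighbors.
stream = [[("music", 5.0), ("film", 3.0), ("noise", 0.01)],
          [("music", 2.0), ("film", 1.0)]]
top = merge_streamed_labels(stream)
print(sorted(top, key=top.get, reverse=True))   # ['music', 'film']
```

The point of the Δ term is that pruning is conservative: a label is only dropped when even adding back the largest discarded mass could not make it pass the threshold.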
Results with EXPANDER