Semi-Supervised Learning (SSL) on Graphs
Semi-supervised learning. Given: a pool of labeled examples L, and a (usually larger) pool of unlabeled examples U. Option 1 for using L and U: ignore U and use supervised learning on L. Option 2: ignore the labels in L+U and use k-means, etc., to find clusters; then label each cluster using L. Question: can you use both L and U to do better?
SSL Is Somewhere Between Clustering and Supervised Learning
What is a natural grouping among these objects? (slides: Bhavana Dalvi)
SSL is Between Clustering and SL: clustering is unconstrained and may not give you what you want; maybe this clustering is as good as the other.
SSL is Between Clustering and SL: supervised learning with few labels is also under-constrained and may not give you what you want.
SSL is Between Clustering and SL: this clustering isn't consistent with the labels.
SSL in Action: The NELL System
Types of SSL: margin-based (transductive SVM; logistic regression with entropic regularization); generative (seeded k-means); nearest-neighbor-like (graph-based SSL).
Harmonic Fields (aka CoEM, aka wvRN)
Harmonic fields (Zhu, Ghahramani & Lafferty). Idea: construct a graph connecting the most similar examples (a k-NN graph). Intuition: nearby points should have similar labels, and labels should propagate through the graph. Formalization: try to minimize the energy E(y) = Σ_{i,j} w_{ij} (y_i − y_j)^2, where y_i is fixed to the observed label for seed nodes. In this example y is a length-10 vector.
Harmonic fields (Zhu, Ghahramani & Lafferty). Result 1: at the minimal energy state, each node's value is the weighted average of its neighbors' values: y_i = (1/Σ_j w_{ij}) Σ_j w_{ij} y_j (seed nodes keep their observed labels).
Harmonic field LP algorithm. Result 2: you can reach the minimal energy state with a simple iterative algorithm. Step 1: for each seed example (x_i, y_i), let V_0(i,c) = [y_i = c]. Step 2: for t = 1, ..., T (T is about 5): let V_{t+1}(i,c) = (1/Z) Σ_j w_{i,j} V_t(j,c), the renormalized weighted average of V_t(j,c) over all j linked to i; for seeds, reset V_{t+1}(i,c) = [y_i = c].
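The two steps above can be sketched in a few lines of NumPy (a minimal illustration; the function name and the toy chain graph are mine, not from the slides):

```python
import numpy as np

def harmonic_lp(W, seeds, n_classes, T=5):
    """Harmonic-field label propagation, following the slide's two steps.

    W     : (n, n) symmetric nonnegative weight matrix of the k-NN graph
    seeds : dict mapping node index -> class index
    T     : number of averaging passes (the slide suggests T is about 5)
    """
    n = W.shape[0]
    V = np.zeros((n, n_classes))
    for i, c in seeds.items():          # Step 1: V0(i,c) = [y_i = c]
        V[i, c] = 1.0
    for _ in range(T):                  # Step 2: repeated weighted averaging
        V = W @ V                       # weighted sum of neighbors' scores
        Z = V.sum(axis=1, keepdims=True)
        V = np.divide(V, Z, out=np.zeros_like(V), where=Z > 0)  # renormalize
        for i, c in seeds.items():      # reset the seeds after every pass
            V[i] = 0.0
            V[i, c] = 1.0
    return V

# Tiny chain graph 0-1-2-3: node 0 seeded with class 0, node 3 with class 1;
# nodes 1 and 2 pick up labels from their nearest seed.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
V = harmonic_lp(W, seeds={0: 0, 3: 1}, n_classes=2)
print(V.argmax(axis=1))   # node 1 leans toward class 0, node 2 toward class 1
```

Note the reset step: without clamping the seeds back to their observed labels after each pass, the scores would drift toward a uniform fixed point.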
Harmonic fields (Zhu, Ghahramani & Lafferty). This family of techniques is called label propagation.
Harmonic fields (Zhu, Ghahramani & Lafferty). This experiment points out some of the issues with LP: 1. What distance metric do you use? 2. What energy function do you minimize? 3. What is the right value for k in your k-NN graph? Is a k-NN graph even right? 4. If you have lots of data, how expensive is it to build the graph?
NELL uses Co-EM, which is approximately HF. Task: extract cities. Examples: Paris, Pittsburgh, Seattle, Cupertino, San Francisco, Austin, denial, anxiety, selfishness, Berlin. Features: "mayor of arg1", "live in arg1", "arg1 is home of", "traits such as arg1".
Semi-Supervised Bootstrapped Learning via Label Propagation. [Bipartite graph: context features such as "mayor of arg1", "arg1 is home of", "live in arg1", and "traits such as arg1" linked to instances such as Paris, Pittsburgh, San Francisco, Austin, Seattle, anxiety, denial, and selfishness.]
Semi-Supervised Bootstrapped Learning via Label Propagation. Information from other categories tells you "how far" to propagate (when to stop): nodes near the seeds (Paris, Pittsburgh, San Francisco, Austin) vs. nodes far from the seeds (anxiety, denial, arrogance, selfishness).
Difference: graph construction is not instance-to-instance but instance-to-feature. Important reformulation: the k-NN graph is expensive to build; the instance-feature graph may not be.
Some other general issues with SSL. (1) How much unlabeled data do you want? Suppose you're optimizing J = J_L(L) + J_U(U). If |U| >> |L|, does J_U dominate J? If so, you're basically just clustering, so often we need to balance J_L and J_U. (2) Besides L, what other information about the task is useful (or necessary)? A common choice: the relative frequency of classes; there are various ways of incorporating this into the optimization problem.
ASONAM 2010 (Advances in Social Networks Analysis and Mining)
Network Datasets with Known Classes: UBMCBlog, AGBlog, MSPBlog, Cora, Citeseer
RWR (random walk with restart): the fixpoint of v = α·s + (1 − α)·W·v, aka Personalized PageRank, where s is the seed distribution and W a column-normalized transition matrix. Seed selection: 1. order nodes by PageRank, degree, or randomly; 2. go down the list until you have at least k examples per class.
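A minimal power-iteration sketch of the RWR fixpoint above, assuming a column-normalized transition matrix and restart probability α = 0.15 (the function name and toy graph are mine):

```python
import numpy as np

def rwr(W, seed_vec, alpha=0.15, iters=100):
    """Random walk with restart (personalized PageRank):
    iterate v <- alpha*s + (1-alpha)*Wn @ v until (approximate) fixpoint."""
    Wn = W / W.sum(axis=0, keepdims=True)   # column-normalize: transition matrix
    v = seed_vec / seed_vec.sum()           # start at the seed distribution
    s = v.copy()
    for _ in range(iters):
        v = alpha * s + (1 - alpha) * Wn @ v
    return v

# 4-node chain graph; restart at node 0.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
v = rwr(W, np.array([1.0, 0.0, 0.0, 0.0]))
print(v)  # a probability distribution; mass concentrates near the seed node
```

The resulting scores can then be used for seed selection or, as in MultiRankWalk, as per-class ranking scores.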
Results: HF method on blog data (seed selection: random, degree, PageRank).
Results: more blog data (seed selection: random, degree, PageRank).
Results: citation data (seed selection: random, degree, PageRank).
Seeding: MultiRankWalk.
Seeding: HF/wvRN.
MultiRankWalk vs HF/wvRN/CoEM: in the figures, seeds are marked "S"; HF and MRW predictions are shown side by side.
Back to Experiments: Network Datasets with Known Classes: UBMCBlog, AGBlog, MSPBlog, Cora, Citeseer
MultiRankWalk vs wvRN/HF/CoEM
Harmonic Fields (aka CoEM, aka wvRN)
CoEM/HF/wvRN. One definition [Macskassy & Provost, JMLR 2007]: the simple relational classifier is the same as the harmonic field; the score of each node in the graph is the harmonic (linearly weighted) average of its neighbors' scores.
CoEM/wvRN/HF: one algorithm with several justifications. Another justification goes back to 2003: start with co-training with a naïve Bayes learner, and compare it to an EM version of naïve Bayes. E step: soft-classify the unlabeled examples with the NB classifier. M step: re-train the classifier with the soft-labeled examples.
CoEM/wvRN/HF: a second experiment. Each + example: concatenate features from two documents, one of class A+ and one of class B+. Each − example: concatenate features from two documents, one of class A− and one of class B−. Features are prefixed with A or B, so the two views are disjoint. Now co-training outperforms EM.
CoEM/wvRN/HF. Co-training with a naïve Bayes learner vs. an EM version of naïve Bayes (E: soft-classify unlabeled examples with the NB classifier; M: re-train the classifier with the soft-labeled examples). Co-training makes incremental hard assignments; EM makes iterative soft assignments.
Co-EM for a rote learner: equivalent to HF on a bipartite graph, with NPs (e.g. "Pittsburgh") on one side and contexts (e.g. "lives in _") on the other, and +/− labels propagating across the links.
SSL as Optimization
SSL as Optimization and Modified Adsorption (slides from Partha Talukdar)
Yet another name for HF/wvRN/CoEM.
The objective combines two terms, matching the seeds plus a smoothness prior over the graph: J(ŷ) = Σ_{seeds i} (ŷ_i − y_i)^2 + μ Σ_{i,j} w_{ij} (ŷ_i − ŷ_j)^2.
Adsorption SSL algorithm
How to do this minimization? First, differentiate: the minimum is where the gradient vanishes, which gives a linear system Ax = b. Jacobi method: to solve Ax = b for x, iterate x_i^{(t+1)} = (1/a_{ii}) (b_i − Σ_{j≠i} a_{ij} x_j^{(t)}).
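The Jacobi update above can be sketched directly in NumPy (a small illustration on a toy diagonally dominant system; the names and example are mine):

```python
import numpy as np

def jacobi(A, b, iters=50):
    """Jacobi iteration for Ax = b:
    x_i <- (b_i - sum_{j != i} a_ij * x_j) / a_ii.
    Converges when A is diagonally dominant, as the harmonic-field system is."""
    D = np.diag(A)                      # diagonal entries a_ii
    R = A - np.diag(D)                  # off-diagonal part of A
    x = np.zeros_like(b)
    for _ in range(iters):
        x = (b - R @ x) / D             # all coordinates updated in parallel
    return x

# Toy system: 4x + y = 9, 2x + 5y = 12; exact solution (11/6, 5/3).
A = np.array([[4.0, 1.0], [2.0, 5.0]])
b = np.array([9.0, 12.0])
x = jacobi(A, b)
print(x)   # close to [1.8333..., 1.6666...]
```

Because every coordinate is updated from the previous iterate, each pass is one sparse matrix-vector product, which is what makes this practical on large graphs.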
Results (vs. HF/LP variants): precision-recall break-even point.
From HTML tables on the web that are used for data, not formatting; and from mining patterns like "musicians such as Bob Dylan".
MAD + Count-Min Sketches
Followup work (AISTATS 2014). Propagating labels usually requires a small number of optimization passes, basically like label propagation passes; each pass is linear in the number of edges and the number of labels being propagated. Can you do better? Basic idea: store the labels in a count-min sketch, which is a compact approximation of an object → double mapping.
Count-min sketches: split a real vector into k ranges, one for each hash function. cm.inc("fred flintstone", 3): hash the key with h1, h2, h3 and add the value at each hash location. cm.inc("barney rubble", 5): likewise.
cm.get(key): hash the key with h1, h2, h3 and take the min of the stored values when retrieving; here cm.get("fred flintstone") = 3 and cm.get("barney rubble") = 5.
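A toy count-min sketch matching the cm.inc/cm.get interface above (the class, hash construction, and parameters are illustrative; real implementations use cheaper hash functions than MD5):

```python
import hashlib

class CountMin:
    """Count-min sketch: k hash functions, each owning one range of width w.
    inc adds the value at every hash location; get takes the min over them,
    so collisions can only over-estimate, never under-estimate."""
    def __init__(self, k=3, w=1000):
        self.k, self.w = k, w
        self.t = [[0.0] * w for _ in range(k)]
    def _h(self, key, i):
        # i-th hash: salt the key with the table index, map into [0, w)
        d = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
        return int(d, 16) % self.w
    def inc(self, key, val):
        for i in range(self.k):
            self.t[i][self._h(key, i)] += val
    def get(self, key):
        return min(self.t[i][self._h(key, i)] for i in range(self.k))

cm = CountMin()
cm.inc("fred flintstone", 3)
cm.inc("barney rubble", 5)
print(cm.get("fred flintstone"), cm.get("barney rubble"))  # 3.0 5.0
```

Because inc is just addition into fixed arrays, two sketches built this way can be added cell-by-cell, which is the linearity property the followup work relies on.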
Followup work (AISTATS 2014). Each pass is now linear in the number of edges and the sketch size. Sketches can be combined linearly without unpacking them: sketch(av + bw) = a·sketch(v) + b·sketch(w). Sketches are good at storing skewed distributions.
Followup work (AISTATS 2014). Label distributions are often very skewed: initial labels are sparse, and because of community structure, labels from other subcommunities have small weight.
Followup work (AISTATS 2014). Results on "self-injection" and similarity computation: Freebase and Flickr-10k.
Followup work (AISTATS 2014). Results on Freebase.
Followup work (AISTATS 2014), with 100 GB available.
Even more recent work (AISTATS 2016)
Differences: the objective function has terms for matching the seeds, smoothness, and closeness to a uniform label distribution; the predictions are normalized.
Differences: scaling up. Updates are done in parallel with Pregel. Replace the count-min sketch with a streaming approach: updates from neighbors are a stream; break the stream into sections; maintain a list of (y, Prob(y), Δ); filter out a label at the end of a section if Prob(y) + Δ is small.
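A rough reconstruction of that streaming idea, not EXPANDER's actual code: process neighbor updates section by section, tracking Δ as an upper bound on the normalized mass of any label pruned so far, and drop labels whose Prob(y) + Δ falls below a threshold (function name, threshold, and toy stream are mine):

```python
def merge_streamed_labels(sections, keep_threshold=0.01):
    """Maintain a label -> accumulated-weight map over a stream of neighbor
    updates, pruning unlikely labels at each section boundary. delta bounds
    the normalized mass any single discarded label could still have had."""
    probs, delta = {}, 0.0
    for section in sections:
        for label, wt in section:            # accumulate this section's updates
            probs[label] = probs.get(label, 0.0) + wt
        z = sum(probs.values()) or 1.0
        kept = {}
        for label, p in probs.items():       # prune at the end of the section
            if p / z + delta >= keep_threshold:
                kept[label] = p
            else:
                delta = max(delta, p / z)    # remember the mass we discarded
        probs = kept
    z = sum(probs.values()) or 1.0
    return {label: p / z for label, p in probs.items()}

# Toy stream: two sections of (label, weight) updates from neighbors.
stream = [[("music", 5.0), ("film", 3.0), ("noise", 0.01)],
          [("music", 2.0), ("film", 1.0)]]
top = merge_streamed_labels(stream)
print(sorted(top, key=top.get, reverse=True))   # ['music', 'film']
```

The point of the Δ term is that pruning is conservative: a label is only dropped when even adding back the largest discarded mass could not make it pass the threshold.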
Results with EXPANDER