Semi-supervised learning (SSL) on graphs

1 Semi-supervised learning (SSL) on graphs

2 Announcement: no office hour for William after class today!

3 Semi-supervised learning. Given: a pool of labeled examples L, and a (usually larger) pool of unlabeled examples U. Option 1 for using L and U: ignore U and use supervised learning on L. Option 2: ignore the labels in L and U and use k-means, etc. to find clusters, then label each cluster using L. Question: can you use both L and U to do better?
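
A minimal sketch of Option 2 (cluster, then label each cluster by majority vote over L), assuming scikit-learn and nonnegative integer class labels; the function name and setup are illustrative, not from the slides:

```python
# Hypothetical sketch of Option 2: cluster L+U together, then give each
# cluster the majority label among its labeled members.
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_label(X_l, y_l, X_u, n_clusters=10):
    X = np.vstack([X_l, X_u])
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    labeled_ids = cluster_ids[:len(X_l)]
    cluster_label = {}
    for c in range(n_clusters):
        members = y_l[labeled_ids == c]          # labels falling in cluster c
        if len(members) > 0:
            cluster_label[c] = np.bincount(members).argmax()
    # -1 marks clusters that contain no labeled examples at all.
    return np.array([cluster_label.get(c, -1) for c in cluster_ids[len(X_l):]])
```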

4 SSL is Somewhere Between Clustering and Supervised Learning

5 SSL is Between Clustering and SL

6 What is a natural grouping among these objects? [slides: Bhavana Dalvi]

7 SSL is Between Clustering and SL. Clustering is unconstrained and may not give you what you want: maybe this clustering is as good as the other.

8 SSL is Between Clustering and SL

9 SSL is Between Clustering and SL

10 SSL is Between Clustering and SL. Supervised learning with few labels is also unconstrained and may not give you what you want.

11 SSL is Between Clustering and SL

12 SSL is Between Clustering and SL. This clustering isn't consistent with the labels.

13 SSL is Between Clustering and SL

14 SSL in Action: The NELL System

15 Types of SSL. Margin-based: transductive SVM; logistic regression with entropic regularization. Generative: seeded k-means. Nearest-neighbor-like: graph-based SSL.

16 Harmonic Fields (aka CoEM, aka wvRN)

17 Harmonic fields (Ghahramani, Lafferty and Zhu). Idea: construct a graph connecting the most similar examples (a k-NN graph). Intuition: nearby points should have similar labels; labels should propagate through the graph. Formalization: try to minimize the energy E(y) = Σ_{i,j} w_{ij} (y_i − y_j)², holding each observed label fixed. In this example y is a length-10 vector.

18 Harmonic fields (Ghahramani, Lafferty and Zhu). Result 1: at the minimal-energy state, each node's value is the weighted average of its neighbors' values, y_i = (Σ_j w_{ij} y_j) / (Σ_j w_{ij}), with the observed labels held fixed.

19 Harmonic field label propagation (LP) algorithm. Result 2: you can reach the minimal-energy state with a simple iterative algorithm. Step 1: for each seed example (x_i, y_i), let V_0(i,c) = [y_i = c]. Step 2: for t = 1, ..., T (T is about 5): let V_{t+1}(i,c) be the weighted average of V_t(j,c) over all j linked to i, and renormalize: V_{t+1}(i,c) = (1/Z) Σ_j w_{i,j} V_t(j,c). For seeds, reset V_{t+1}(i,c) = [y_i = c].
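
A compact Python sketch of this iterative algorithm, assuming a dense nonnegative weight matrix W and integer seed labels; names and details are illustrative:

```python
import numpy as np

def harmonic_lp(W, seed_idx, seed_labels, n_classes, T=5):
    """Iterative harmonic-fields label propagation on weight matrix W."""
    n = W.shape[0]
    V = np.zeros((n, n_classes))
    V[seed_idx, seed_labels] = 1.0                 # Step 1: V0(i,c) = [y_i = c]
    for _ in range(T):                             # Step 2: T is about 5
        V = W @ V                                  # weighted sum over neighbors
        V /= V.sum(axis=1, keepdims=True) + 1e-12  # renormalize each node
        V[seed_idx] = 0.0                          # reset seeds to their labels
        V[seed_idx, seed_labels] = 1.0
    return V.argmax(axis=1)
```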

20 Harmonic fields (Ghahramani, Lafferty and Zhu). This family of techniques is called label propagation.

21 Harmonic fields (Ghahramani, Lafferty and Zhu). This experiment points out some of the issues with LP: 1. What distance metric do you use? 2. What energy function do you minimize? 3. What is the right value for k in your k-NN graph? Is a k-NN graph right at all? 4. If you have lots of data, how expensive is it to build the graph?

22 NELL uses Co-EM (~= HF). Task: extract cities. [Figure: example noun phrases (Paris, Pittsburgh, Seattle, Cupertino, San Francisco, Austin, Berlin, denial, anxiety, selfishness) on one side; features/contexts (mayor of arg1, live in arg1, arg1 is home of, traits such as arg1) on the other.]

23 Semi-Supervised Bootstrapped Learning via Label Propagation. [Figure: bipartite graph linking contexts (mayor of arg1, arg1 is home of, live in arg1, traits such as arg1) to noun phrases (Paris, Pittsburgh, San Francisco, Austin, Seattle, anxiety, denial, selfishness).]

24 Semi-Supervised Bootstrapped Learning via Label Propagation. Information from other categories tells you how far to go (when to stop propagating). [Figure: same graph, with nodes near the seeds (Paris, Pittsburgh, San Francisco, Austin, Seattle) contrasted with nodes far from the seeds (anxiety, denial, arrogance, selfishness).]

25 Difference: graph construction is not instance-to-instance but instance-to-feature. Important reformulation: the k-NN graph is expensive to build; the instance-feature graph may not be. [Figure: instance-feature graph.]

26 Some other general issues with SSL. How much unlabeled data do you want? Suppose you're optimizing J = J_L(L) + J_U(U). If |U| >> |L|, does J_U dominate J? If so, you're basically just clustering; often we need to balance J_L and J_U. Besides L, what other information about the task is useful (or necessary)? A common choice is the relative frequency of the classes; there are various ways of incorporating this into the optimization problem.

27 ASONAM-2010 (Advances in Social Networks Analysis and Mining)

28 Network datasets with known classes: UBMCBlog, AGBlog, MSPBlog, Cora, Citeseer.

29 RWR, aka Personalized PageRank: the fixpoint of r = (1 − α) s + α W r, where s is the restart (seed) distribution. Seed selection: 1. order nodes by PageRank, degree, or randomly; 2. go down the list until you have at least k examples per class.
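
A power-iteration sketch of RWR under the fixpoint above; the restart probability and the column-stochastic normalization are standard choices I'm assuming, not taken from the slide:

```python
import numpy as np

def rwr_scores(W, seed_idx, alpha=0.85, iters=100):
    """Power iteration for r = (1 - alpha) * s + alpha * P @ r."""
    P = W / (W.sum(axis=0, keepdims=True) + 1e-12)   # column-stochastic walk matrix
    s = np.zeros(W.shape[0])
    s[seed_idx] = 1.0 / len(seed_idx)                # restart to the seeds
    r = s.copy()
    for _ in range(iters):
        r = (1 - alpha) * s + alpha * (P @ r)
    return r   # run once per class; classify each node by its highest score
```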

30 Results: HF method, blog data. [Charts: seeds chosen at random, by degree, and by PageRank.]

31 Results: more blog data. [Charts: random / degree / PageRank seeding.]

32 Results: citation data. [Charts: random / degree / PageRank seeding.]

33 Seeding: MultiRankWalk. [Figure]

34 Seeding: HF/wvRN. [Figure]

35 MultiRankWalk vs HF/wvRN/CoEM. [Figure: seeds are marked S; HF and MRW propagation compared.]

36 Back to experiments: network datasets with known classes (UBMCBlog, AGBlog, MSPBlog, Cora, Citeseer).

37 MultiRankWalk vs wvRN/HF/CoEM

38 Harmonic Fields (aka CoEM, aka wvRN)

39 CoEM/HF/wvRN. One definition [Macskassy & Provost, JMLR 2007]: the simple relational classifier is the same as the harmonic field: the score of each node in the graph is the harmonic (linearly weighted) average of its neighbors' scores.

40 CoEM/wvRN/HF. Another justification of the same algorithm goes back to 2003: start with co-training with a naïve Bayes learner.

41 CoEM/wvRN/HF. One algorithm with several justifications. One is to start with co-training with a naïve Bayes learner, and compare it to an EM version of naïve Bayes. E: soft-classify the unlabeled examples with the NB classifier. M: re-train the classifier with the soft-labeled examples.
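
A rough sketch of the EM-with-naïve-Bayes loop, assuming dense count features and scikit-learn's MultinomialNB; replicating U once per class with probability weights is one way to implement the soft M-step, not necessarily the original one:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_l, y_l, X_u, n_iters=10):
    clf = MultinomialNB().fit(X_l, y_l)
    classes = clf.classes_
    for _ in range(n_iters):
        probs = clf.predict_proba(X_u)               # E-step: soft-classify U
        # M-step: retrain on L plus one weighted copy of U per class.
        X_all = np.vstack([X_l] + [X_u] * len(classes))
        y_all = np.concatenate([y_l] + [np.full(len(X_u), c) for c in classes])
        w_all = np.concatenate([np.ones(len(y_l))] +
                               [probs[:, k] for k in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```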

42 CoEM/wvRN/HF. A second experiment. Each + example: concatenate features from two documents, one of class A+ and one of class B+. Each − example: concatenate features from two documents, one of class A− and one of class B−. Features are prefixed with A or B → the two feature sets are disjoint.

43 CoEM/wvRN/HF. The second experiment, continued: with the disjoint A/B feature prefixes, NOW co-training outperforms EM.

44 CoEM/wvRN/HF. Co-training with a naïve Bayes learner (incremental hard assignments) vs an EM version of naïve Bayes (iterative soft assignments). E: soft-classify the unlabeled examples with the NB classifier. M: re-train the classifier with the soft-labeled examples.

45 Co-EM for a rote learner: equivalent to HF on a bipartite graph. [Figure: bipartite graph of NPs (e.g., Pittsburgh) and contexts (e.g., lives in _).]

46 SSL AS OPTIMIZATION

47 SSL as optimization, and Modified Adsorption (slides from Partha Talukdar)

48 [figure-only slide]

49 Yet another name for HF/wvRN/CoEM

50 The objective has three parts: match the seeds; smoothness; a prior.
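
For concreteness, the Modified Adsorption objective is usually written with exactly these three terms; this is the standard Talukdar & Crammer form, which I'm assuming is the one on the slide:

```latex
\min_{\hat{Y}} \sum_{\ell=1}^{m}
  \Big[ \mu_1 \, (Y_\ell - \hat{Y}_\ell)^{\top} S \, (Y_\ell - \hat{Y}_\ell)
      + \mu_2 \, \hat{Y}_\ell^{\top} L \, \hat{Y}_\ell
      + \mu_3 \, \lVert \hat{Y}_\ell - R_\ell \rVert_2^2 \Big]
```

Here S is a diagonal matrix marking the seed nodes (match seeds), L is the graph Laplacian (smoothness), and R_ℓ encodes the prior (the dummy-label regularizer).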

51 Adsorption SSL algorithm

52 [figure-only slide]

53 [figure-only slide]

54 How to do this minimization? First, differentiate: the minimum is the solution of a linear system Ax = b. Jacobi method: to solve Ax = b for x, split A = D + R into its diagonal and off-diagonal parts and iterate x^{(t+1)} = D^{-1} (b − R x^{(t)}), or componentwise: x_i^{(t+1)} = (b_i − Σ_{j≠i} A_{ij} x_j^{(t)}) / A_{ii}.
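
A minimal Jacobi solver matching the componentwise update above (an illustrative sketch; convergence needs A to be, e.g., diagonally dominant):

```python
import numpy as np

def jacobi(A, b, iters=100):
    """Jacobi iteration for Ax = b: x_i <- (b_i - sum_{j!=i} A_ij x_j) / A_ii."""
    D = np.diag(A)                  # diagonal entries of A
    R = A - np.diagflat(D)          # off-diagonal part of A
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        x = (b - R @ x) / D
    return x
```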

55 [figure-only slide]

56 [figure-only slide]

57 Results (vs. HF): precision-recall break-even point. [Table]

58 Results (vs. HF), continued. [Table]

59 Results (vs. HF), continued. [Table]

60 Graphs built from HTML tables on the web that are used for data, not formatting, and from mining patterns like "musicians such as Bob Dylan".

61 [figure-only slide]

62 [figure-only slide]

63 MAD SKETCHES

64 Followup work (AIStats 2014). Propagating labels usually requires a small number of optimization passes, basically like label propagation passes; each is linear in the number of edges and the number of labels being propagated. Can you do better? Basic idea: store the labels in a count-min sketch, which is a compact approximation of an object→double mapping.

65 Count-min sketches: split a real vector into k ranges, one per hash function. cm.inc('fred flintstone', 3): add the value at each of the hash locations h1, h2, h3. cm.inc('barney rubble', 5): likewise. [Figure: sketch table rows.]

66 Count-min sketches: cm.get('fred flintstone'): take the min over the hash locations h1, h2, h3 when retrieving a value (here returning 3; 5 for 'barney rubble'). [Figure: sketch table rows.]
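
A small count-min sketch in the spirit of slides 65-66; the table layout (one row per hash function) follows the slides, while the seeded-tuple hashing is my own choice for illustration:

```python
import numpy as np

class CountMinSketch:
    def __init__(self, depth=3, width=1000):
        self.depth, self.width = depth, width
        self.table = np.zeros((depth, width))

    def _cols(self, key):
        # One hash function per row, derived by seeding with the row index.
        return [hash((i, key)) % self.width for i in range(self.depth)]

    def inc(self, key, value):
        for i, c in enumerate(self._cols(key)):
            self.table[i, c] += value    # add the value at each hash location

    def get(self, key):
        # Collisions only inflate counts, so the minimum cell is the
        # tightest (over-)estimate of the true value.
        return min(self.table[i, c] for i, c in enumerate(self._cols(key)))
```

With this sketch, cm.inc('fred flintstone', 3) followed by cm.get('fred flintstone') returns 3.0, or slightly more if every row happened to collide.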

67 Followup work (AIStats 2014). Propagating labels usually requires a small number of optimization passes, basically like label propagation passes; with sketches, each pass is linear in the number of edges and the sketch size rather than the number of labels being propagated. Sketches can be combined linearly without unpacking them: sketch(a·v + b·w) = a·sketch(v) + b·sketch(w). Sketches are good at storing skewed distributions.
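
Since each stored table is just a vector of counts, the linear-combination property is elementwise; a sketch using the hypothetical CountMinSketch class above:

```python
def combine(a, cm_v, b, cm_w):
    """sketch(a*v + b*w) = a*sketch(v) + b*sketch(w), assuming both sketches
    share the same depth, width, and hash functions."""
    out = CountMinSketch(cm_v.depth, cm_v.width)
    out.table = a * cm_v.table + b * cm_w.table   # tables add elementwise
    return out
```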

68 Followup work (AIStats 2014). Label distributions are often very skewed: the initial labels are sparse, and with community structure, labels from other subcommunities carry small weight.

69 Followup work (AIStats 2014): "self-injection" and the similarity computation. [Charts: Freebase, Flickr-10k.]

70 Followup work (AIStats 2014). [Chart: Freebase results.]

71 Followup work (AIStats 2014). [Chart: results with 100 GB of memory available.]

72 Even more recent work (AIStats 2016)

73 Differences: the objective function. Terms: match the seeds; smoothness; a close-to-uniform label distribution; normalized predictions.

74 Differences: scaling up. Updates are done in parallel with Pregel. Replace the count-min sketch with a streaming approach: updates from neighbors form a stream; break the stream into sections; maintain a list of (y, Prob(y), Δ); filter out labels at the end of each section if Prob(y) + Δ is small.
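
A guess at the per-node streaming filter described above, simplified to track one global pruned-mass bound Δ rather than one per label; the names and the threshold are illustrative:

```python
def stream_top_labels(sections, threshold=1e-3):
    """sections: iterable of lists of (label, prob) updates from neighbors.
    Returns an approximate label->prob map plus delta, a bound on the
    total probability mass of the labels that were pruned."""
    probs, delta = {}, 0.0
    for section in sections:
        for y, p in section:
            probs[y] = probs.get(y, 0.0) + p
        # End of section: drop labels whose prob plus the pruned mass is small.
        for y in [y for y, p in probs.items() if p + delta < threshold]:
            delta += probs.pop(y)
    return probs, delta
```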

75 Results with EXPANDER
