Graph-based Semi-Supervised Learning as Optimization
1 Graph-based Semi-Supervised Learning as Optimization. Partha Pratim Talukdar, CMU. Machine Learning with Large Datasets (10-605), April 3, 2012
2-5 Graph-based Semi-Supervised Learning [figure: a graph whose nodes are partly labeled (seed) and partly unlabeled]
6 Graph-based Semi-Supervised Learning Various methods: LP (Zhu et al., 2003); QC (Bengio et al., 2007); Adsorption (Baluja et al., 2008)
7-12 Notations [figure: a node v with seed scores, label priors, and estimated scores]
\hat{Y}_{vl} : score of estimated label l on node v
Y_{vl} : score of seed label l on node v
R_{vl} : regularization target for label l on node v
S : seed node indicator (diagonal matrix)
W_{uv} : weight of edge (u, v) in the graph
13-19 LP-ZGL (Zhu et al., ICML 2003)

\min_{\hat{Y}} \sum_{l=1}^{m} \sum_{u,v} W_{uv} (\hat{Y}_{ul} - \hat{Y}_{vl})^2 = \sum_{l=1}^{m} \hat{Y}_l^\top L \hat{Y}_l   [Smooth]
such that \hat{Y}_{ul} = Y_{ul} for all u with S_{uu} = 1   [Match seeds (hard)]

Graph Laplacian L = D - W (positive semi-definite)
Smoothness: two nodes connected by an edge with high weight should be assigned similar labels
The solution satisfies the harmonic property
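The clamped-seed objective above can be solved in closed form: with seeds held fixed, setting the gradient to zero gives the harmonic solution \hat{Y}_U = L_{UU}^{-1} W_{UL} Y_L. A minimal sketch on a four-node toy graph (the graph and labels here are illustrative, not from the talk):

```python
import numpy as np

# Nodes 0 and 1 are seeds; nodes 2 and 3 are unlabeled.
W = np.array([
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 2.0, 0.0],
])
D = np.diag(W.sum(axis=1))
L = D - W                        # graph Laplacian (PSD)

seeds = [0, 1]                   # labeled (seed) nodes
unlab = [2, 3]                   # unlabeled nodes
Y_seed = np.array([[1.0, 0.0],   # node 0: label A
                   [0.0, 1.0]])  # node 1: label B

# Harmonic solution: clamp seeds, solve L_uu Y_u = W_ul Y_l
L_uu = L[np.ix_(unlab, unlab)]
W_ul = W[np.ix_(unlab, seeds)]
Y_unlab = np.linalg.solve(L_uu, W_ul @ Y_seed)
print(Y_unlab)   # each row: estimated label scores on an unlabeled node
```

Node 2, which touches seed 0, ends up favoring label A; node 3 favors label B, illustrating the smoothness property.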
20-21 Two Related Views [figures]: a random walk from an unlabeled node U toward labeled nodes L, and label diffusion spreading labels from L into U
22-24 Random Walk View [figure: walk at node v, "what next?"]
At the current node v of the walk:
Continue the walk with probability p^cont_v
If v is a seed node, assign v's seed label to U with probability p^inj_v
Abandon the random walk with probability p^abnd_v, assigning U a dummy label
25-28 Discounting Nodes
Certain nodes can be unreliable (e.g., high-degree nodes): do not allow propagation/walks through them.
Solution: increase the abandon probability on such nodes: p^abnd_v grows with degree(v)
29-32 Redefining Matrices
W'_{uv} = p^cont_u * W_{uv} (new edge weight)
S_{uu} = p^inj_u
R_{u,dummy} = p^abnd_u, and R_{ul} = 0 for non-dummy labels l
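The redefinition step can be sketched as follows. The concrete formula for p^abnd below (a degree-based ratio) is an illustrative simplification of the heuristic on the slides, not the exact Adsorption definition; the split of the remaining mass between p^inj and p^cont is likewise an assumption.

```python
import numpy as np

W = np.array([
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 4.0],
    [1.0, 4.0, 0.0],
])
is_seed = np.array([True, False, False])

degree = W.sum(axis=1)
p_abnd = degree / (degree + degree.max())          # grows with node degree
p_inj = np.where(is_seed, 0.5 * (1 - p_abnd), 0.0) # only seeds inject labels
p_cont = 1.0 - p_inj - p_abnd                      # the three sum to 1

W_new = p_cont[:, None] * W   # W'_uv = p^cont_u * W_uv
S = np.diag(p_inj)            # S_uu = p^inj_u
R_dummy = p_abnd              # regularization target on the dummy label
print(W_new)
```

High-degree nodes get a larger abandon probability, so their outgoing edge weights (and thus their influence on propagation) are discounted.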
33-39 Modified Adsorption (MAD) [Talukdar and Crammer, ECML 2009]

\min_{\hat{Y}} \sum_{l=1}^{m+1} \Big[ \|S \hat{Y}_l - S Y_l\|^2 + \mu_1 \sum_{u,v} M_{uv} (\hat{Y}_{ul} - \hat{Y}_{vl})^2 + \mu_2 \|\hat{Y}_l - R_l\|^2 \Big]

First term: match seeds (soft); second term: smooth; third term: match priors (regularizer)
m labels, plus 1 dummy label used to indicate none-of-the-above
M = W + W^\top is the symmetrized weight matrix
\hat{Y}_{vl}: weight of label l on node v; Y_{vl}: seed weight for label l on node v
S: diagonal matrix, nonzero for seed nodes; R_{vl}: regularization target for label l on node v

MAD has extra regularization compared to LP-ZGL (Zhu et al., ICML 2003); similar to QC (Bengio et al., 2006)
40-41 MAD (contd.)

Inputs: Y, R of size |V| x (|L|+1); W of size |V| x |V|; S of size |V| x |V|, diagonal
\hat{Y} <- Y; M <- W + W^\top
Z_v <- S_{vv} + \mu_1 \sum_{u \neq v} M_{vu} + \mu_2, for all v in V
repeat
  for all v in V do
    \hat{Y}_v <- (1 / Z_v) [ (S Y)_v + \mu_1 M_v \hat{Y} + \mu_2 R_v ]
  end for
until convergence

Extends Adsorption with a well-defined optimization. The importance of a node can be discounted. Easily parallelizable: scalable.
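The update above is a Jacobi-style iteration and is straightforward to vectorize. A minimal sketch (function and variable names are mine; columns of Y are the m labels plus one dummy label):

```python
import numpy as np

def mad(Y, R, W, S_diag, mu1=1.0, mu2=0.01, n_iter=100):
    """Jacobi-style MAD iteration, vectorized over all nodes at once."""
    M = W + W.T                                        # symmetrized weights
    # Z_v = S_vv + mu1 * sum_{u != v} M_vu + mu2
    Z = S_diag + mu1 * (M.sum(axis=1) - np.diag(M)) + mu2
    Y_hat = Y.copy()
    for _ in range(n_iter):
        # Y_hat_v <- (1/Z_v) [ (S Y)_v + mu1 * M_v Y_hat + mu2 * R_v ]
        Y_hat = (S_diag[:, None] * Y + mu1 * (M @ Y_hat) + mu2 * R) / Z[:, None]
    return Y_hat

# Tiny usage example: one seed node, one unlabeled node, 2 labels + dummy.
W = np.array([[0.0, 1.0], [1.0, 0.0]])
Y = np.array([[1.0, 0.0, 0.0],      # seed node, labeled with label 1
              [0.0, 0.0, 0.0]])     # unlabeled node
R = np.zeros((2, 3)); R[:, -1] = 0.1   # small prior mass on the dummy label
S_diag = np.array([1.0, 0.0])
out = mad(Y, R, W, S_diag)
print(out)
```

The unlabeled node picks up the seed's label through the smoothness term, while the dummy-label score stays near its small prior.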
42-45 MapReduce Implementation of MAD
Map: each node sends its current label assignments to its neighbors.
Reduce: each node updates its own label assignment using the messages received from its neighbors, plus its own information (e.g., seed labels, regularization penalties).
Repeat until convergence.
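One round of this scheme can be simulated in plain Python. Everything here (the toy graph, scores, and the averaging combine rule) is illustrative only; a real Hadoop job would partition nodes across machines and use the MAD update in the reducer.

```python
from collections import defaultdict

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = {"a": {"pos": 1.0}, "b": {}, "c": {}}   # "a" is a seed node
seed = {"a": {"pos": 1.0}}

def map_phase():
    # Map: each node sends its current label scores to its neighbors.
    messages = defaultdict(list)
    for node, nbrs in graph.items():
        for nbr in nbrs:
            messages[nbr].append(scores[node])
    return messages

def reduce_phase(messages):
    # Reduce: each node averages incoming messages with its own seed info.
    new_scores = {}
    for node in graph:
        acc = defaultdict(float)
        incoming = messages[node] + ([seed[node]] if node in seed else [])
        for msg in incoming:
            for label, val in msg.items():
                acc[label] += val / len(incoming)
        new_scores[node] = dict(acc)
    return new_scores

scores = reduce_phase(map_phase())   # one map/reduce round
print(scores)
```

After one round, the seed's label has reached its direct neighbor "b" but not yet the two-hop node "c"; repeating the round propagates it further.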
46 Experiments
47-48 Text Classification [figure: PRBEP (macro-averaged) on the WebKB dataset, 3148 test instances]
49-50 Sentiment Classification [figure: precision on 3568 sentiment test instances]
51-53 Class-Instance Acquisition [figure: Mean Reciprocal Rank (MRR) of LP-ZGL, Adsorption, and MAD vs. amount of supervision, on the Freebase-2 graph (303k nodes, 2.3m edges) with 192 WordNet classes]
54 New (Class, Instance) Pairs Found (total classes: 9081)
A few non-seed instances found by Adsorption:
Scientific Journals: Journal of Physics; Nature Structural and Molecular Biology; Sciences Sociales et Sante; Kidney and Blood Pressure Research; American Journal of Physiology-Cell Physiology; ...
NFL Players: Tony Gonzales; Thabiti Davis; Taylor Stubblefield; Ron Dixon; Rodney Hannan; ...
Book Publishers: Small Night Shade Books; House of Ansari Press; Highwater Books; Distributed Art Publishers; Cooper Canyon Press; ...
55-57 When is MAD most effective? [figure: relative increase in MRR by MAD over LP-ZGL vs. average degree]
MAD seems to be more effective in graphs with high average degree, where there is greater need for regularization.
58 Benefits of Having an Optimization Framework
59-61 Extension to Dependent Labels
Labels are not always mutually exclusive.
[figure: label graph with label-similarity edge weights, over labels such as Ale, BrownAle, PaleAle, ScotchAle, Porter, TopFermentedBeer, White (e.g., weights 0.95 and 0.8)]
62-67 MAD with Dependent Labels (MADDL)
MADDL objective: min [ Seed Label Loss (if any) + Edge Smoothness Loss + Label Prior Loss (e.g., prior on dummy label) + Dependent Label Loss ]
Dependent Label Loss: penalize if similar labels are assigned different scores on a node (e.g., BrownAle linked to Ale with similarity 1.0)
The MADDL objective results in a scalable iterative update, with a convergence guarantee.
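One way to write the four losses above in the notation of the MAD objective (the weight \mu_3 and the label-similarity matrix C_{ll'} are illustrative symbols, not necessarily those of the paper; see Talukdar and Crammer (2009) for the exact formulation):

```latex
\min_{\hat{Y}} \sum_{l} \| S\hat{Y}_l - S Y_l \|^2
  + \mu_1 \sum_{u,v} M_{uv} \, (\hat{Y}_{ul} - \hat{Y}_{vl})^2
  + \mu_2 \sum_{l} \| \hat{Y}_l - R_l \|^2
  + \mu_3 \sum_{v} \sum_{l, l'} C_{l l'} \, (\hat{Y}_{vl} - \hat{Y}_{vl'})^2
```

Here C_{ll'} is the label-similarity weight from the label graph (e.g., 1.0 between BrownAle and Ale), so the last term penalizes assigning similar labels different scores on the same node.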
68-74 Smooth Sentiment Ranking
Prefer smooth predictions (e.g., top predicted rating labels that are adjacent in rank) over non-smooth predictions (e.g., rank 1 paired with rank 4).
[figure: MADDL label constraints linking adjacent rating labels with weight 1.0]
[figures: counts of the top predicted label pair in MAD output vs. MADDL output]
MADDL generates a smoother ranking, while preserving prediction accuracy.
75-81 Other Methods and Graph Construction
Measure Propagation [Subramanya and Bilmes, NIPS 2009]
- KL-based loss
- Results over a 120m-node graph
Chapter 21 of the SSL book [Chapelle et al.]
Graph Construction
- Task-specific graph construction using metric learning [Dhillon, Talukdar, and Crammer, ACL 2010]
82-87 Graph-based SSL: Final Thoughts
- Provides a flexible representation for both relational and IID data
- Scalable: MR/Hadoop-based implementation
- Can handle labeled as well as unlabeled data
- Can handle multiple classes
- Effective in practice
89-90 Code in the Junto Label Propagation Toolkit (includes a Hadoop-based implementation)
Thanks!
More informationScalable Clustering of Signed Networks Using Balance Normalized Cut
Scalable Clustering of Signed Networks Using Balance Normalized Cut Kai-Yang Chiang,, Inderjit S. Dhillon The 21st ACM International Conference on Information and Knowledge Management (CIKM 2012) Oct.
More informationEE 6882 Visual Search Engine
EE 6882 Visual Search Engine Relevance Feedback March 5 th, 2012 Lecture #7 Graph Based Semi Supervised Learning Application of Image Matching: Manipulation Detection What We have Learned Image Representation
More informationRandom projection for non-gaussian mixture models
Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,
More informationWebSci and Learning to Rank for IR
WebSci and Learning to Rank for IR Ernesto Diaz-Aviles L3S Research Center. Hannover, Germany diaz@l3s.de Ernesto Diaz-Aviles www.l3s.de 1/16 Motivation: Information Explosion Ernesto Diaz-Aviles
More informationGRAPH-BASED SEMI-SUPERVISED HYPERSPECTRAL IMAGE CLASSIFICATION USING SPATIAL INFORMATION
GRAPH-BASED SEMI-SUPERVISED HYPERSPECTRAL IMAGE CLASSIFICATION USING SPATIAL INFORMATION Nasehe Jamshidpour a, Saeid Homayouni b, Abdolreza Safari a a Dept. of Geomatics Engineering, College of Engineering,
More informationLink prediction in graph construction for supervised and semi-supervised learning
Link prediction in graph construction for supervised and semi-supervised learning Lilian Berton, Jorge Valverde-Rebaza and Alneu de Andrade Lopes Laboratory of Computational Intelligence (LABIC) University
More informationSpectral Kernel Learning for Semi-Supervised Classification
Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09) Spectral Kernel Learning for Semi-Supervised Classification Wei Liu The Chinese University of Hong Kong
More informationRevisiting Semi-Supervised Learning with Graph Embeddings
Zhilin Yang William W. Cohen Ruslan Salakhutdinov School of Computer Science, Carnegie Mellon University ZHILINY@CS.CMU.EDU WCOHEN@CS.CMU.EDU RSALAKHU@CS.CMU.EDU Abstract We present a semi-supervised learning
More informationDSCI 575: Advanced Machine Learning. PageRank Winter 2018
DSCI 575: Advanced Machine Learning PageRank Winter 2018 http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf Web Search before Google Unsupervised Graph-Based Ranking We want to rank importance based on
More informationCombining Active Learning and Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions
Combining and Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions Xiaojin Zhu Λ ZHUXJ@CS.CMU.EDU John Lafferty Λ LAFFERTY@CS.CMU.EDU Zoubin Ghahramani yλ ZOUBIN@GATSBY.UCL.AC.UK Λ School
More informationCase Study 4: Collaborative Filtering. GraphLab
Case Study 4: Collaborative Filtering GraphLab Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin March 14 th, 2013 Carlos Guestrin 2013 1 Social Media
More informationAbsorbing Random walks Coverage
DATA MINING LECTURE 3 Absorbing Random walks Coverage Random Walks on Graphs Random walk: Start from a node chosen uniformly at random with probability. n Pick one of the outgoing edges uniformly at random
More informationSupplementary A. Overview. C. Time and Space Complexity. B. Shape Retrieval. D. Permutation Invariant SOM. B.1. Dataset
Supplementary A. Overview This supplementary document provides more technical details and experimental results to the main paper. Shape retrieval experiments are demonstrated with ShapeNet Core55 dataset
More informationNetwork embedding. Cheng Zheng
Network embedding Cheng Zheng Outline Problem definition Factorization based algorithms --- Laplacian Eigenmaps(NIPS, 2001) Random walk based algorithms ---DeepWalk(KDD, 2014), node2vec(kdd, 2016) Deep
More informationInstance-level Semi-supervised Multiple Instance Learning
Instance-level Semi-supervised Multiple Instance Learning Yangqing Jia and Changshui Zhang State Key Laboratory on Intelligent Technology and Systems Tsinghua National Laboratory for Information Science
More informationMachine Learning and Data Mining. Clustering (1): Basics. Kalev Kask
Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of
More informationA Geometric Perspective on Machine Learning
A Geometric Perspective on Machine Learning Partha Niyogi The University of Chicago Collaborators: M. Belkin, V. Sindhwani, X. He, S. Smale, S. Weinberger A Geometric Perspectiveon Machine Learning p.1
More informationTrace Ratio Criterion for Feature Selection
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Trace Ratio Criterion for Feature Selection Feiping Nie 1, Shiming Xiang 1, Yangqing Jia 1, Changshui Zhang 1 and Shuicheng
More informationEffectiveness of Sparse Features: An Application of Sparse PCA
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationNetwork community detection with edge classifiers trained on LFR graphs
Network community detection with edge classifiers trained on LFR graphs Twan van Laarhoven and Elena Marchiori Department of Computer Science, Radboud University Nijmegen, The Netherlands Abstract. Graphs
More information1 Document Classification [60 points]
CIS519: Applied Machine Learning Spring 2018 Homework 4 Handed Out: April 3 rd, 2018 Due: April 14 th, 2018, 11:59 PM 1 Document Classification [60 points] In this problem, you will implement several text
More informationSmartLabel: An Object Labeling Tool Using Iterated Harmonic Energy Minimization
SmartLabel: An Object Labeling Tool Using Iterated Harmonic Energy Minimization Wen Wu and Jie Yang School of Computer Science, Carnegie Mellon University Pittsburgh, PA, USA wenwu@cs.cmu.edu, jie.yang@cs.cmu.edu
More informationDeeper Insights into Graph Convolutional Networks for Semi-Supervised Learning
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning Qimai Li, 1 Zhichao Han, 1,2 Xiao-Ming Wu 1 1 The Hong
More informationHarvesting Facts from Textual Web Sources by Constrained Label Propagation
Harvesting Facts from Textual Web Sources by Constrained Label Propagation Yafang Wang, Bin Yang, Lizhen Qu, Marc Spaniol and Gerhard Weikum Max-Planck-Institut für Informatik Saarbrücken, Germany {ywang
More informationLocalized Lasso for High-Dimensional Regression
Localized Lasso for High-Dimensional Regression Makoto Yamada Kyoto University, PRESTO JST myamada@kuicr.kyoto-u.ac.jp John Shawe-Taylor University College London j.shawe-taylor@ucl.ac.uk Koh Takeuchi,
More informationTGNet: Learning to Rank Nodes in Temporal Graphs. Qi Song 1 Bo Zong 2 Yinghui Wu 1,3 Lu-An Tang 2 Hui Zhang 2 Guofei Jiang 2 Haifeng Chen 2
TGNet: Learning to Rank Nodes in Temporal Graphs Qi Song 1 Bo Zong 2 Yinghui Wu 1,3 Lu-An Tang 2 Hui Zhang 2 Guofei Jiang 2 Haifeng Chen 2 1 2 3 Node Ranking in temporal graphs Temporal graphs have been
More informationIntroduction to Text Mining. Hongning Wang
Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:
More informationMatching. Compare region of image to region of image. Today, simplest kind of matching. Intensities similar.
Matching Compare region of image to region of image. We talked about this for stereo. Important for motion. Epipolar constraint unknown. But motion small. Recognition Find object in image. Recognize object.
More informationCommunity Preserving Network Embedding
Community Preserving Network Embedding Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, Shiqiang Yang Presented by: Ben, Ashwati, SK What is Network Embedding 1. Representation of a node in an m-dimensional
More informationMetric Learning with Two-Dimensional Smoothness for Visual Analysis
Metric Learning with Two-Dimensional Smoothness for Visual Analysis Xinlei Chen Zifei Tong Haifeng Liu Deng Cai The State Key Lab of CAD&CG, College of Computer Science, Zhejiang University 388 Yu Hang
More information