(Graph-based) Semi-Supervised Learning
Partha Pratim Talukdar, Indian Institute of Science
April 7, 2015


Supervised Learning
Labeled Data → Learning Algorithm → Model
Examples: Decision Trees, Support Vector Machine (SVM), Maximum Entropy (MaxEnt)

Semi-Supervised Learning (SSL)
Labeled Data + A Lot of Unlabeled Data → Learning Algorithm → Model
Examples: Self-Training, Co-Training

Why SSL? How can unlabeled data be helpful?
[Figure: labeled instances and the decision boundary learned without unlabeled data, versus the boundary learned with unlabeled instances added; the decision boundary is more accurate in the presence of unlabeled instances. Example from [Belkin et al., JMLR 2006]]

Inductive vs Transductive

                                   Inductive                   Transductive
                                   (generalizes to             (does not generalize
                                   unseen data)                to unseen data)
  Supervised (Labeled)             SVM, Maximum Entropy        X
  Semi-supervised                  Manifold Regularization     Label Propagation (LP),
  (Labeled + Unlabeled)                                        MAD, MP, ... (focus of this talk)

Most Graph SSL algorithms are non-parametric (i.e., the number of parameters grows with the data size). See Chapter 25 of the SSL Book [Chapelle et al.].

Two Popular SSL Algorithms: Self-Training and Co-Training

Why Graph-based SSL?
- Some datasets are naturally represented by a graph: web, citation network, social network, ...
- Uniform representation for heterogeneous data
- Easily parallelizable, scalable to large data
- Effective in practice
[Figure: text classification accuracy, comparing Graph SSL against non-graph SSL and supervised baselines]

Graph-based SSL
[Figure: instances as nodes connected by similarity edges; seed labels "business" and "politics" propagate from labeled nodes to unlabeled nodes]

Graph-based SSL
Smoothness Assumption: if two instances are similar according to the graph, then their output labels should be similar.
Two stages:
- Graph construction (if not already present)
- Label inference

Outline: Motivation, Graph Construction, Inference Methods, Scalability, Applications, Conclusion & Future Work

Graph Construction
- Neighborhood methods: k-NN graph construction (k-NNG), ε-neighborhood method
- Metric learning
- Other approaches

Neighborhood Methods
- k-Nearest Neighbor Graph (k-NNG): add edges between each instance and its k nearest neighbors (e.g., k = 3)
- ε-neighborhood: add edges between all instances inside a ball of radius ε
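A minimal sketch of k-NNG construction, assuming Euclidean distances and Gaussian edge weights (function and parameter names are illustrative):

```python
import numpy as np

def knn_graph(X, k=3, sigma=1.0):
    """Symmetrized k-NN graph with Gaussian edge weights.

    X: (n, d) instance matrix. Returns an (n, n) weight matrix W.
    Brute-force O(n^2) distances, which is exactly the scalability
    issue noted on the next slide.
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq[i])[1:k + 1]   # skip i itself (distance 0)
        W[i, nbrs] = np.exp(-sq[i, nbrs] / (2 * sigma ** 2))
    return np.maximum(W, W.T)               # symmetrize the directed k-NNG
```

The final symmetrization step addresses the asymmetry issue discussed next.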

Issues with k-NNG
- Not scalable (quadratic in the number of instances)
- Results in an asymmetric graph: b may be the closest neighbor of a, but not the other way around
- Results in irregular graphs: some nodes may end up with much higher degree than other nodes (e.g., a node of degree 4 in a k-NNG with k = 1)

Issues with ε-neighborhood
- Not scalable
- Sensitive to the value of ε: not invariant to scaling
- Fragmented graph: disconnected components
[Figure: disconnected data and the resulting fragmented ε-neighborhood graph, from [Jebara et al., ICML 2009]]

Graph Construction using Metric Learning
Edge weights: w_ij ∝ exp(−D_A(x_i, x_j)), where D_A(x_i, x_j) = (x_i − x_j)^T A (x_i − x_j) is a Mahalanobis distance whose matrix A is estimated using metric learning algorithms.
- Supervised metric learning: ITML [Kulis et al., ICML 2007], LMNN [Weinberger and Saul, JMLR 2009]
- Semi-supervised metric learning: IDML [Dhillon et al., UPenn TR 2010]
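A sketch of the weighting scheme above, assuming the Mahalanobis matrix A has already been learned (e.g., by ITML or IDML); the pruning threshold is an illustrative addition to sparsify the graph:

```python
import numpy as np

def mahalanobis_weights(X, A, threshold=1e-4):
    """Edge weights w_ij = exp(-D_A(x_i, x_j)), with
    D_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j).

    X: (n, d) instances; A: (d, d) learned PSD matrix (an input here).
    """
    diff = X[:, None, :] - X[None, :, :]            # (n, n, d) differences
    D = np.einsum('ijd,de,ije->ij', diff, A, diff)  # Mahalanobis distances
    W = np.exp(-D)
    np.fill_diagonal(W, 0.0)                        # no self-loops
    W[W < threshold] = 0.0                          # prune tiny weights
    return W
```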

Benefits of Metric Learning for Graph Construction
[Figure: classification error on the Amazon, Newsgroups, Reuters, and Enron A text datasets for graphs constructed using Original, RP, PCA, ITML (supervised metric learning), and IDML (semi-supervised metric learning [Dhillon et al., UPenn TR 2010]) representations; 100 seed and 1400 test instances, all inference using LP]
Careful graph construction is critical!

Other Graph Construction Approaches
- Local reconstruction: Linear Neighborhood [Wang and Zhang, ICML 2005]
- Regular graphs: b-matching [Jebara et al., ICML 2009]
- Fitting a graph to vector data [Daitch et al., ICML 2009]
- Graph kernels [Zhu et al., NIPS 2005]

Outline: Motivation, Graph Construction, Inference Methods (Label Propagation, Modified Adsorption, Measure Propagation, Sparse Label Propagation, Manifold Regularization), Scalability, Applications, Conclusion & Future Work

Graph Laplacian
Laplacian (un-normalized) of a graph: L = D − W, where D_ii = Σ_j W_ij and D_ij = 0 for i ≠ j.
[Figure: example graph on nodes a, b, c, d with edge weights, and its 4 x 4 Laplacian matrix]

Graph Laplacian (contd.)
L is positive semi-definite (assuming non-negative weights).
Smoothness of a prediction f over the graph, in terms of the Laplacian (f is the vector of scores for a single label on the nodes):

  f^T L f = Σ_{i,j} W_ij (f_i − f_j)^2

This is a measure of non-smoothness. On the example graph, a labeling such as f^T = [1 1 1 3] is smooth (f^T L f = 4), while a labeling whose scores differ widely across strongly connected nodes is not (f^T L f = 588).
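A small numeric sketch of the Laplacian and the smoothness functional; the toy weights below are illustrative, not the slides' exact example:

```python
import numpy as np

# Toy 4-node graph {a, b, c, d}: a-b weight 2, b-c weight 3, c-d weight 1.
W = np.array([[0., 2., 0., 0.],
              [2., 0., 3., 0.],
              [0., 3., 0., 1.],
              [0., 0., 1., 0.]])
D = np.diag(W.sum(axis=1))   # D_ii = sum_j W_ij
L = D - W                    # un-normalized graph Laplacian

def non_smoothness(f):
    """f^T L f, the measure of non-smoothness from the slides."""
    return f @ L @ f

print(non_smoothness(np.array([1., 1., 1., 3.])))    # 4.0: smooth
print(non_smoothness(np.array([1., 10., 5., 25.])))  # 637.0: not smooth

# Sanity check: f^T L f equals the pairwise form,
# the sum over edges of W_ij * (f_i - f_j)^2.
```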

Notations
- Y_vl: score of seed label l on node v
- Ŷ_vl: score of estimated label l on node v
- R_vl: regularization target for label l on node v
- S: seed node indicator (diagonal matrix)
- W_uv: weight of edge (u, v) in the graph
[Figure: a node v with its seed scores, label regularization targets, and estimated scores]

LP-ZGL [Zhu et al., ICML 2003]

  argmin_Ŷ Σ_{l=1..m} Σ_{u,v} W_uv (Ŷ_ul − Ŷ_vl)^2 = Σ_{l=1..m} Ŷ_l^T L Ŷ_l   (smoothness, via the graph Laplacian L)
  such that Y_ul = Ŷ_ul for all u with S_uu = 1   (match seeds, a hard constraint)

Smoothness: two nodes connected by an edge with high weight should be assigned similar labels. The solution satisfies the harmonic property.
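A minimal sketch of LP-ZGL via its standard iterative form: each sweep replaces every node's scores with the weighted average of its neighbors' scores, then clamps the seeds; the fixed point satisfies the harmonic property. Names are illustrative:

```python
import numpy as np

def lp_zgl(W, Y, seed_mask, iters=100):
    """Label propagation [Zhu et al., ICML 2003], iterative form.

    W: (n, n) symmetric weight matrix; Y: (n, m) seed label scores
    (rows of non-seed nodes are ignored); seed_mask: (n,) boolean.
    """
    deg = W.sum(axis=1, keepdims=True)               # node degrees D_uu
    Yhat = Y.astype(float).copy()
    for _ in range(iters):
        Yhat = (W @ Yhat) / np.maximum(deg, 1e-12)   # weighted neighbor average
        Yhat[seed_mask] = Y[seed_mask]               # clamp seeds (hard constraint)
    return Yhat
```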

Modified Adsorption (MAD) [Talukdar and Crammer, ECML 2009]

  argmin_Ŷ Σ_{l=1..m+1} [ ‖S Ŷ_l − S Y_l‖^2                 (match seeds, soft)
                         + μ1 Σ_{u,v} M_uv (Ŷ_ul − Ŷ_vl)^2   (smoothness)
                         + μ2 ‖Ŷ_l − R_l‖^2 ]                (match priors / regularizer)

- m labels, plus 1 dummy label ("none of the above")
- M = W + W^T is the symmetrized weight matrix
- Ŷ_vl: weight of label l on node v; Y_vl: seed weight for label l on node v
- S: diagonal matrix, nonzero for seed nodes
- R_vl: regularization target for label l on node v

MAD's objective is convex. MAD has extra regularization compared to LP-ZGL [Zhu et al., ICML 2003]; it is similar to the Quadratic Criterion (QC) [Bengio et al., 2006].

Solving MAD Objective
- Can be solved using matrix inversion (as in LP), but matrix inversion is expensive
- Instead solved exactly as a system of linear equations (Ax = b) using Jacobi iterations, which yields iterative updates with guaranteed convergence; see [Bengio et al., 2006] and [Talukdar and Crammer, ECML 2009] for details

Solving MAD using Iterative Updates

Inputs: Y, R: |V| x (|L|+1); W: |V| x |V|; S: |V| x |V| diagonal
  Ŷ ← Y
  M ← W + W^T
  Z_v ← S_vv + μ1 Σ_{u≠v} M_vu + μ2, for all v ∈ V
  repeat
    for all v ∈ V do
      Ŷ_v ← ( (S Y)_v + μ1 M_v Ŷ + μ2 R_v ) / Z_v
    end for
  until convergence

Each pass computes a new label estimate on node v from its seed scores, its neighbors' current label estimates, and its prior. The importance of a node can be discounted. Easily parallelizable: scalable (more later).
[Figure: node v with its seed and prior scores and the current label estimates of its neighbors]
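A vectorized sketch of the update above. It is simplified: the full MAD also discounts nodes via per-node random-walk (injection/continuation/abandonment) probabilities, which this sketch omits:

```python
import numpy as np

def mad_iterate(W, Y, R, s, mu1=1.0, mu2=1e-2, iters=100):
    """Jacobi-style MAD updates [Talukdar and Crammer, ECML 2009].

    W: (n, n) weights with zero diagonal; Y, R: (n, m+1) seed scores and
    regularization targets; s: (n,) diagonal of S (nonzero for seeds).
    """
    M = W + W.T                              # symmetrized weight matrix
    Z = s + mu1 * M.sum(axis=1) + mu2        # per-node normalizer Z_v
    Yhat = Y.astype(float).copy()
    for _ in range(iters):
        # Yhat_v <- (s_v * Y_v + mu1 * sum_u M_vu Yhat_u + mu2 * R_v) / Z_v
        Yhat = (s[:, None] * Y + mu1 * (M @ Yhat) + mu2 * R) / Z[:, None]
    return Yhat
```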

When is MAD most effective?
[Figure: relative increase in MRR of MAD over LP-ZGL, plotted against average node degree]
MAD is particularly effective on denser graphs, where there is greater need for regularization.

Extension to Dependent Labels
Labels are not always mutually exclusive (e.g., label similarity in sentiment classification).
Modified Adsorption with Dependent Labels (MADDL) [Talukdar and Crammer, ECML 2009]:
- Can take label similarities into account
- Convex objective
- Efficient iterative/parallelizable updates, as in MAD

Measure Propagation (MP) [Subramanya and Bilmes, EMNLP 2008, NIPS 2009, JMLR 2011]

  C_KL: argmin_{p_i} Σ_{i=1..l} D_KL(r_i ‖ p_i) + μ Σ_{i,j} w_ij D_KL(p_i ‖ p_j) − ν Σ_{i=1..n} H(p_i)
  s.t. Σ_y p_i(y) = 1 and p_i(y) ≥ 0 for all y, i

- r_i and p_i are the seed and estimated label distributions (normalized) on node i
- First term: divergence on seed nodes; second term: smoothness (divergence across edges); third term: entropic regularizer
- KL divergence: D_KL(p_i ‖ p_j) = Σ_y p_i(y) log (p_i(y) / p_j(y)); entropy: H(p_i) = −Σ_y p_i(y) log p_i(y)
- C_KL is convex (with non-negative edge weights and hyper-parameters)
- MP is related to Information Regularization [Corduneanu and Jaakkola, 2003]
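A small sketch that evaluates C_KL for candidate label distributions; it checks the objective rather than optimizing it, and all names and the ν default are illustrative:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Discrete KL divergence D_KL(p || q)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def entropy(p, eps=1e-12):
    p = np.clip(p, eps, None)
    return float(-np.sum(p * np.log(p)))

def c_kl(P, R, W, seed_idx, mu=1.0, nu=0.01):
    """Value of the C_KL objective for label distributions P (n x m);
    seed_idx lists labeled nodes, whose reference distributions are R."""
    seed_term = sum(kl(R[i], P[i]) for i in seed_idx)
    smooth_term = sum(W[i, j] * kl(P[i], P[j])
                      for i in range(len(P)) for j in range(len(P))
                      if W[i, j] > 0)
    return seed_term + mu * smooth_term - nu * sum(entropy(p) for p in P)
```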

Solving MP Objective
For ease of optimization, reformulate the MP objective as:

  C_MP: argmin_{p_i, q_i} Σ_{i=1..l} D_KL(r_i ‖ q_i) + μ Σ_{i,j} w'_ij D_KL(p_i ‖ q_j) − ν Σ_{i=1..n} H(p_i)

- q_i: a new probability measure for each vertex, playing a role similar to p_i
- w'_ij = w_ij + α δ(i, j); the diagonal term encourages agreement between p_i and q_i
- C_MP is also convex (with non-negative edge weights and hyper-parameters)
- C_MP can be solved using Alternating Minimization (AM)

Alternating Minimization
[Figure: AM alternates between the two sets of measures, generating Q0 → P1 → Q1 → P2 → Q2 → P3 → ..., converging to the optimum]
C_MP satisfies the necessary conditions for AM to converge [Subramanya and Bilmes, JMLR 2011].
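A sketch of the AM scheme for C_MP. The two updates below are the closed forms obtained by minimizing C_MP over the p_i with the q_i fixed, and vice versa; this is a derivation-based sketch under those assumptions, not the reference implementation of [Subramanya and Bilmes, JMLR 2011]:

```python
import numpy as np

def mp_am(Wp, R, labeled, mu=1.0, nu=0.01, iters=50):
    """Alternating Minimization for the reformulated objective C_MP.

    Wp: (n, n) modified weights w'_ij (w plus the diagonal agreement term);
    R: (n, m) seed distributions (rows of unlabeled nodes ignored);
    labeled: (n,) boolean mask.
    """
    n, m = R.shape
    Q = np.full((n, m), 1.0 / m)          # start from uniform measures
    wrow = Wp.sum(axis=1)                 # sum_j w'_ij for each i
    for _ in range(iters):
        # p-step: p_i(y) ∝ exp( mu * sum_j w'_ij log q_j(y) / (nu + mu * sum_j w'_ij) )
        logq = np.log(np.clip(Q, 1e-12, None))
        P = np.exp(mu * (Wp @ logq) / (nu + mu * wrow)[:, None])
        P /= P.sum(axis=1, keepdims=True)
        # q-step: q_j(y) ∝ 1[j labeled] * r_j(y) + mu * sum_i w'_ij p_i(y)
        Q = labeled[:, None] * R + mu * (Wp.T @ P)
        Q /= Q.sum(axis=1, keepdims=True)
    return P, Q
```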

Background: Factor Graphs [Kschischang et al., 2001]
A factor graph is a bipartite graph with:
- variable nodes V (e.g., the label distribution on a graph node)
- factor nodes F: fitness functions over assignments of the variables connected to them
The distribution over all variables' values factorizes over the factors: p(v) ∝ Π_{f ∈ F} φ_f(v_f), where v_f denotes the variables connected to factor f.

Factor Graph Interpretation of Graph SSL [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]
A 3-term Graph SSL objective (common to many algorithms):

  min  Seed Matching Loss (if any) + Edge Smoothness Loss + Regularization Loss

Each term corresponds to a factor:
- Smoothness factor (pairwise): φ(q_1, q_2) ∝ exp(−w_{1,2} ‖q_1 − q_2‖^2)
- Seed matching factor (unary)
- Regularization factor (unary)

Factor Graph Interpretation [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]
[Figure: a factor graph over variable nodes q_i with (1) unary factors encouraging agreement with the seed labels r_i, (2) pairwise smoothness factors between neighboring nodes, and (3) unary regularization factors −log φ_t(q_t)]

Label Propagation with Sparsity
Sparsity is enforced through a sparsity-inducing unary factor −log φ_t(q_t), e.g., the Lasso (Tibshirani, 1996) or the Elitist Lasso (Kowalski and Torrésani, 2009) penalty.
For more details, see [Das and Smith, NAACL 2012].

Manifold Regularization [Belkin et al., JMLR 2006]

  f* = argmin_f (1/l) Σ_{i=1..l} V(y_i, f(x_i)) + f^T L f + ‖f‖^2_K

- First term: training data loss, with loss function V (e.g., soft margin)
- f^T L f: smoothness regularizer, where L is the Laplacian of a graph over the labeled and unlabeled data
- ‖f‖^2_K: standard regularizer (e.g., L2)

Trains an inductive classifier which can generalize to unseen instances.
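A simplified numeric sketch of the idea with a squared loss, solved directly over function values on the given points. Unlike the paper's RKHS formulation this version is transductive, not inductive; hyper-parameter names are illustrative:

```python
import numpy as np

def laplacian_regularized_ls(L, y, labeled, gamma_I=1.0, gamma_A=0.01):
    """Minimize (1/l) * sum_{i labeled} (f_i - y_i)^2
                + gamma_I * f^T L f + gamma_A * ||f||^2 over f.

    L: (n, n) graph Laplacian over labeled + unlabeled points;
    y: (n,) targets (entries for unlabeled points are ignored);
    labeled: (n,) boolean mask.
    """
    n = L.shape[0]
    l = labeled.sum()
    J = np.diag(labeled.astype(float))   # picks out the labeled points
    # Stationarity condition: (J/l + gamma_I * L + gamma_A * I) f = J y / l
    A = J / l + gamma_I * L + gamma_A * np.eye(n)
    return np.linalg.solve(A, (J @ y) / l)
```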

Other Graph-based SSL Methods
- TACO [Orbach and Crammer, ECML 2012]
- SSL on directed graphs [Zhou et al., NIPS 2005], [Zhou et al., ICML 2005]
- Spectral Graph Transduction [Joachims, ICML 2003]
- Graph SSL for ordering [Talukdar et al., CIKM 2012]
- Learning with dissimilarity edges [Goldberg et al., AISTATS 2007]

Outline: Motivation, Graph Construction, Inference Methods, Scalability (Scalability Issues, Node Reordering, MapReduce Parallelization), Applications, Conclusion & Future Work

More (Unlabeled) Data is Better Data [Subramanya & Bilmes, JMLR 2011]
[Figure: results on a graph with 120m vertices]
Challenges with large unlabeled data:
- constructing a graph from large data
- scalable inference over large graphs

Label Update using Message Passing
[Figure: an SMP with k processors, each processing a block of graph nodes (neighbors not shown). At step k, a processor reads the current label estimate on node v along with its seed and prior scores; at step k+1 it writes the new label estimate on v]

Node Reordering Algorithm: Intuition [Subramanya & Bilmes, JMLR 2011; Bilmes & Subramanya, 2011]
Which node should be processed along with k? The one whose neighborhood has the highest intersection with k's neighborhood (cardinality of intersection). In the slides' example, k's neighbors are a, b, and c, with

  |N(k) ∩ N(a)| = 1,  |N(k) ∩ N(b)| = 2,  |N(k) ∩ N(c)| = 0,

so b is the best node to process along with k (see the sketch below).
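A toy sketch of this greedy ordering heuristic, illustrating the intuition only (not the full algorithm of Bilmes & Subramanya); the neighborhoods mirror the example above:

```python
def reorder(adj, start):
    """Greedily order nodes so that each node shares as many neighbors as
    possible with the previously placed node (better cache locality on
    an SMP). adj: dict mapping node -> set of neighbors."""
    remaining = set(adj) - {start}
    order = [start]
    while remaining:
        last = order[-1]
        best = max(remaining, key=lambda u: len(adj[last] & adj[u]))
        remaining.remove(best)
        order.append(best)
    return order

# b shares two neighbors with k, so it is processed right after k.
adj = {'k': {'a', 'b', 'c'},
       'a': {'e', 'j', 'c'},
       'b': {'a', 'd', 'c'},
       'c': {'l', 'n', 'm'}}
print(reorder(adj, 'k'))  # ['k', 'b', 'a', 'c']
```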

Speed-up on SMP after Node Reordering [Subramanya & Bilmes, JMLR 2011]

MapReduce Implementation of MAD
- Map: each node sends its current label assignments to its neighbors
- Reduce: each node updates its own label assignment using the messages received from its neighbors and its own information (e.g., seed labels, regularization penalties, etc.)
- Repeat until convergence
[Figure: node v receives label estimates from neighbors a, b, c, plus its own seed and prior scores, and produces a new label estimate]
Graph-based algorithms are amenable to distributed processing. Code is available in the Junto Label Propagation Toolkit (includes a Hadoop-based implementation); a toy sketch of one iteration follows.
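A single-process sketch of one MAD iteration in map/reduce style, with plain-Python stand-ins for the mapper and reducer; names and message formats are illustrative, not Junto's actual API:

```python
from collections import defaultdict

def map_phase(node_id, state, neighbors):
    """Map: emit current label scores to each neighbor, plus own state."""
    for nbr, w in neighbors.items():
        yield nbr, ('msg', w, state['yhat'])
    yield node_id, ('self', state)

def reduce_phase(node_id, messages, mu1=1.0, mu2=1e-2):
    """Reduce: MAD-style update from neighbor messages and own seed/prior."""
    state, acc, wsum = None, defaultdict(float), 0.0
    for m in messages:
        if m[0] == 'self':
            state = m[1]
        else:
            _, w, yhat = m
            wsum += w
            for label, score in yhat.items():
                acc[label] += w * score        # weighted neighbor scores
    z = state['s'] + mu1 * wsum + mu2          # per-node normalizer
    labels = set(acc) | set(state['seed']) | set(state['prior'])
    return node_id, {l: (state['s'] * state['seed'].get(l, 0.0)
                         + mu1 * acc[l]
                         + mu2 * state['prior'].get(l, 0.0)) / z
                     for l in labels}
```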

When to use Graph-based SSL, and which method?
- When the input data itself is a graph (relational data), or when the data is expected to lie on a manifold
- MAD, Quadratic Criterion (QC): when labels are not mutually exclusive
- MADDL: when label similarities are known
- Measure Propagation (MP): for a probabilistic interpretation
- Manifold Regularization: for generalization to unseen data (induction)

Graph-based SSL: Summary
- Provides a flexible representation for both IID and relational data
- Graph construction can be key
- Scalable: node reordering and MapReduce
- Can handle labeled as well as unlabeled data
- Can handle multi-class, multi-label settings
- Effective in practice

Open Challenges
- Graph-based SSL for structured prediction
  - Algorithms: combining inductive and graph-based methods
  - Applications: constituency and dependency parsing, coreference
- Scalable graph construction, especially with multi-modal data
- Extensions with other loss functions, sparsity, etc.

References

[1] A. Alexandrescu and K. Kirchhoff. Data-driven graph construction for semi-supervised graph-based learning in NLP. In NAACL HLT.
[2] Y. Altun, D. McAllester, and M. Belkin. Maximum margin semi-supervised learning for structured variables. NIPS.
[3] S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, and M. Aly. Video suggestion and discovery for YouTube: taking random walks through the view graph. In WWW.
[4] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. Distributional word clusters vs. words for text categorization. Journal of Machine Learning Research, 3.
[5] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7.
[6] Y. Bengio, O. Delalleau, and N. Le Roux. Label propagation and quadratic criterion. In Semi-Supervised Learning.
[7] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein. Painless unsupervised learning with features. In HLT-NAACL.
[8] J. Bilmes and A. Subramanya. Parallel graph-based semi-supervised learning. In Scaling up Machine Learning: Parallel and Distributed Approaches.
[9] S. Blair-Goldensohn, T. Neylon, K. Hannan, G. A. Reis, R. McDonald, and J. Reynar. Building a sentiment summarizer for local service reviews. In NLP in the Information Explosion Era.
[10] M. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang. WebTables: exploring the power of tables on the web. VLDB.
[11] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA.
[12] Y. Choi and C. Cardie. Adapting a polarity lexicon using integer linear programming for domain-specific sentiment classification. In EMNLP.
[13] S. Daitch, J. Kelner, and D. Spielman. Fitting a graph to vector data. In ICML.
[14] D. Das and S. Petrov. Unsupervised part-of-speech tagging with bilingual graph-based projections. In ACL.
[15] D. Das, N. Schneider, D. Chen, and N. A. Smith. Probabilistic frame-semantic parsing. In NAACL-HLT.
[16] D. Das and N. Smith. Graph-based lexicon expansion with sparsity-inducing penalties. In NAACL-HLT.
[17] D. Das and N. A. Smith. Semi-supervised frame-semantic parsing for unknown predicates. In ACL.
[18] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. In ICML.
[19] O. Delalleau, Y. Bengio, and N. Le Roux. Efficient non-parametric function induction in semi-supervised learning. In AISTATS.
[20] P. Dhillon, P. Talukdar, and K. Crammer. Inference-driven metric learning for graph construction. Technical report, MS-CIS-10-18, University of Pennsylvania.
[21] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In CIKM.
[22] J. Friedman, J. Bentley, and R. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3.
[23] J. Garcke and M. Griebel. Data mining with sparse grids using simplicial basis functions. In KDD.
[24] A. Goldberg and X. Zhu. Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization. In Proceedings of the First Workshop on Graph-Based Methods for Natural Language Processing.
[25] A. Goldberg, X. Zhu, and S. Wright. Dissimilarity in graph-based semi-supervised classification. AISTATS.
[26] M. Hu and B. Liu. Mining and summarizing customer reviews. In KDD.
[27] T. Jebara, J. Wang, and S. Chang. Graph construction and b-matching for semi-supervised learning. In ICML.
[28] T. Joachims. Transductive inference for text classification using support vector machines. In ICML.
[29] T. Joachims. Transductive learning via spectral graph partitioning. In ICML.
[30] M. Karlen, J. Weston, A. Erkan, and R. Collobert. Large scale manifold transduction. In ICML.
[31] S.-M. Kim and E. Hovy. Determining the sentiment of opinions. In Proceedings of the 20th International Conference on Computational Linguistics.
[32] F. Kschischang, B. Frey, and H. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2).
[33] K. Lerman, S. Blair-Goldensohn, and R. McDonald. Sentiment summarization: evaluating and learning user preferences. In EACL.
[34] D. Lewis et al. Reuters.
[35] J. Malkin, A. Subramanya, and J. Bilmes. On the semi-supervised learning of multi-layered perceptrons. In InterSpeech.
[36] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In EMNLP.
[37] D. Rao and D. Ravichandran. Semi-supervised polarity lexicon induction. In EACL.
[38] A. Subramanya and J. Bilmes. Soft-supervised learning for text classification. In EMNLP.
[39] A. Subramanya and J. Bilmes. Entropic graph regularization in non-parametric semi-supervised classification. NIPS.
[40] A. Subramanya and J. Bilmes. Semi-supervised learning with measure propagation. JMLR.
[41] A. Subramanya, S. Petrov, and F. Pereira. Efficient graph-based semi-supervised learning of structured tagging models. In EMNLP.
[42] P. Talukdar. Topics in graph construction for semi-supervised learning. Technical report, MS-CIS-09-13, University of Pennsylvania.
[43] P. Talukdar and K. Crammer. New regularized algorithms for transductive learning. ECML.
[44] P. Talukdar and F. Pereira. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In ACL.
[45] P. Talukdar, J. Reisinger, M. Paşca, D. Ravichandran, R. Bhagat, and F. Pereira. Weakly-supervised acquisition of labeled class instances using graph random walks. In EMNLP.
[46] B. Van Durme and M. Pasca. Finding cars, goddesses and enzymes: Parametrizable acquisition of labeled instances for open-domain information extraction. In AAAI.
[47] L. Velikovich, S. Blair-Goldensohn, K. Hannan, and R. McDonald. The viability of web-derived polarity lexicons. In HLT-NAACL.
[48] F. Wang and C. Zhang. Label propagation through linear neighborhoods. In ICML.
[49] J. Wang, T. Jebara, and S. Chang. Graph transduction via alternating minimization. In ICML.
[50] R. Wang and W. Cohen. Language-independent set expansion of named entities using the web. In ICDM.
[51] K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10.
[52] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In HLT-EMNLP.
[53] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. NIPS.
[54] D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed graph. In ICML.
[55] D. Zhou, B. Schölkopf, and T. Hofmann. Semi-supervised learning on directed graphs. In NIPS.
[56] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, CMU-CALD, Carnegie Mellon University.
[57] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Carnegie Mellon University.
[58] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML.
[59] X. Zhu and J. Lafferty. Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning. In ICML.

Thanks!


More information

Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning

Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning Xiaojin Zhu John Lafferty School of Computer Science, Carnegie ellon University,

More information

Efficient Non-Parametric Function Induction in Semi-Supervised Learning

Efficient Non-Parametric Function Induction in Semi-Supervised Learning Efficient Non-Parametric Function Induction in Semi-Supervised Learning Olivier Delalleau, Yoshua Bengio and Nicolas Le Roux Dept. IRO, Université de Montréal P.O. Box 618, Succ. Centre-Ville, Montreal,

More information

GRAPH-BASED SEMI-SUPERVISED HYPERSPECTRAL IMAGE CLASSIFICATION USING SPATIAL INFORMATION

GRAPH-BASED SEMI-SUPERVISED HYPERSPECTRAL IMAGE CLASSIFICATION USING SPATIAL INFORMATION GRAPH-BASED SEMI-SUPERVISED HYPERSPECTRAL IMAGE CLASSIFICATION USING SPATIAL INFORMATION Nasehe Jamshidpour a, Saeid Homayouni b, Abdolreza Safari a a Dept. of Geomatics Engineering, College of Engineering,

More information

Typed Graph Models for Semi-Supervised Learning of Name Ethnicity

Typed Graph Models for Semi-Supervised Learning of Name Ethnicity Typed Graph Models for Semi-Supervised Learning of Name Ethnicity Delip Rao Dept. of Computer Science Johns Hopkins University delip@cs.jhu.edu David Yarowsky Dept. of Computer Science Johns Hopkins University

More information

Semi-Supervised Phone Classification using Deep Neural Networks and Stochastic Graph-Based Entropic Regularization

Semi-Supervised Phone Classification using Deep Neural Networks and Stochastic Graph-Based Entropic Regularization Semi-Supervised Phone Classification using Deep Neural Networks and Stochastic Graph-Based Entropic Regularization Sunil Thulasidasan 1,2, Jeffrey Bilmes 2 1 Los Alamos National Laboratory 2 Department

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Machine Learning Potsdam, 26 April 2012 Saeedeh Momtazi Information Systems Group Introduction 2 Machine Learning Field of study that gives computers the ability to learn without

More information

Improving Image Segmentation Quality Via Graph Theory

Improving Image Segmentation Quality Via Graph Theory International Symposium on Computers & Informatics (ISCI 05) Improving Image Segmentation Quality Via Graph Theory Xiangxiang Li, Songhao Zhu School of Automatic, Nanjing University of Post and Telecommunications,

More information

Part 5: Structured Support Vector Machines

Part 5: Structured Support Vector Machines Part 5: Structured Support Vector Machines Sebastian Nowozin and Christoph H. Lampert Providence, 21st June 2012 1 / 34 Problem (Loss-Minimizing Parameter Learning) Let d(x, y) be the (unknown) true data

More information

A METRIC LEARNING APPROACH FOR GRAPH-BASED

A METRIC LEARNING APPROACH FOR GRAPH-BASED A METRIC LEARNING APPROACH FOR GRAPH-BASED LABEL PROPAGATION Pauline Wauquier Clic And Walk 25 rue Corneille 59100 Roubaix France pauline.wauquier@inria.fr Mikaela Keller Univ. Lille, CNRS, Centrale Lille,

More information

Semi-supervised vs. Cross-domain Graphs for Sentiment Analysis

Semi-supervised vs. Cross-domain Graphs for Sentiment Analysis Semi-supervised vs. Cross-domain Graphs for Sentiment Analysis Natalia Ponomareva University of Wolverhampton, UK nata.ponomareva@wlv.ac.uk Mike Thelwall University of Wolverhampton, UK m.thelwall@wlv.ac.uk

More information

Constrained Metric Learning via Distance Gap Maximization

Constrained Metric Learning via Distance Gap Maximization Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) Constrained Metric Learning via Distance Gap Maximization Wei Liu, Xinmei Tian, Dacheng Tao School of Computer Engineering

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

on learned visual embedding patrick pérez Allegro Workshop Inria Rhônes-Alpes 22 July 2015

on learned visual embedding patrick pérez Allegro Workshop Inria Rhônes-Alpes 22 July 2015 on learned visual embedding patrick pérez Allegro Workshop Inria Rhônes-Alpes 22 July 2015 Vector visual representation Fixed-size image representation High-dim (100 100,000) Generic, unsupervised: BoW,

More information

Large-Scale Face Manifold Learning

Large-Scale Face Manifold Learning Large-Scale Face Manifold Learning Sanjiv Kumar Google Research New York, NY * Joint work with A. Talwalkar, H. Rowley and M. Mohri 1 Face Manifold Learning 50 x 50 pixel faces R 2500 50 x 50 pixel random

More information

Transductive Classification Methods for Mixed Graphs

Transductive Classification Methods for Mixed Graphs Transductive Classification Methods for Mixed Graphs S Sundararajan Yahoo! Labs Bangalore, India ssrajan@yahoo-inc.com S Sathiya Keerthi Yahoo! Labs Santa Clara, CA selvarak@yahoo-inc.com ABSTRACT In this

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Computationally Efficient M-Estimation of Log-Linear Structure Models

Computationally Efficient M-Estimation of Log-Linear Structure Models Computationally Efficient M-Estimation of Log-Linear Structure Models Noah Smith, Doug Vail, and John Lafferty School of Computer Science Carnegie Mellon University {nasmith,dvail2,lafferty}@cs.cmu.edu

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Semi-supervised Learning

Semi-supervised Learning Semi-supervised Learning Piyush Rai CS5350/6350: Machine Learning November 8, 2011 Semi-supervised Learning Supervised Learning models require labeled data Learning a reliable model usually requires plenty

More information

Introduction to Text Mining. Hongning Wang

Introduction to Text Mining. Hongning Wang Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:

More information

Object Classification Problem

Object Classification Problem HIERARCHICAL OBJECT CATEGORIZATION" Gregory Griffin and Pietro Perona. Learning and Using Taxonomies For Fast Visual Categorization. CVPR 2008 Marcin Marszalek and Cordelia Schmid. Constructing Category

More information

SVMs and Data Dependent Distance Metric

SVMs and Data Dependent Distance Metric SVMs and Data Dependent Distance Metric N. Zaidi, D. Squire Clayton School of Information Technology, Monash University, Clayton, VIC 38, Australia Email: {nayyar.zaidi,david.squire}@monash.edu Abstract

More information

Applied Statistics for Neuroscientists Part IIa: Machine Learning

Applied Statistics for Neuroscientists Part IIa: Machine Learning Applied Statistics for Neuroscientists Part IIa: Machine Learning Dr. Seyed-Ahmad Ahmadi 04.04.2017 16.11.2017 Outline Machine Learning Difference between statistics and machine learning Modeling the problem

More information

Semi-supervised learning and active learning

Semi-supervised learning and active learning Semi-supervised learning and active learning Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Combining classifiers Ensemble learning: a machine learning paradigm where multiple learners

More information

Clustering Lecture 5: Mixture Model

Clustering Lecture 5: Mixture Model Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics

More information

Statistical parsing. Fei Xia Feb 27, 2009 CSE 590A

Statistical parsing. Fei Xia Feb 27, 2009 CSE 590A Statistical parsing Fei Xia Feb 27, 2009 CSE 590A Statistical parsing History-based models (1995-2000) Recent development (2000-present): Supervised learning: reranking and label splitting Semi-supervised

More information

Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data ABSTRACT Frank Lin Carnegie Mellon University 5 Forbes Ave. Pittsburgh, PA 15213 frank@cs.cmu.edu Graph-based semi-supervised

More information

Robot Learning. There are generally three types of robot learning: Learning from data. Learning by demonstration. Reinforcement learning

Robot Learning. There are generally three types of robot learning: Learning from data. Learning by demonstration. Reinforcement learning Robot Learning 1 General Pipeline 1. Data acquisition (e.g., from 3D sensors) 2. Feature extraction and representation construction 3. Robot learning: e.g., classification (recognition) or clustering (knowledge

More information

Learning Two-View Stereo Matching

Learning Two-View Stereo Matching Learning Two-View Stereo Matching Jianxiong Xiao Jingni Chen Dit-Yan Yeung Long Quan Department of Computer Science and Engineering The Hong Kong University of Science and Technology The 10th European

More information

Bipartite Edge Prediction via Transductive Learning over Product Graphs

Bipartite Edge Prediction via Transductive Learning over Product Graphs Bipartite Edge Prediction via Transductive Learning over Product Graphs Hanxiao Liu, Yiming Yang School of Computer Science, Carnegie Mellon University July 8, 2015 ICML 2015 Bipartite Edge Prediction

More information

Day 3 Lecture 1. Unsupervised Learning

Day 3 Lecture 1. Unsupervised Learning Day 3 Lecture 1 Unsupervised Learning Semi-supervised and transfer learning Myth: you can t do deep learning unless you have a million labelled examples for your problem. Reality You can learn useful representations

More information

K Nearest Neighbor Wrap Up K- Means Clustering. Slides adapted from Prof. Carpuat

K Nearest Neighbor Wrap Up K- Means Clustering. Slides adapted from Prof. Carpuat K Nearest Neighbor Wrap Up K- Means Clustering Slides adapted from Prof. Carpuat K Nearest Neighbor classification Classification is based on Test instance with Training Data K: number of neighbors that

More information

Metric Learning for Large-Scale Image Classification:

Metric Learning for Large-Scale Image Classification: Metric Learning for Large-Scale Image Classification: Generalizing to New Classes at Near-Zero Cost Florent Perronnin 1 work published at ECCV 2012 with: Thomas Mensink 1,2 Jakob Verbeek 2 Gabriela Csurka

More information

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models DB Tsai Steven Hillion Outline Introduction Linear / Nonlinear Classification Feature Engineering - Polynomial Expansion Big-data

More information

Lecture 9: Support Vector Machines

Lecture 9: Support Vector Machines Lecture 9: Support Vector Machines William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 8 What we ll learn in this lecture Support Vector Machines (SVMs) a highly robust and

More information

Guiding Semi-Supervision with Constraint-Driven Learning

Guiding Semi-Supervision with Constraint-Driven Learning Guiding Semi-Supervision with Constraint-Driven Learning Ming-Wei Chang 1 Lev Ratinov 2 Dan Roth 3 1 Department of Computer Science University of Illinois at Urbana-Champaign Paper presentation by: Drew

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

Graph Transduction via Alternating Minimization

Graph Transduction via Alternating Minimization Jun Wang Department of Electrical Engineering, Columbia University Tony Jebara Department of Computer Science, Columbia University Shih-Fu Chang Department of Electrical Engineering, Columbia University

More information

Graph Transduction via Alternating Minimization

Graph Transduction via Alternating Minimization Jun Wang Department of Electrical Engineering, Columbia University Tony Jebara Department of Computer Science, Columbia University Shih-Fu Chang Department of Electrical Engineering, Columbia University

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Collaborative Filtering: A Comparison of Graph-Based Semi-Supervised Learning Methods and Memory-Based Methods

Collaborative Filtering: A Comparison of Graph-Based Semi-Supervised Learning Methods and Memory-Based Methods 70 Computer Science 8 Collaborative Filtering: A Comparison of Graph-Based Semi-Supervised Learning Methods and Memory-Based Methods Rasna R. Walia Collaborative filtering is a method of making predictions

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 22, 2016 Course Information Website: http://www.stat.ucdavis.edu/~chohsieh/teaching/ ECS289G_Fall2016/main.html My office: Mathematical Sciences

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important

More information

Learning a Distance Metric for Structured Network Prediction. Stuart Andrews and Tony Jebara Columbia University

Learning a Distance Metric for Structured Network Prediction. Stuart Andrews and Tony Jebara Columbia University Learning a Distance Metric for Structured Network Prediction Stuart Andrews and Tony Jebara Columbia University Outline Introduction Context, motivation & problem definition Contributions Structured network

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Robust Semi-Supervised Learning through Label Aggregation

Robust Semi-Supervised Learning through Label Aggregation Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Robust Semi-Supervised Learning through Label Aggregation Yan Yan, Zhongwen Xu, Ivor W. Tsang, Guodong Long, Yi Yang Centre

More information

Data Clustering. Danushka Bollegala

Data Clustering. Danushka Bollegala Data Clustering Danushka Bollegala Outline Why cluster data? Clustering as unsupervised learning Clustering algorithms k-means, k-medoids agglomerative clustering Brown s clustering Spectral clustering

More information

On Information-Maximization Clustering: Tuning Parameter Selection and Analytic Solution

On Information-Maximization Clustering: Tuning Parameter Selection and Analytic Solution ICML2011 Jun. 28-Jul. 2, 2011 On Information-Maximization Clustering: Tuning Parameter Selection and Analytic Solution Masashi Sugiyama, Makoto Yamada, Manabu Kimura, and Hirotaka Hachiya Department of

More information

Structured Learning. Jun Zhu

Structured Learning. Jun Zhu Structured Learning Jun Zhu Supervised learning Given a set of I.I.D. training samples Learn a prediction function b r a c e Supervised learning (cont d) Many different choices Logistic Regression Maximum

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information

Global Metric Learning by Gradient Descent

Global Metric Learning by Gradient Descent Global Metric Learning by Gradient Descent Jens Hocke and Thomas Martinetz University of Lübeck - Institute for Neuro- and Bioinformatics Ratzeburger Allee 160, 23538 Lübeck, Germany hocke@inb.uni-luebeck.de

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning

Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning Qimai Li, 1 Zhichao Han, 1,2 Xiao-Ming Wu 1 1 The Hong

More information

Spectral Clustering. Presented by Eldad Rubinstein Based on a Tutorial by Ulrike von Luxburg TAU Big Data Processing Seminar December 14, 2014

Spectral Clustering. Presented by Eldad Rubinstein Based on a Tutorial by Ulrike von Luxburg TAU Big Data Processing Seminar December 14, 2014 Spectral Clustering Presented by Eldad Rubinstein Based on a Tutorial by Ulrike von Luxburg TAU Big Data Processing Seminar December 14, 2014 What are we going to talk about? Introduction Clustering and

More information

Semi-supervised Learning by Sparse Representation

Semi-supervised Learning by Sparse Representation Semi-supervised Learning by Sparse Representation Shuicheng Yan Huan Wang Abstract In this paper, we present a novel semi-supervised learning framework based on l 1 graph. The l 1 graph is motivated by

More information

Clustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic

Clustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic Clustering SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic Clustering is one of the fundamental and ubiquitous tasks in exploratory data analysis a first intuition about the

More information

无监督学习中的选代表和被代表问题 - AP & LLE 张响亮. Xiangliang Zhang. King Abdullah University of Science and Technology. CNCC, Oct 25, 2018 Hangzhou, China

无监督学习中的选代表和被代表问题 - AP & LLE 张响亮. Xiangliang Zhang. King Abdullah University of Science and Technology. CNCC, Oct 25, 2018 Hangzhou, China 无监督学习中的选代表和被代表问题 - AP & LLE 张响亮 Xiangliang Zhang King Abdullah University of Science and Technology CNCC, Oct 25, 2018 Hangzhou, China Outline Affinity Propagation (AP) [Frey and Dueck, Science, 2007]

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important

More information

Graph Laplacian Kernels for Object Classification from a Single Example

Graph Laplacian Kernels for Object Classification from a Single Example Graph Laplacian Kernels for Object Classification from a Single Example Hong Chang & Dit-Yan Yeung Department of Computer Science, Hong Kong University of Science and Technology {hongch,dyyeung}@cs.ust.hk

More information

DeepWalk: Online Learning of Social Representations

DeepWalk: Online Learning of Social Representations DeepWalk: Online Learning of Social Representations ACM SIG-KDD August 26, 2014, Rami Al-Rfou, Steven Skiena Stony Brook University Outline Introduction: Graphs as Features Language Modeling DeepWalk Evaluation:

More information

Cluster Analysis (b) Lijun Zhang

Cluster Analysis (b) Lijun Zhang Cluster Analysis (b) Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Grid-Based and Density-Based Algorithms Graph-Based Algorithms Non-negative Matrix Factorization Cluster Validation Summary

More information

Machine Learning: Think Big and Parallel

Machine Learning: Think Big and Parallel Day 1 Inderjit S. Dhillon Dept of Computer Science UT Austin CS395T: Topics in Multicore Programming Oct 1, 2013 Outline Scikit-learn: Machine Learning in Python Supervised Learning day1 Regression: Least

More information