Learning the Structure of Sum-Product Networks. Robert Gens, Pedro Domingos


1 Learning the Structure of Sum-Product Networks. Robert Gens, Pedro Domingos

2 Outline: Motivation, SPN Review, Structure Learning, Experiments

3 Graphical Models. Representation: compact and expressive, with global independence. Inference: exponential in the treewidth of the graph. Learning: extremely difficult, because learning requires inference, approximate inference is unreliable, and hidden variables mean there is no global optimum.

4 Sum-Product Networks. Representation: compact and expressive, with local independence. Inference: linear in the number of edges. Learning: much easier, because inference is exact; only weight learning to date.

5 This Paper: SPN Structure Learning. A general-purpose SPN structure learner: handles discrete or continuous variables, learns layers of hidden variables, fully leverages context-specific independence, and is simple and intuitive; the result is 1-3 orders of magnitude faster, and more accurate, at query time.

6 Outline (repeated as a section divider): Motivation, SPN Review, Structure Learning, Experiments

7-8 Compactly Representable Probability Distributions. [Venn diagram: graphical models, sum-product networks, and existing tractable models as overlapping families within the space of compactly representable distributions; SPNs and the existing tractable models admit exact inference in time linear in network size.]

9-12 A Univariate Distribution Is an SPN. [Build sequence: a single leaf node over variable X; the leaf can be any univariate distribution, e.g. multinomial, Gaussian, or Poisson.]

13-15 A Product of SPNs over Disjoint Variables Is an SPN. [Build sequence: a product node whose children are SPNs over the disjoint variable sets X and Y.]

16-17 A Weighted Sum of SPNs over the Same Variables Is an SPN. [Build sequence: a sum node with weights w_1, w_2 over two SPNs on the same variables X, Y; the sum node implicitly sums out a hidden mixture variable.]

18 [Figure: composing sums and products recursively yields a deep SPN over variables X1-X6.]

19-28 All Marginals Are Computable in Linear Time. [Animation: to answer a query such as P(X = 0), set each evidence variable's leaf to the probability of its observed value and each unobserved variable's leaf to 1 (marginalizing it out), then evaluate the network bottom-up through the weighted sums and products; a single pass linear in the number of edges yields the marginal.]
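The linear-time evaluation these slides animate is easy to make concrete. Below is a minimal sketch, not the authors' implementation: hypothetical Leaf, Product, and Sum classes for a tree-structured SPN, where an observed variable's leaf returns the probability of its value and a marginalized variable's leaf returns 1.

```python
# Minimal SPN sketch; hypothetical classes, not the authors' implementation.
class Leaf:
    """Univariate distribution over one variable, as a value -> probability table."""
    def __init__(self, var, dist):
        self.var, self.dist = var, dist

    def value(self, evidence):
        # Evidence leaf: probability of the observed value.
        # Marginalized leaf: summing over all values gives 1.
        return self.dist[evidence[self.var]] if self.var in evidence else 1.0

class Product:
    """Product over SPNs with disjoint variable scopes."""
    def __init__(self, children):
        self.children = children

    def value(self, evidence):
        result = 1.0
        for child in self.children:
            result *= child.value(evidence)
        return result

class Sum:
    """Weighted sum over SPNs with identical scopes; weights sum to 1."""
    def __init__(self, weighted_children):           # list of (weight, child)
        self.weighted_children = weighted_children

    def value(self, evidence):
        return sum(w * c.value(evidence) for w, c in self.weighted_children)

# A two-component mixture over {X, Y} in the spirit of the slides (numbers invented).
x1, y1 = Leaf('X', {0: 0.2, 1: 0.8}), Leaf('Y', {0: 0.4, 1: 0.6})
x2, y2 = Leaf('X', {0: 0.7, 1: 0.3}), Leaf('Y', {0: 0.9, 1: 0.1})
root = Sum([(0.6, Product([x1, y1])), (0.4, Product([x2, y2]))])

print(root.value({'X': 0, 'Y': 1}))   # joint P(X=0, Y=1)
print(root.value({'X': 0}))           # marginal P(X=0); Y summed out at its leaf
```

Each node is visited once, so evaluation is linear in the number of edges; on a DAG with shared children, memoizing node values preserves that guarantee.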

29-39 All MAP States Are Computable in Linear Time. [Animation: to compute max_{y,h} P(X = 0, Y = y, H = h), where H stands for the hidden mixture variables, replace each sum with a max, set evidence leaves to their observed values and unobserved leaves to their modes, and evaluate bottom-up (0.12 in the example); following the maximizing children back down reads off the MAP state.]
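The MAP computation is the same structure with sums replaced by maxes. A hedged sketch reusing the classes above; the choices dict keyed by id(node) is an illustrative device for recording each sum's maximizing child, not part of any standard API.

```python
# MAP sketch, reusing Leaf, Product, Sum, and root from the previous block.
def max_value(node, evidence, choices):
    """Upward pass: max instead of sum; mode probability at open leaves."""
    if isinstance(node, Leaf):
        if node.var in evidence:
            return node.dist[evidence[node.var]]
        return max(node.dist.values())                    # mode probability
    if isinstance(node, Product):
        result = 1.0
        for child in node.children:
            result *= max_value(child, evidence, choices)
        return result
    scored = [(w * max_value(c, evidence, choices), c)    # sum node -> max
              for w, c in node.weighted_children]
    best_score, best_child = max(scored, key=lambda t: t[0])
    choices[id(node)] = best_child                        # remember the argmax
    return best_score

def map_state(node, evidence, choices, assignment):
    """Downward pass: follow recorded max choices; take modes at open leaves."""
    if isinstance(node, Leaf):
        assignment[node.var] = (evidence[node.var] if node.var in evidence
                                else max(node.dist, key=node.dist.get))
    elif isinstance(node, Product):
        for child in node.children:
            map_state(child, evidence, choices, assignment)
    else:
        map_state(choices[id(node)], evidence, choices, assignment)
    return assignment

choices = {}
best = max_value(root, {'X': 0}, choices)     # max over Y and the mixture variable
print(best, map_state(root, {'X': 0}, choices, {}))
```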

40-45 What Does an SPN Mean? Products = features, sums = clusterings. [Image example: the root sum is a mixture over scene decompositions (hay, greenery, truck, sign, flowers, gravel); intermediate sums act as pooling over alternatives (calf, cow, bull, person patting, person on bike); products are part decompositions (tail, body, legs, head; arms, body, legs, head).]

46 Special Cases: hierarchical mixture models; thin junction trees (e.g., hidden Markov models); non-recursive probabilistic context-free grammars; etc.

47-48 Weight Learning. Generative (Poon & Domingos, UAI 2011; UAI 2011 Best Paper Award). Discriminative (Gens & Domingos, NIPS 2012; NIPS 2012 Outstanding Student Paper Award). Key limitation: both require a structure to be given.

49 Outline (section divider): Motivation, SPN Review, Structure Learning, Experiments

50-53 LearnSPN: a recursive algorithm over a matrix of training instances (rows) by variables (columns). Base case: if a single variable X_1 remains, return a univariate distribution over it (multinomial, Gaussian, Poisson, ...).

54-57 Recursive cases: first try to establish approximate independence between subsets of the variables; if it is found, return a product node whose children are recursive calls on each subset. If no independence is found, cluster similar instances and return a sum node with weights w_1, w_2, w_3 whose children are recursive calls on each cluster.

58-61 Limiting behavior: if no dependencies are ever detected, the result is a fully factorized distribution; if independence is never found, the result is a kernel density estimate, a weighted sum (w_1, ..., w_7) with one component per training instance.

62-65 [Animation recap: the variable-split and instance-clustering steps are applied recursively inside each subset and cluster, growing the SPN downward.]

66-72 [Animation: alternating instance clustering and variable splits carve the data matrix into nested blocks; introducing a latent cluster variable C for each sum, the corresponding graphical model would have high treewidth, yet the SPN remains tractable.]

73 LearnSPN(training set): if |V| = 1, return a univariate distribution over the remaining variable; otherwise, try to establish approximate independence among the variables and return a product node over recursive calls on the subsets; if no independence is found, cluster similar instances and return a weighted sum node (w_1, w_2, w_3) over recursive calls on the clusters.
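In outline, the recursion reads as the sketch below, assuming the Leaf, Product, and Sum classes sketched earlier. split_variables and cluster_instances are the steps detailed on the following slides (plausible realizations are sketched there), and the single-cluster fallback to a fully factorized model stands in for the degenerate cases on slides 58-61.

```python
# LearnSPN sketch (slide 73). Rows are instances; variables are column indices.
def fit_univariate(instances, var):
    # Base case: maximum-likelihood multinomial over the single variable var.
    counts = {}
    for x in instances:
        counts[x[var]] = counts.get(x[var], 0) + 1
    n = len(instances)
    return Leaf(var, {v: c / n for v, c in counts.items()})

def learn_spn(instances, variables):
    if len(variables) == 1:
        return fit_univariate(instances, variables[0])      # base case: leaf

    groups = split_variables(instances, variables)          # approx. independence
    if len(groups) > 1:
        return Product([learn_spn(instances, g) for g in groups])

    clusters = cluster_instances(instances)                 # similar instances
    if len(clusters) == 1:                                  # fall back: factorize
        return Product([fit_univariate(instances, v) for v in variables])
    n = len(instances)
    # Sum weights are the cluster proportions, i.e. the hard-EM mixture priors.
    return Sum([(len(c) / n, learn_spn(c, variables)) for c in clusters])
```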

74-77 Variable Splits (×). Test pairwise independence between variables, with the p-value threshold tuned on a validation set; the variables X_1, ..., X_7 are grouped into the connected components of the resulting dependency graph, and LearnSPN recurses on each component.
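One plausible realization of the split_variables helper from the sketch above, assuming binary data and variables given as column indices; a chi-square test stands in for whatever pairwise test the authors actually use.

```python
# Variable-split sketch: pairwise independence tests, then connected components.
from itertools import combinations
import numpy as np
from scipy.stats import chi2_contingency

def split_variables(data, variables, alpha=0.001):
    """Group variables into connected components of a dependency graph whose
    edges are pairs where the independence test rejects at level alpha."""
    data = np.asarray(data)
    adj = {v: set() for v in variables}
    for a, b in combinations(variables, 2):
        table = np.ones((2, 2))                  # add-one smoothed 2x2 counts
        for xa, xb in zip(data[:, a], data[:, b]):
            table[int(xa), int(xb)] += 1
        _, p, _, _ = chi2_contingency(table)
        if p < alpha:                            # dependent: connect a and b
            adj[a].add(b)
            adj[b].add(a)
    groups, seen = [], set()                     # depth-first components
    for v in variables:
        if v in seen:
            continue
        stack, comp = [v], []
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                comp.append(u)
                stack.extend(adj[u] - seen)
        groups.append(comp)
    return groups
```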

78-86 Instance Clustering (+). Cluster the instances m_1, ..., m_7 with online hard EM on a naive Bayes mixture model, P(V) = Σ_i P(C_i) Π_j P(X_j | C_i), with a cluster penalty tuned on the validation set; the learned sum-node weights are the mixture priors P(C_i). [Animation: the instances are assigned to clusters C_1, C_2, C_3 and LearnSPN recurses on each cluster.]
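One plausible realization of the cluster_instances helper: a minimal online hard EM sketch in which every constant is illustrative rather than taken from the paper.

```python
# Instance-clustering sketch: online hard EM on a naive Bayes mixture.
# Each instance joins the cluster with the highest (smoothed, unnormalized)
# log P(C_i) + sum_j log P(x_j | C_i); opening a new cluster is discounted
# by an exponential cluster penalty.
import math
from collections import defaultdict

def cluster_instances(instances, num_values=2, log_penalty=-40.0):
    clusters = []   # each: instance count, per-variable value counts, members

    def log_score(c, x):
        s = math.log(c['n'] + 1)                 # unnormalized log prior
        for j, v in enumerate(x):                # Laplace-smoothed likelihoods
            s += math.log((c['counts'][j][v] + 1) / (c['n'] + num_values))
        return s

    for x in instances:
        fresh = {'n': 0, 'counts': defaultdict(lambda: defaultdict(int)),
                 'members': []}
        candidates = clusters + [fresh]
        scores = [log_score(c, x) for c in candidates]
        scores[-1] += log_penalty                # penalize a new cluster
        best = candidates[scores.index(max(scores))]
        best['n'] += 1
        best['members'].append(x)
        for j, v in enumerate(x):
            best['counts'][j][v] += 1
        if best is fresh:
            clusters.append(fresh)
    return [c['members'] for c in clusters]
```

For brevity this sketch clusters on all columns of the rows it is given; inside the recursion, only the current variable subset would be used.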

87-89 LearnSPN Locally Optimizes Likelihood. Splitting variables X_i, X_j, X_k into a product incurs no loss of likelihood if the groups are truly independent; and the sum node returned for an instance clustering achieves at least the naive Bayes mixture likelihood, since each recursive call can only improve on its cluster's fully factorized model.
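The two claims can be written out in a couple of lines. The following is a sketch of the reasoning, with the notation V_k for variable groups and S_i for the SPN learned on cluster i introduced here rather than taken from the deck.

```latex
\begin{align*}
&\text{Variable split: if the groups } V_1,\dots,V_K \text{ are truly independent, then}\\
&\qquad \log P(x) \;=\; \sum_{k=1}^{K} \log P(x_{V_k}) \quad \text{for every instance } x,\\
&\text{so the product node loses no likelihood. Instance clustering: each recursive}\\
&\text{call } S_i \text{ can only improve on cluster } i\text{'s fully factorized model, hence}\\
&\qquad \sum_x \log \sum_i P(C_i) \prod_j P(x_j \mid C_i)
  \;\le\; \sum_x \log \sum_i P(C_i)\, S_i(x).
\end{align*}
```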

90 Outline (section divider): Motivation, SPN Review, Structure Learning, Experiments

91 Experiments: 20 datasets (collaborative filtering, click-through logs, nucleic acid sequences, ...), with 2k-388k instances each. Systems compared, as representation / learning / inference: SPN / LearnSPN / exact; Bayesian network / WinMine / Gibbs and loopy BP; Markov network / Della Pietra and L1 / Gibbs and loopy BP. Total: 1680 experiments.

92-93 Learning Results. [Charts: numbers of SPN wins, ties, and losses (p = 0.05) against WinMine on log-likelihood, and against Della Pietra and L1 on pseudo-log-likelihood.]

94-97 Inference: estimate P(Query | Evidence). Ten (query %, evidence %) proportions of the variables: 10/30, 20/30, 30/30, 40/30, 50/30, 30/0, 30/10, 30/20, 30/40, 30/50; for each proportion, 1000 queries are generated from the test set. Gibbs and loopy BP for each of WinMine, Della Pietra, and L1, times 20 datasets and 10 proportions, gives 1200 experiments (in addition to 400 for SPNs).

98 Inference Accuracy. [Plot: conditional log-likelihood vs. fraction of query variables (10%-50%) on EachMovie (500 variables), for SPN, WinMine, L1, and Della Pietra; the SPN curve is highest.]

99-104 Inference Time. [Scatter plot, log scales: Gibbs ms/query for WinMine, Della Pietra, and L1 vs. SPN ms/query across the datasets.] SPN inference is 1-3 orders of magnitude faster.

105 Inference Time: Loopy Belief Propagation (single-variable marginals). [Scatter plot: loopy BP ms/query for WinMine, Della Pietra, and L1 vs. SPN ms/query, with SPNs 10x-30x faster.] 3 learners x 20 datasets x 10 variable proportions = 600 experiments. SPNs had higher conditional marginal log-likelihood on 78% of experiments (95% vs. Gibbs).

106 Experiment Conclusions. SPN learning accuracy is comparable; SPN exact inference is 1-3 orders of magnitude faster, and more accurate; inference does not involve tuning settings or diagnosing convergence; inference takes a predictable amount of time. SPNs can now be applied to many domains.

107 Future Work: other SPN structure and weight learning algorithms; approximating intractable distributions with SPNs; parallelizing SPN learning and inference. Code and supplemental results are available at spn.cs.washington.edu/learnspn/
