A Co-Clustering approach for Sum-Product Network Structure Learning

Size: px

Start display at page:

Download "A Co-Clustering approach for Sum-Product Network Structure Learning"

Christopher Walters
5 years ago
Views:

1 Università degli Studi di Bari Dipartimento di Informatica LACAM Machine Learning Group A Co-Clustering approach for Sum-Product Network Antonio Vergari Nicola Di Mauro Floriana Esposito December 8, 2014

2 Outline Introducing (SPNs) definition and properties inference Learning the structure of SPNs current state of the art affinities with hierarchical co-clustering sketching a proposal Experimentation experimental setting results Further Works and Conclusions

3 Probabilistic Graphical Models They can compactly represent joint probability distributions but inference is potentially intractable, exponential in the treewidth One of the bottlenecks is the computation of the partition function: Z = ϕ C (x C ) x X C C

4 X A tractable distribution is an SPN Robert Gens and Pedro Domingos [2013] ``Learning the Structure of '' In: ICML (3), pp

5 X X Y A tractable distribution is an SPN A product of SPNs over different scopes is an SPN (decomposability) Robert Gens and Pedro Domingos [2013] ``Learning the Structure of '' In: ICML (3), pp

6 w 1 w 2 X X Y X Y X Y A tractable distribution is an SPN A product of SPNs over different scopes is an SPN (decomposability) A weighted sum (w i 0) of SPN over the same scope is an SPN (completeness) Nothing else is an SPN Robert Gens and Pedro Domingos [2013] ``Learning the Structure of '' In: ICML (3), pp

7 w 1 w 2 X X Y X Y X Y A tractable distribution is an SPN A product of SPNs over different scopes is an SPN (decomposability) A weighted sum (w i 0) of SPN over the same scope is an SPN (completeness) Nothing else is an SPN They compactly encode the network polynomial over X (multilinear function) Robert Gens and Pedro Domingos [2013] ``Learning the Structure of '' In: ICML (3), pp

8 Inference Given an SPN S over rvs X, these measures are computable in time linear to edges(s) : the partition function of the distribution over X: Z = S( ) exact marginal probabilities for an evidence e: P r(e) = S(e)/S( ) MPE probability for an evidence e and query q (converting S into S max by replacing sum nodes with max nodes) MP E(q, e) = max P r(q, e) = q Smax (e)

9 Inference (example) To compute the marginal probability P r(x) = y Y P r(x, y): X Y X Y

10 Inference (example) To compute the marginal probability P r(x) = y Y P r(x, y): Set the leaf values for X and marginalize over Y by setting their probabilities to X Y X Y 081 1

11 Inference (example) To compute the marginal probability P r(x) = y Y P r(x, y): Set the leaf values for X and marginalize over Y by setting their probabilities to 1 Propagate probabilities (computing sums and products bottom-up) X Y X Y 081 1

12 Inference (example) To compute the marginal probability P r(x) = y Y P r(x, y): Set the leaf values for X and marginalize over Y by setting their probabilities to 1 Propagate probabilities (computing sums and products bottom-up) The exact values is the root value X Y X Y 081 1

13 Deep Architectures Exploit local interactions to save computations by layering a deep architecture X1 X1 X2 X2 X3 X3 X4 X4 X5 X1 X1 X2 X2 X3 X3 X5 X4 Learning a compact structure equals to discover these interactions X5 X5 X4

14 Building the network top down or bottom up by clustering features and/or instances: KMeans on features (k = 2, euclidean distance) to discover similarities, adding layers of sum nodes (fixed length, fully connected)[dennis and Ventura 2012] Merging feature regions bottom-up by a Bayesian-Dirichlet independence test, adding sum layers (fixed length) and reducing edges by maximizing Mutual Information (Information Bottleneck) [Peharz, Geiger, and Pernkopf 2013] LearnSPN: alternating splitting instances (clustering by similarity) and features (independence checking) in a greedy way Builds a tree-like SPN with univariate distributions at the leaves estimated from data [Gens and Domingos 2013]

15 LearnSPN Build a tree-like SPN that maximizes the log-likelihood of the data by alternating splits on instances and features Online Hard-EM with restarts to cluster instances T with an exp prior (P r(c i ) e λ C i X ) on clusters to avoid overfitting: P r(x) = C i C X j X P r(x j C i )P r(c i ) found clusters become sum node children with parameters w i = C i / T Features X are clustered into independent components via a greedy procedure based on a G Test over pairs X i,x j : G(X i, X j ) = 2 x i X i which are associated to product node children c(x i, x j ) log c(x i, x j ) T c(x x j X i )c(x j ) j If T < m consider features independent and create leaves by smooth laplacian frequence estimation (with parameter α)

16 LearnSPN (example) 1 X Y Z W

17 LearnSPN (example) 1 X Y Z W

18 LearnSPN (example) 1 X Y Z W Z W X Y Z W Z 8

19 A Co-Clustering Analogy In the end, LearnSPN builds a network by partitioning the data matrix into blocks by highlighting local interactions One could discover both row and column interactions by the means of a co-clustering technique (co-clusters as blocks in the partitioning) To preserve the decomposability and completeness of the resulting SPN: co-clusters shall be non overlapping the union of the co-clusters shall reconstruct the whole data matrix To have a deep architecture one needs a hierarchy of co-clusters Such several out-of-the box co-clustering algorithms could be used to build SPNs but rows are have not the same meaning of columns! (see caveat)

20 From a Hierarchical Co-Clustering to an SPN Considering a co-cluster hierarchy as represented by two cluster hierarchies (on rows and columns), the construction uses the two hierarchies and the matrix decomposition intuition from LearnSPN Traverse the hierarchies top-down for at most k max levels (to avoid overfitting) and : add a sum node over a split in the row hierarchy estimating the child weights as the proportions of instances in the split add a product node over a split in the column hierarchy if a split contains less than m instances add a product node over single column splits if a split contains only a feature, add a leaf node estimating the univariate distribution with a Laplacian smoothing

21 Hierarchies translation (example) {1, 2, 3, 4} {X, Y, Z, W} {1, 2} {3, 4} {X, Y, Z} {W} {1} {2} {3} {4} {X, Z} {Y} {X} {Z}

22 Hierarchies translation (example) {1, 2, 3, 4} {X, Y, Z, W} {1, 2} {3, 4} {X, Y, Z} {W} {1} {2} {3} {4} {X, Z} {Y} {X} {Z} {1, 2, 3, 4} {X, Y, Z, W}

23 Hierarchies translation (example) {1, 2, 3, 4} {X, Y, Z, W} {1, 2} {3, 4} {X, Y, Z} {W} {1} {2} {3} {4} {X, Z} {Y} {X} {Z} {1, 2} {3, 4} {X, Y, Z, W} {X, Y, Z, W}

24 Hierarchies translation (example) {1, 2, 3, 4} {X, Y, Z, W} {1, 2} {3, 4} {X, Y, Z} {W} {1} {2} {3} {4} {X, Z} {Y} {X} {Z} {1, 2} {1, 2} {3, 4} {3, 4} {X, Y, Z} {W} {X, Y, Z} {W}

25 Co-clustering & SPNs: caveats X Y Z W X Y Z W X Y Z W X Y X Y Z W Z W

26 HiCC-SPN I First co-clustering algorithm used in the experimentation: HiCC Hierarchical Incremental Co-Clustering [Ienco, Pensa, and Meo 2009] almost parameter-free (only the number of iterations to get initial clusters) optimizing an association measure, Goodman-Kruskal's τ divergence: τ CR C C = E C R E CR C C E CR builds an initial co-clustering then recursively splits each partition (yielding two hierarchies) using a Stochastic Local Search approach to find an optimal partitioning: moving an element (row or column) from one cluster to another (cluster creation or deletion are allowed) conditioning on the previous clustering on columns (and vice versa) using restarts for hierarchy levels to differentiate the search

27 HiCC-SPN II Adding a constraint to column splits to preserve independence by checking a conditional mutual information criterion Given a row clustering R, we allow feature x (column) to be moved from cluster C i to C j whether: 1 C i x i C i I(x; x i R) > 1 C j 1 x x j C j I(x; x j R) I(x; y R) = H(x R) H(x y, R) = p(r) p(x i, y j R) p(x i, y j R) log p(x R R x i x y j y i R)p(y j R) where p(r), the prior on the cluster R on the row partitioning R is estimated as the cluster size proportion and p(x i, y j R), p(x i R) and p(y j R) are estimated as the frequencies of the values seen in R

28 Comparing LearnSPN ans HiCC-SPN in the generative framework of graphical models structure learning [Gens and Domingos 2013]: comparing the average log-likelihood on predicting instances from a test set 11 different datasets with binary features, standard in PGMs comparisons [Lowd and Davis 2010] [Haaren and Davis 2012] ranging from classification, recommending, frequent pattern mining 16 to 500 features, 1600 to instances, 001 to 05 density Training 75% Validation 10% Test 15% (no cv) Model selection via grid search in this parameter space: for LearnSPN: λ {02, 06, 08}, ρ {5, 10, 20}, m {1, 100, 400}, α {001, 01, 10} for HiCC-SPN: k {1 : 10}, m {1, 100, 400}, α {0001, 001, 01}

29 Experiment #1: results X T tr T v T te LearnSPN HiCC-SPN NLTCS ± ±311 Plants ± ±970 Audio ± ±1847 Jester ± ±1082 Netflix ± ±590 Accidents ± ±640 Retail ± ±727 Pumsb-star ± ±1269 DNA ± ±67 Book ± ±4733 EachMovie ± ±5635

Conclusions and Further Works Translating a co-cluster hierarchy into an SPN could be promising, the exact and tractable inference could be derived given a row and column partitioning but not every

30 Conclusions and Further Works Translating a co-cluster hierarchy into an SPN could be promising, the exact and tractable inference could be derived given a row and column partitioning but not every co-clustering algorithm is directly usable, plus some representational issues have to be taken into account We are now investigating nested partition models [Rodriguez and Ghosh 2012] that allow for guillotine splits of the data matrix in order to better capture the underlying latent interactions and have deeper insights into SPN structure learning

31 References References Dennis, Aaron and Dan Ventura (2012) ``Learning the Architecture of Using Clustering on Varibles'' In: Advances in Neural Information Processing Systems 25 (NIPS 2012) (cit on p 14) Gens, Robert and Pedro Domingos (2013) ``Learning the Structure of '' In: ICML (3), pp (cit on pp 4 7, 14, 28) Haaren, Jan Van and Jesse Davis (2012) ``Markov Network : A Randomized Feature Generation Approach'' In: AAAI (cit on p 28) Ienco, Dino, Ruggero G Pensa, and Rosa Meo (2009) ``Parameter-Free Hierarchical Co-clustering by n-ary Splits'' In: ECML/PKDD (1) (cit on p 26) Lowd, Daniel and Jesse Davis (2010) ``Learning Markov Network Structure with Decision Trees'' In: Proceedings of the Twenty-seventh International Conference on Machine Learning (cit on p 28) Peharz, Robert, Bernhard Geiger, and Franz Pernkopf (2013) ``Greedy Part-Wise Learning of '' In: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD 2013) (cit on p 14) Rodriguez, Abel and Kaushik Ghosh (2012) Nested Partition Models (Cit on p 30)

32 References Discussion 1 2 X Y Z W Z W X Y Z W Z {1, 2} {1, 2} {3, 4} {3, 4} {X, Y, Z} {W} {X, Y, Z} {W}

Learning the Structure of Sum-Product Networks. Robert Gens Pedro Domingos

Learning the Structure of Sum-Product Networks Robert Gens Pedro Domingos w 20 10x O(n) X Y LL PLL CLL CMLL Motivation SPN Structure Experiments Review Learning Graphical Models Representation Inference