Clustering Relational Data using the Infinite Relational Model

Ana Daglis
Supervised by: Matthew Ludkin
September 4, 2015

Outline
1. Clustering
2. Model
3. Gibbs Sampling: Methodology, Results
4. Methodology, Results
5. Future Work

Clustering

Cluster analysis: given unlabelled data, we want algorithms that automatically group the data points into coherent subsets (clusters).

Applications:
- recommendation engines (Netflix, iTunes, Quora, ...)
- image compression
- targeted marketing
- Google News

Infinite Relational Model

The Infinite Relational Model (IRM) is a model in which each node is assigned to a cluster. The number of clusters is not known initially and is learned from the data as part of the statistical inference.

The IRM is represented by the following parameters:
- $z_i$: the cluster containing node $i$, for $i = 1, \dots, n$.
- $\phi_{i,j}$: the probability of an edge between the $i$-th and $j$-th clusters.

Assumptions

Given the adjacency matrix of the graph, $X$, as our data, we assume that

$X_{i,j} \sim \mathrm{Bernoulli}(\phi_{z_i, z_j})$.

Since $z$ and $\phi$ are not known, hierarchical and beta priors, respectively, are imposed:

$z \sim \mathrm{CRP}(A)$,
$\phi_{i,j} \sim \mathrm{Beta}(a, b)$.

Chinese Restaurant Process (CRP(A))

The Chinese restaurant process is a discrete process whose value at time $n$ is a partition of $\{1, 2, \dots, n\}$.

At time $n = 1$, we have the trivial partition $\{\{1\}\}$.

At time $n + 1$, element $n + 1$ is either:
1. added to an existing block $b$ with probability $|b| / (n + A)$, where $|b|$ is the size of the block, or
2. placed in a completely new block with probability $A / (n + A)$.
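The seating rule above can be simulated directly. The following is a minimal sketch, not code from the talk; the function name `sample_crp` and its `seed` argument are illustrative assumptions.

```python
import random


def sample_crp(n, A, seed=None):
    """Draw a partition of {1, ..., n} from CRP(A) as a list of blocks."""
    rng = random.Random(seed)
    blocks = [[1]]                       # at time 1 the partition is {{1}}
    for m in range(1, n):                # m elements are already placed
        # Unnormalised weights: |b| for each existing block, A for a new one.
        # Dividing by (m + A) gives the probabilities stated in the text.
        weights = [len(b) for b in blocks] + [A]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(blocks):
            blocks.append([m + 1])       # open a new block
        else:
            blocks[k].append(m + 1)      # join existing block k
    return blocks


if __name__ == "__main__":
    print(sample_crp(10, A=1.0, seed=0))
```

Larger values of `A` make new blocks more likely, so the expected number of blocks grows with `A`.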

[Figure sequence: illustration of the Chinese restaurant process. Block probabilities shown on successive slides: $1$, $0$, $0$; then $\frac{1}{1+A}$, $\frac{A}{1+A}$, $0$; then $\frac{1}{2+A}$, $\frac{1}{2+A}$, $\frac{A}{2+A}$; then $\frac{1}{3+A}$, $\frac{2}{3+A}$, $\frac{A}{3+A}$.]

Gibbs Sampling

Want: a sample from the multivariate distribution of $\theta = (\theta_1, \theta_2, \dots, \theta_d)$.

Algorithm:
1. Initialize with $\theta = (\theta_1^{(0)}, \theta_2^{(0)}, \dots, \theta_d^{(0)})$.
2. For $i = 1, 2, \dots, n$:
   - Simulate $\theta_1^{(i)}$ from the conditional $\theta_1 \mid (\theta_2^{(i-1)}, \dots, \theta_d^{(i-1)})$.
   - Simulate $\theta_2^{(i)}$ from the conditional $\theta_2 \mid (\theta_1^{(i)}, \theta_3^{(i-1)}, \dots, \theta_d^{(i-1)})$.
   - ...
   - Simulate $\theta_d^{(i)}$ from the conditional $\theta_d \mid (\theta_1^{(i)}, \theta_2^{(i)}, \dots, \theta_{d-1}^{(i)})$.
3. Discard the first $k$ iterations and estimate the posterior distribution using $(\theta_1^{(k+1)}, \theta_2^{(k+1)}, \dots, \theta_d^{(k+1)}), \dots, (\theta_1^{(n)}, \theta_2^{(n)}, \dots, \theta_d^{(n)})$.
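As a toy illustration of the three steps, the sketch below runs a Gibbs sampler on a standard bivariate normal with correlation `rho`. The target is an assumption chosen for its simple conditionals, $\theta_1 \mid \theta_2 \sim N(\rho\,\theta_2,\ 1 - \rho^2)$ and symmetrically for $\theta_2$; it is not the model from the talk, and the function name is hypothetical.

```python
import random


def gibbs_bivariate_normal(n_iter, burn_in, rho, seed=None):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = (1.0 - rho ** 2) ** 0.5         # conditional standard deviation
    theta1, theta2 = 0.0, 0.0            # step 1: initial value
    samples = []
    for i in range(1, n_iter + 1):       # step 2: sweep over the coordinates
        theta1 = rng.gauss(rho * theta2, sd)   # uses theta2 from iteration i-1
        theta2 = rng.gauss(rho * theta1, sd)   # uses theta1 from iteration i
        if i > burn_in:                  # step 3: discard the first k draws
            samples.append((theta1, theta2))
    return samples


if __name__ == "__main__":
    draws = gibbs_bivariate_normal(n_iter=5000, burn_in=500, rho=0.5, seed=1)
    print(len(draws), draws[0])
```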

We use Gibbs sampling to infer the posterior distribution of $z$. The cluster assignments $z_i$ are iteratively sampled from their conditional distribution,

$P(z_i = k \mid z_{\setminus i}, X) \propto P(X \mid z)\, P(z_i = k \mid z_{\setminus i})$,

where $z_{\setminus i}$ denotes all cluster assignments except $z_i$.
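One way to evaluate this conditional is sketched below. It rests on assumptions the slides do not spell out: the graph is undirected without self-loops, and $\phi$ is integrated out using Beta-Bernoulli conjugacy, so that $P(X \mid z)$ is a product over cluster pairs of ratios of Beta functions. The CRP term is proportional to the size of cluster $k$ (excluding node $i$) for an existing cluster and to $A$ for a new one. The names `log_lik` and `gibbs_update` are hypothetical, and the code recomputes the full likelihood for each candidate, which is simple but slow.

```python
import math
import random


def log_beta(x, y):
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def log_lik(X, z, a=1.0, b=1.0):
    """log P(X | z) with phi integrated out (undirected, no self-loops)."""
    idx = {k: t for t, k in enumerate(sorted(set(z)))}
    K, n = len(idx), len(z)
    edges = [[0] * K for _ in range(K)]   # edges observed per cluster pair
    pairs = [[0] * K for _ in range(K)]   # node pairs per cluster pair
    for i in range(n):
        for j in range(i + 1, n):
            k, l = sorted((idx[z[i]], idx[z[j]]))
            edges[k][l] += X[i][j]
            pairs[k][l] += 1
    total = 0.0
    for k in range(K):
        for l in range(k, K):
            e, m = edges[k][l], pairs[k][l]
            total += log_beta(e + a, m - e + b) - log_beta(a, b)
    return total


def gibbs_update(X, z, i, A=1.0, a=1.0, b=1.0, rng=random):
    """Resample z[i] from P(z_i = k | z_without_i, X); returns a new list."""
    z = list(z)
    counts = {}
    for j, k in enumerate(z):
        if j != i:
            counts[k] = counts.get(k, 0) + 1
    candidates = list(counts) + [max(z) + 1]      # last entry: a new cluster
    log_w = []
    for k in candidates:
        z[i] = k
        prior = counts.get(k, A)                  # cluster size, or A if new
        log_w.append(math.log(prior) + log_lik(X, z, a, b))
    top = max(log_w)                              # stabilise before exp
    weights = [math.exp(w - top) for w in log_w]
    z[i] = rng.choices(candidates, weights=weights)[0]
    return z
```

A full Gibbs scan would call `gibbs_update` once for every node in turn.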

Simulated Data

We applied the Gibbs sampling algorithm to a simulated network with the following parameters:
- 96 nodes split into 6 blocks
- $\phi_{i,i} = 0.85$, for $i = 1, \dots, n$
- $\phi_{i,j} = 0.05$, for $i \neq j$
- $a = b = 1$ for a uniform prior
- $A = 1$.

[Figure (a): simulated network]

[Figure (b): supplied network]
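A network with these parameters could be generated along the following lines. This is a sketch under assumptions not stated on the slide: the six blocks have equal size (16 nodes each), and the graph is undirected with no self-loops. The function name `simulate_network` is illustrative.

```python
import random


def simulate_network(n=96, n_blocks=6, p_in=0.85, p_out=0.05, seed=None):
    """Return (X, z): a symmetric 0/1 adjacency matrix and block labels."""
    rng = random.Random(seed)
    size = n // n_blocks                          # assumed equal block sizes
    z = [min(i // size, n_blocks - 1) for i in range(n)]
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Edge probability depends only on the blocks of the endpoints.
            p = p_in if z[i] == z[j] else p_out
            X[i][j] = X[j][i] = int(rng.random() < p)
    return X, z


if __name__ == "__main__":
    X, z = simulate_network(seed=0)
    print(sum(map(sum, X)) // 2, "edges")
```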

[Figure: block structure obtained with the Gibbs sampler]

[Figure: trace plot of the number of blocks (x-axis: iteration; y-axis: number of blocks)]

Gibbs Sampling Summary

The algorithm fails to split the data into 6 clusters within the iterations run, and remains stuck in a five-cluster configuration for a long time.

The main problem with the Gibbs sampler is that it is slow to converge and often becomes trapped in a local mode (5 blocks in this case).

A possible improvement is the split-merge algorithm, which updates a group of nodes simultaneously and avoids these problems.

Methodology

Algorithm:
1. Select two distinct nodes, $i$ and $j$, uniformly at random.
2. If $i$ and $j$ belong to the same cluster, split that cluster into two by assigning its elements to either of the two clusters independently with equal probability.
3. If $i$ and $j$ belong to different clusters, merge those clusters.
4. Evaluate the Metropolis-Hastings acceptance probability. If the proposal is accepted, the new cluster assignment becomes the next state of the algorithm. Otherwise, the initial cluster assignment remains as the next state.

$a(z^*, z) = \min\left[1,\ \frac{q(z \mid z^*)\, P(z^*)\, L(X \mid z^*)}{q(z^* \mid z)\, P(z)\, L(X \mid z)}\right]$

where $z$ is the current assignment, $z^*$ the proposed assignment, $q$ the proposal probability, $P(z)$ the prior and $L(X \mid z)$ the likelihood.
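The four steps can be sketched as follows. Two details are assumptions rather than statements from the slide: in a split, nodes $i$ and $j$ are placed in different halves and only the remaining members are assigned by fair coin flips, so a split of a cluster with $m$ members has proposal probability $(1/2)^{m-2}$ while a merge is deterministic; and `log_post` is a hypothetical user-supplied function returning $\log P(z) + \log L(X \mid z)$. The probability of picking the pair $(i, j)$ is the same in both directions and cancels.

```python
import math
import random


def propose(z, i, j, rng):
    """Return (z_new, log q(z_new | z), log q(z | z_new)) for nodes i, j."""
    z_new = list(z)
    if z[i] == z[j]:                              # step 2: split
        new_label = max(z) + 1
        rest = [m for m in range(len(z)) if z[m] == z[i] and m not in (i, j)]
        z_new[j] = new_label                      # i stays, j starts new block
        for m in rest:
            if rng.random() < 0.5:
                z_new[m] = new_label
        log_q_fwd = len(rest) * math.log(0.5)
        log_q_rev = 0.0                           # merging back is certain
    else:                                         # step 3: merge
        rest = [m for m in range(len(z))
                if z[m] in (z[i], z[j]) and m not in (i, j)]
        for m in range(len(z)):
            if z[m] == z[j]:
                z_new[m] = z[i]
        log_q_fwd = 0.0
        log_q_rev = len(rest) * math.log(0.5)     # chance of the reverse split
    return z_new, log_q_fwd, log_q_rev


def split_merge_step(z, log_post, rng=random):
    """One Metropolis-Hastings split-merge update; returns the next state."""
    i, j = rng.sample(range(len(z)), 2)           # step 1: two distinct nodes
    z_new, log_q_fwd, log_q_rev = propose(z, i, j, rng)
    # Step 4: log of the acceptance probability, capped at log(1) = 0.
    log_a = min(0.0, log_q_rev - log_q_fwd + log_post(z_new) - log_post(z))
    if rng.random() < math.exp(log_a):
        return z_new
    return list(z)
```

Working on the log scale avoids underflow when the likelihood ratio is extreme.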

Gibbs Sampler + Split-Merge

We applied the Gibbs sampler together with the split-merge algorithm to the earlier network. For every nine full Gibbs sampling scans, one split-merge step was used.

The algorithm appropriately splits the data into six clusters, has a short burn-in time and mixes well.
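The interleaving schedule described here might look like the sketch below. The callables `gibbs_scan` and `split_merge_step` are hypothetical placeholders for one full Gibbs scan and one split-merge update; each is assumed to take the current assignment and return the next one.

```python
def run_sampler(z, gibbs_scan, split_merge_step, n_cycles, scans_per_cycle=9):
    """Alternate nine Gibbs scans with one split-merge step, n_cycles times.

    Returns the final assignment and a trace of the number of blocks
    after every update.
    """
    trace = []
    for _ in range(n_cycles):
        for _ in range(scans_per_cycle):
            z = gibbs_scan(z)
            trace.append(len(set(z)))
        z = split_merge_step(z)
        trace.append(len(set(z)))
    return z, trace
```

The returned trace is the quantity plotted in a trace plot of the number of blocks.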

[Figure: block structure obtained with the Gibbs sampler and split-merge]

[Figure: trace plot of the number of blocks (x-axis: iteration; y-axis: number of blocks)]

Future Work

- Assess the performance of the algorithms when the blocks vary significantly in size.
- Evaluate the complexities of the algorithms.
- Explore more advanced algorithms (such as the Restricted Gibbs Sampling Split-Merge).

References

Schmidt, M. N. and Mørup, M. (2013). Non-parametric Bayesian modeling of complex networks. IEEE Signal Processing Magazine, 30.

Jain, S. and Neal, R. M. (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13.
