1 Templates for scalable data analysis. 3: Distributed Latent Variable Models. Amr Ahmed, Alexander J. Smola, Markus Weimer. Yahoo! Research & UC Berkeley & ANU

2 Variations on a theme: inference for mixtures. Parallel inference: parallelization templates. Samplers: scaling up LDA.

3 Inference for Mixture Models

4 Clustering

5 Clustering

6 Clustering

7 Clustering (figure: example clusters such as airline, university, restaurant)

8 Generative Model (plate diagram): for each object i = 1..m, draw a cluster ID y_i from θ and an observation x_i from the cluster parameters μ_{y_i}, Σ_{y_i}; the cluster parameters are μ_j, Σ_j for j = 1..k.

9 Generative Model (same plate diagram), with joint distribution
$$p(X, Y \mid \theta, \mu, \Sigma) = \prod_{i=1}^{n} p(x_i \mid y_i, \mu, \Sigma)\, p(y_i \mid \theta)$$
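As a concrete illustration of this generative process, here is a small NumPy sketch that first draws each cluster ID y_i from θ and then the observation x_i from the corresponding Gaussian; the variable names and example parameters are illustrative only.

```python
import numpy as np

def sample_mixture(n, theta, mus, Sigmas, seed=0):
    """Draw n points from p(X, Y | theta, mu, Sigma): y_i ~ Discrete(theta), x_i ~ N(mu_{y_i}, Sigma_{y_i})."""
    rng = np.random.default_rng(seed)
    y = rng.choice(len(theta), size=n, p=theta)                            # cluster IDs
    x = np.stack([rng.multivariate_normal(mus[j], Sigmas[j]) for j in y])  # observations
    return x, y

# Example: three well-separated 2-D clusters
theta = [0.5, 0.3, 0.2]
mus = [np.zeros(2), np.array([5.0, 0.0]), np.array([0.0, 5.0])]
Sigmas = [np.eye(2)] * 3
X, Y = sample_mixture(1000, theta, mus, Sigmas)
```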

10 de Finetti: any distribution over exchangeable random variables can be written as a mixture over conditionally independent variables:
$$p(x_1, \ldots, x_n) = \int dp(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)$$
Inference should therefore be easy: infer θ given the x_i, and each x_i given θ.

11 Conjugates and Collapsing
Exponential family: $p(x \mid \theta) = \exp(\langle \phi(x), \theta \rangle - g(\theta))$
Conjugate prior: $p(\theta \mid \mu_0, m_0) = \exp(m_0 \langle \mu_0, \theta \rangle - m_0 g(\theta) - h(m_0 \mu_0, m_0))$
Posterior: $p(\theta \mid X, \mu_0, m_0) \propto \exp(\langle m_0 \mu_0 + m \mu[X], \theta \rangle - (m_0 + m) g(\theta) - h(m_0 \mu_0, m_0))$
Collapsing the natural parameter: $p(X \mid \mu_0, m_0) = \exp(h(m_0 \mu_0 + m \mu[X], m_0 + m) - h(m_0 \mu_0, m_0))$
Here $\mu[X]$ is the mean of the sufficient statistics of the data, $m$ the number of observations, and $h$ the log-normalizer of the prior.
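The collapsing identity is easiest to see for a concrete conjugate pair. The sketch below instantiates it for a Dirichlet prior with a categorical likelihood (a choice made purely for illustration; the slide's notation is the general exponential-family form), where h is the Dirichlet log-normalizer.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_dirichlet_categorical(counts, alpha):
    """Collapsed log p(X | alpha) = h(alpha + counts) - h(alpha),
    with h(a) = sum_k log Gamma(a_k) - log Gamma(sum_k a_k) the Dirichlet log-normalizer."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    h = lambda a: gammaln(a).sum() - gammaln(a.sum())
    return h(alpha + counts) - h(alpha)

# Example: three symbols observed with counts (4, 1, 0) under a symmetric prior
print(log_marginal_dirichlet_categorical([4, 1, 0], [0.5, 0.5, 0.5]))
```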

12 Conjugates and Collapsing (figure: the explicit graphical model m_0, μ_0 → θ → x_i next to its collapsed representation, the de Finetti view with θ integrated out)

13 Clustering & Topic Models
Clustering: α prior, θ cluster probability, y cluster label, x instance.
Latent Dirichlet Allocation: α prior, θ topic probability, y topic label, x instance.

14 Clustering & Topic Models (matrix view): documents ≈ membership × cluster/topic distributions. The membership matrix is a (0, 1) matrix for clustering, a stochastic matrix for a topic model, and an arbitrary matrix for LSI.

15 Clustering & Topic Models (matrix view, continued): estimate the cluster/topic distributions and sample or optimize the membership. Clustering: (0, 1) membership matrix; topic model: stochastic membership matrix; LSI: arbitrary matrices.
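To make the factorization concrete, here is a small NumPy sketch of the three variants of the membership matrix (dimensions and variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
D, W, T = 100, 5000, 20                         # documents, vocabulary size, clusters/topics
topics = rng.random((T, W))
topics /= topics.sum(axis=1, keepdims=True)     # each row is a distribution over words

# clustering: each row of the membership matrix is a one-hot (0, 1) indicator
membership_clustering = np.eye(T)[rng.integers(T, size=D)]
# topic model: each row is a distribution over topics (a stochastic matrix)
membership_lda = rng.dirichlet(np.ones(T), size=D)
# LSI: the membership matrix may be an arbitrary real matrix

docs_expected = membership_lda @ topics         # D x W: documents = membership x topic distributions
```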

16 V1 - Brute force maximization (Hal Daume; Joey Gonzalez)
Integrate out the latent parameters θ and ψ and maximize p(X, Y | α, β): a discrete maximization problem in Y.
Hard to implement. Overfits a lot (the mode is not a typical sample). Parallelization infeasible.

17 V2 - Brute force maximization (Hoffman, Blei, Bach; in VW)
Integrate out the latent topic assignments y and maximize p(X | θ, ψ, α, β): a continuous nonconvex optimization problem in θ and ψ, solved by stochastic gradient descent over documents.
Easy to implement. Does not overfit much. Great for small datasets. Parallelization difficult/impossible. Memory storage/access is O(T W), which breaks for large models: 1M words and 1000 topics need 4 GB. Roughly 1 MFlop per document per iteration.
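A back-of-the-envelope check of the memory figure quoted above, assuming 4-byte floating-point entries for the dense topic-word table (the byte width is an assumption, not stated on the slide):

```python
# Dense topic-word table: T topics x W word types
T, W, bytes_per_entry = 1_000, 1_000_000, 4
print(T * W * bytes_per_entry / 2**30, "GiB")   # about 3.7 GiB, i.e. roughly 4 GB
```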

18 V3 - Variational approximation (Blei, Ng, Jordan)
Approximate the intractable joint distribution by tractable factors:
$$\log p(x) \geq \log p(x) - D(q(y) \,\|\, p(y \mid x)) = \int dq(y)\,\big[\log p(x) + \log p(y \mid x) - \log q(y)\big] = \int dq(y)\, \log p(x, y) + H[q]$$
This yields an alternating convex optimization problem whose dominant cost is a matrix-matrix multiply.
Easy to implement. Great for small topic counts and vocabularies. Parallelization is easy (only aggregate statistics). Memory storage is O(T W), which breaks for large models. The model is not quite as good as with sampling.

19 V4 - Uncollapsed sampling
Sample y_ij | rest (can be done in parallel). Sample θ | rest and ψ | rest (can be done in parallel). Compatible with MapReduce (only aggregate statistics). Easy to implement. Children can be conditionally independent (for the right model). Memory storage is O(T W), which breaks for large models. Mixes slowly.

20 V5 - Collapsed sampling (Griffiths & Steyvers, 2005)
Integrate out the latent parameters θ and ψ of p(X, Y | α, β) and sample one topic assignment y_ij | X, Y^{-ij} at a time from
$$p(y_{ij} = t \mid X, Y^{-ij}) \propto \frac{n^{-ij}(t, d) + \alpha_t}{n^{-ij}(d) + \sum_t \alpha_t} \cdot \frac{n^{-ij}(t, w) + \beta_w}{n^{-ij}(t) + \sum_w \beta_w}$$
Fast mixing. Easy to implement. Memory efficient. Parallelization infeasible (the variables lock each other).
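A minimal single-machine sketch of this collapsed Gibbs sampler with scalar (symmetric) hyperparameters; the count-array names and defaults are illustrative rather than taken from the slides. The document-length denominator is constant in t and is therefore dropped.

```python
import numpy as np

def collapsed_gibbs_lda(docs, W, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of word-id lists. Returns topic assignments and count tables."""
    rng = np.random.default_rng(seed)
    n_td = np.zeros((T, len(docs)))                         # n(t, d): topic counts per document
    n_tw = np.zeros((T, W))                                 # n(t, w): topic counts per word type
    n_t = np.zeros(T)                                       # n(t): total count per topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]    # random initial assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_td[t, d] += 1; n_tw[t, w] += 1; n_t[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_td[t, d] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1    # remove the "-ij" assignment
                # conditional from the slide (document-length denominator omitted: constant in t)
                p = (n_td[:, d] + alpha) * (n_tw[:, w] + beta) / (n_t + W * beta)
                t = rng.choice(T, p=p / p.sum())
                n_td[t, d] += 1; n_tw[t, w] += 1; n_t[t] += 1
                z[d][i] = t
    return z, n_td, n_tw
```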

22 V6 - Approximating the distribution (Asuncion, Smyth, Welling, ... UCI; Mimno, McCallum, ... UMass)
Run a collapsed sampler per machine, using the same conditional
$$p(y_{ij} = t \mid X, Y^{-ij}) \propto \frac{n^{-ij}(t, d) + \alpha_t}{n^{-ij}(d) + \sum_t \alpha_t} \cdot \frac{n^{-ij}(t, w) + \beta_w}{n^{-ij}(t) + \sum_w \beta_w}$$
and defer synchronization between the machines: no problem for n(t), a big problem for n(t, w).
Easy to implement. Can be memory efficient. Easy parallelization. Mixes slowly / worse likelihood.

23 V7 - Better approximations of the distribution (Smola and Narayanamurthy, 2009; Ahmed, Gonzalez, et al., 2012)
Use the same collapsed sampler but make local copies of the state: implicit for multicore (delayed updates from the samplers), explicit copies for multi-machine. This is not a hierarchical model (cf. Welling, Asuncion, et al., 2008). Memory efficient (each sampler only needs to view its own sufficient statistics). Works multicore and multi-machine. Convergence speed depends on the quality of the synchronizer.

24 V8 - Sequential Monte Carlo (Canini, Shi, Griffiths, 2009; Ahmed et al., 2011)
Integrate out the latent θ and ψ and chain the conditional probabilities of p(X, Y | α, β):
$$p(X, Y \mid \alpha, \beta) = \prod_{i=1}^{m} p(x_i, y_i \mid x_1, y_1, \ldots, x_{i-1}, y_{i-1}, \alpha, \beta)$$
For each particle, sample $y_i \sim p(y_i \mid x_i, x_1, y_1, \ldots, x_{i-1}, y_{i-1}, \alpha, \beta)$, then reweight the particle by the next step's data likelihood $p(x_{i+1} \mid x_1, y_1, \ldots, x_i, y_i, \alpha, \beta)$, and resample particles if the weight distribution is too uneven.

25 V8 - Sequential Monte Carlo, continued (Canini, Shi, Griffiths, 2009; Ahmed et al., 2011)
One pass through the data, but the data is processed sequentially, so parallelization is an open problem. Nontrivial to implement: the sampler itself is easy, but the inheritance tree through the particles is messy, and the data likelihood (an integration over y) needs to be estimated, e.g. as part of the sampler. This is a multiplicative-update algorithm with log loss.
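A schematic sketch of the particle-filter loop described on the last two slides; `sample_assignment` and `data_likelihood` are hypothetical model-specific callbacks, and the effective-sample-size trigger is one common resampling rule rather than the one used by the cited systems.

```python
import numpy as np

def smc_pass(data, n_particles, sample_assignment, data_likelihood, ess_frac=0.5, seed=0):
    """One sequential pass over the data with importance-weighted particles."""
    rng = np.random.default_rng(seed)
    particles = [{"stats": {}} for _ in range(n_particles)]     # per-particle sufficient statistics
    w = np.full(n_particles, 1.0 / n_particles)                 # particle weights
    for i, x in enumerate(data):
        for p in particles:
            sample_assignment(p["stats"], x)                    # sample y_i given the particle's history
        if i + 1 < len(data):
            # reweight by the predictive likelihood of the next data item
            w *= np.array([data_likelihood(p["stats"], data[i + 1]) for p in particles])
            w /= w.sum()
        if 1.0 / np.sum(w ** 2) < ess_frac * n_particles:       # weights too uneven: resample
            idx = rng.choice(n_particles, size=n_particles, p=w)
            particles = [{"stats": dict(particles[j]["stats"])} for j in idx]
            w[:] = 1.0 / n_particles
    return particles, w
```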

26 Summary. Approaches compared: uncollapsed, variational approximation, collapsed natural parameters, collapsed topic assignments.
Optimization: overfits, too costly; easy parallelization, big memory footprint, overfits; too costly, easy to optimize, big memory footprint, difficult parallelization; fast mixing, difficult parallelization.
Sampling: slow mixing, conditionally independent; n.a.; approximate inference by delayed updates; sampling difficult; particle filtering, sequential.

27 Parallel Inference

28 3 Problems (figure: data and cluster IDs on one side; cluster means, variances, and cluster weights on the other)

29 3 Problems (figure: data, local state, global state)

30 3 Problems (figure: the data is too big for a single machine; the local state is huge but only needed locally)

31 3 Problems (figure: data, local state, and global state for vanilla LDA and for user profiling)

33 3 Problems: the local state is too large (it does not fit into memory); the global state is too large (network load & barriers; it does not fit into memory).

34 3 Problems: the local state is too large (it does not fit into memory), so stream local data from disk; the global state is too large (network load & barriers; it does not fit into memory).

35 3 Problems: the local state is too large (it does not fit into memory), so stream local data from disk; the global state is too large (network load & barriers; it does not fit into memory), so synchronize asynchronously.

36 3 Problems: the local state is too large (it does not fit into memory), so stream local data from disk; the global state is too large (network load & barriers; it does not fit into memory), so synchronize asynchronously and keep only a partial view.

37 Distribution (figure: global state, replicas, clusters, racks)

39 Synchronization
The child updates its local state. Start with a common state; the child stores the old and the new state, the parent keeps the global state, and differences are transmitted asynchronously. This requires an inverse element for the difference and an abelian group for commutativity (sum, log-sum, cyclic group, exponential families).
Local to global: $x_{\text{global}} \leftarrow x_{\text{global}} + (x - x_{\text{old}})$, then $x_{\text{old}} \leftarrow x$.
Global to local: $x \leftarrow x + (x_{\text{global}} - x_{\text{old}})$, then $x_{\text{old}} \leftarrow x_{\text{global}}$.
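A minimal sketch of this delta exchange, assuming the state is a NumPy array so that the abelian group operation is plain elementwise addition; `global_store` stands in for the parent's copy of the state.

```python
import numpy as np

class SyncClient:
    """Keeps (x, x_old) on the child and exchanges differences with the parent's copy."""
    def __init__(self, x0):
        self.x = x0.copy()          # local working copy
        self.x_old = x0.copy()      # value at the last synchronization

    def push(self, global_store):                   # local to global
        global_store += self.x - self.x_old         # x_global <- x_global + (x - x_old)
        self.x_old = self.x.copy()                  # x_old <- x

    def pull(self, global_store):                   # global to local
        self.x += global_store - self.x_old         # x <- x + (x_global - x_old)
        self.x_old = global_store.copy()            # x_old <- x_global

# Two children updating a shared count vector out of sync
global_store = np.zeros(4)
a, b = SyncClient(global_store), SyncClient(global_store)
a.x += np.array([1.0, 0, 0, 0]); a.push(global_store)
b.x += np.array([0, 2.0, 0, 0]); b.push(global_store); b.pull(global_store)
```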

40 Synchronization
Naive approach (dumb master): the global store is only a (key, value) store, so a local node needs to lock/read/write/unlock the master. This needs 4 TCP/IP round trips and is latency bound.
Better solution (smart master): the client sends a message to the master, it is queued, and the master incorporates it; the master sends a message to the client, it is queued, and the client incorporates it. This is bandwidth bound (more than 10x speedup in practice). The local-to-global and global-to-local updates are as on the previous slide.

41 Distribution
A dedicated server for the variables has insufficient bandwidth (hotspots) and insufficient memory. Instead, select the server for each variable, e.g. via consistent hashing: $m(x) = \operatorname{argmin}_{m \in M} h(x, m)$.
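One way to realize the argmin rule is rendezvous-style hashing, where every (key, server) pair gets a pseudo-random score; the hash function and key format below are assumptions for illustration.

```python
import hashlib

def assign_server(key, servers):
    """m(x) = argmin over m in M of h(x, m), with h a keyed hash of the (key, server) pair."""
    h = lambda k, m: int(hashlib.md5(f"{k}:{m}".encode()).hexdigest(), 16)
    return min(servers, key=lambda m: h(key, m))

print(assign_server("n(t, w=alice)", ["server-0", "server-1", "server-2"]))
```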

42 Distribution & fault tolerance
Storage is O(1/k) per machine and communication is O(1) per machine. Fast snapshots cost O(1/k) per machine (stop the synchronization and dump the state per vertex). Each machine keeps O(k) open connections and achieves O(1/k) throughput. Server selection as before: $m(x) = \operatorname{argmin}_{m \in M} h(x, m)$.

45 Synchronization
The data rate between machines is O(1/k) and the machines operate asynchronously (barrier free). Solution: schedule message pairs and communicate with r random machines simultaneously (figure: local and global state, r = 1).

48 Synchronization
The data rate between machines is O(1/k) and the machines operate asynchronously (barrier free). Solution: schedule message pairs and communicate with r random machines simultaneously (figure: efficiency below 0.89 for the r shown).

49 Synchronization
The data rate between machines is O(1/k) and the machines operate asynchronously (barrier free). Solution: schedule message pairs, communicate with r random machines simultaneously, and use a Luby-Rackoff PRPG for load balancing. Efficiency guarantee: 4 simultaneous connections are sufficient.

50 Scalability

51 Samplers

52 Sampling
Brute-force sampling over a large number of items is expensive. Ideally we want the work to scale with the entropy of the distribution over labels, but the sparsity of the distribution is typically only known after seeing the instances. Approach: decompose the (dense) probability into a dense invariant term and sparse variable terms, and use a fast proposal distribution with rejection sampling.
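A generic sketch of the proposal-plus-rejection idea; the target and proposal below are placeholders, and M must bound p/q for the sampler to be exact.

```python
import numpy as np

def rejection_sample_label(p_unnorm, q, M, rng):
    """Draw a label t with probability proportional to p_unnorm(t), using a cheap proposal q.
    Requires p_unnorm(t) <= M * q[t] for all t."""
    while True:
        t = rng.choice(len(q), p=q)                     # cheap draw from the proposal
        if rng.random() * M * q[t] < p_unnorm(t):       # accept with probability p(t) / (M q(t))
            return t

rng = np.random.default_rng(0)
q = np.array([0.7, 0.2, 0.1])                           # sparse/cheap proxy distribution
p = lambda t: [0.2, 0.5, 0.3][t]                        # unnormalized target
M = max(p(t) / q[t] for t in range(len(q)))
print(rejection_sample_label(p, q, M, rng))
```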

53 Exploiting Sparsity: decomposition (Mimno & McCallum, 2009)
Only the sparse terms need to be updated per word:
$$p(t \mid w_{ij}) \propto \underbrace{\frac{\alpha_t \beta_w}{n(t) + \bar\beta}}_{\text{dense but constant}} + \underbrace{\frac{\beta_w\, n(t, d = i)}{n(t) + \bar\beta}}_{\text{sparse}} + \underbrace{\frac{n(t, w = w_{ij})\,[n(t, d = i) + \alpha_t]}{n(t) + \bar\beta}}_{\text{sparse}}$$
where $\bar\beta = \sum_w \beta_w$. Does not work for clustering (too many factors).
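A sketch of how such a bucket decomposition is used to sample: the smoothing bucket is dense but changes only when n(t) changes (so it can be cached), while the document and word buckets touch only nonzero counts. Variable names and the scalar smoother are illustrative choices, not the exact implementation behind the slide.

```python
import numpy as np

def sample_topic_sparse(w, d, n_t, n_td, n_tw, alpha, beta, beta_sum, rng):
    """Sample t with p(t) proportional to a(t) + b(t) + c(t) as in the decomposition above.
    alpha: length-T array; beta: scalar smoother; beta_sum: sum of the smoothers."""
    denom = n_t + beta_sum
    a = alpha * beta / denom                    # smoothing bucket: dense, cacheable
    doc_topics = np.nonzero(n_td[:, d])[0]      # topics actually used in document d
    b = beta * n_td[doc_topics, d] / denom[doc_topics]
    word_topics = np.nonzero(n_tw[:, w])[0]     # topics in which word w actually occurs
    c = n_tw[word_topics, w] * (n_td[word_topics, d] + alpha[word_topics]) / denom[word_topics]
    u = rng.random() * (a.sum() + b.sum() + c.sum())
    if u < a.sum():                             # land in the smoothing bucket
        return int(np.searchsorted(np.cumsum(a), u))
    u -= a.sum()
    if u < b.sum():                             # land in the (sparse) document bucket
        return int(doc_topics[np.searchsorted(np.cumsum(b), u)])
    u -= b.sum()                                # otherwise the (sparse) word bucket
    j = min(np.searchsorted(np.cumsum(c), u), len(word_topics) - 1)
    return int(word_topics[j])
```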

54 Exploiting Sparsity: context LDA (Petterson et al., 2009)
The smoothers are word and topic dependent:
$$p(t \mid w_{ij}) \propto \frac{\beta(w, t)\, \alpha_t}{n(t) + \bar\beta(t)} + \frac{\beta(w, t)\, n(t, d = i)}{n(t) + \bar\beta(t)} + \frac{n(t, w = w_{ij})\,[n(t, d = i) + \alpha_t]}{n(t) + \bar\beta(t)}$$
The first term is now topic dependent and dense, so the simple sparse factorization does not work. Use Cauchy-Schwarz to upper-bound the first term:
$$\sum_t \frac{\beta(w, t)\, \alpha_t}{n(t) + \bar\beta(t)} \leq \|\beta(w, \cdot)\| \cdot \left\| \frac{\alpha}{n(\cdot) + \bar\beta(\cdot)} \right\|$$
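The bound is plain Cauchy-Schwarz applied to two T-dimensional vectors; a quick numerical sanity check with random placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta_w = rng.random(T)                       # beta(w, t) for one word w
alpha = rng.random(T)                        # alpha_t
denom = rng.random(T) + 1.0                  # n(t) + beta-bar(t)
lhs = np.sum(beta_w * alpha / denom)
rhs = np.linalg.norm(beta_w) * np.linalg.norm(alpha / denom)
assert lhs <= rhs                            # <u, v> <= ||u|| ||v||
```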

55 Collapsed vs Variational
Memory requirements (1k topics, 2M words): variational inference needs 8 GB RAM (no sparsity); the collapsed sampler needs 1.5 GB RAM (rare words). Burn-in plus exploiting sparsity saves a lot.
(figure: sampling speed-up over the standard sampler vs. iteration for 1000 topics, with curves for the unif, doc, and doc,word variants)
Cauchy-Schwarz bound, multilingual LDA, word context, smoothing over time.

56 Fast Proposal
In reality the sparsity assumption often does not hold for the real proposal, so guess a sparse proxy. In the storylines model these are the entities.

57 Variations on a theme: inference for mixtures. Parallel inference: parallelization templates. Samplers: scaling up LDA.
