LDA for Big Data - Outline

Mercy May
5 years ago

1 LDA FOR BIG DATA

2 LDA for Big Data - Outline
 - Quick review of LDA model: clustering words-in-context
 - Parallel LDA ~= IPM
 - Fast sampling tricks for LDA
   - Sparsified sampler
   - Alias table
   - Fenwick trees
 - LDA for text → LDA-like models for graphs

3 Recap: The LDA Topic Model

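4 Unsupervised NB vs LDA
 - Unsupervised NB: one class prior π; one Y per doc
 - LDA: a different class distribution θ_d for each doc; one Z per word
[Figure: plate diagrams for unsupervised naive Bayes and LDA, with α, π, θ_d, Y, Z_di, W_di, β, γ_k and plates N_d, D, K]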

5 LDA topics: top words w by Pr(w | Z=k)
[Figure: top words for topics Z=13, Z=22, Z=27, Z=19]

6 LDA's view of a document: mixed membership model

7 LDA and (Collapsed) Gibbs Sampling
Gibbs sampling works for any directed model!
 - Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
 - The sequence of samples forms a Markov chain
 - The stationary distribution of the chain is the joint distribution
Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.


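8 Recap: Collapsed Sampling for LDA
 - Only sample the Z's
 - Pr(Z_di = t | everything else) ∝ Pr(Z | E+) × Pr(E- | Z)
   - Pr(Z | E+): fraction of time Z=t in doc d
   - Pr(E- | Z): fraction of time W=w in topic t
 - This ignores a detail: the counts should not include the Z_di being sampled
[Figure: LDA plate diagram with α, θ_d, Z_di, W_di, β, γ_k and plates N_d, D, K]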
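The recap above can be made concrete with a small sketch. This is a minimal collapsed Gibbs sampler, assuming symmetric priors `alpha` and `beta` and documents given as lists of integer word ids; the function name and argument layout are illustrative and not taken from the slides. Note how the counts for the token being resampled are removed before the conditional is computed, which is the detail the slide mentions.

```python
import random

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    """docs: list of lists of word ids in [0, V). Returns (z, n_dt, n_wt, n_t)."""
    rng = random.Random(seed)
    n_dt = [[0] * K for _ in docs]       # topic counts per document
    n_wt = [[0] * K for _ in range(V)]   # topic counts per word type
    n_t = [0] * K                        # total words per topic
    z = []
    for d, doc in enumerate(docs):       # random initial assignment
        zd = []
        for w in doc:
            t = rng.randrange(K)
            zd.append(t)
            n_dt[d][t] += 1
            n_wt[w][t] += 1
            n_t[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Exclude the assignment being resampled from all counts.
                n_dt[d][t] -= 1
                n_wt[w][t] -= 1
                n_t[t] -= 1
                # Unnormalized conditional: (doc-topic term) * (topic-word term).
                weights = [
                    (n_dt[d][k] + alpha) * (n_wt[w][k] + beta) / (n_t[k] + beta * V)
                    for k in range(K)
                ]
                # Linear scan over all K topics: this is the O(K) inner loop.
                u = rng.random() * sum(weights)
                t, acc = 0, weights[0]
                while u >= acc and t < K - 1:
                    t += 1
                    acc += weights[t]
                z[d][i] = t
                n_dt[d][t] += 1
                n_wt[w][t] += 1
                n_t[t] += 1
    return z, n_dt, n_wt, n_t
```

Only the `z` assignments are sampled; θ and the topic-word distributions can be estimated afterwards from the smoothed counts.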

9 PARALLEL LDA

10 JMLR

11 Observation
 - How much does the choice of z depend on the other z's in the same document? Quite a lot.
 - How much does the choice of z depend on the other z's elsewhere in the corpus? Maybe not so much: it depends on Pr(w | t), but that changes slowly.
 - Can we parallelize Gibbs and still get good results?

12 Question
Can we parallelize Gibbs sampling?
 - Formally, no: every choice of z depends on all the other z's
 - Gibbs needs to be sequential, just like SGD

13 What if you try and parallelize?
 - Split the document/term matrix randomly and distribute it to p processors, then run Approximate Distributed LDA
 - This is iterative parameter mixing
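As a rough illustration of the parameter-mixing step, the sketch below assumes each processor starts a sweep from a copy of the global word-topic counts, samples its own shard of documents, and the copies are then reconciled by adding every processor's change to the old global table. The function name `merge_counts` is hypothetical, and the all-reduce communication itself is not shown.

```python
def merge_counts(global_counts, local_copies):
    """Reconcile per-processor count tables after one sweep.

    new_global = old_global + sum over processors of (local - old_global)
    """
    merged = [row[:] for row in global_counts]
    for local in local_copies:
        for w, row in enumerate(local):
            for t, count in enumerate(row):
                merged[w][t] += count - global_counts[w][t]
    return merged

# Two processors each moved one token to a different topic.
old = [[2, 1], [0, 3]]
proc_a = [[3, 0], [0, 3]]   # word 0: one token moved from topic 1 to topic 0
proc_b = [[2, 1], [1, 2]]   # word 1: one token moved from topic 1 to topic 0
new = merge_counts(old, [proc_a, proc_b])
```

During a sweep each processor sees stale counts for the other shards, which is why the result is approximate rather than exact Gibbs sampling.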

14 What if you try and parallelize?
[Table: All-Reduce cost]
Notation: D = #docs, W = #word types, K = #topics, N = #words in corpus


19 Later work
Algorithms:
 - Distributed variational EM
 - Asynchronous LDA (AS-LDA)
 - Approximate Distributed LDA (AD-LDA)
 - Ensemble versions of LDA: HLDA, DCM-LDA
Implementations:
 - GitHub Yahoo_LDA: not Hadoop; special-purpose communication code for synchronizing the global counts (Alex Smola, Yahoo → CMU)
 - Mahout LDA (Andy Schlaikjer, CMU → Twitter)

20 FAST SAMPLING FOR LDA

21 RECAP (more detail)
 - Time and space: linear in corpus size and #topics

22 RECAP
 - Each iteration: linear in corpus size
 - Resample: linear in #topics
 - Most of the time is resampling

23 RECAP
[Figure: unit-height line segment divided into regions z=1, z=2, z=3, with a random point on it]
1. You spend a lot of time sampling
2. There's a loop over all topics here in the sampler

24 KDD 09

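25 [Figure: unit-height line segment divided into regions z=1, z=2, z=3, with a random point on it]

26 [Figure: the unit-height segment is split into three segments of heights s, r and q, each divided into topic regions; normalizer z = s+r+q]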

27 Draw random U from uniform[0,1]
 - If U<s: look up U on the line segment with tic-marks at α_1 β/(βV + n_{·|1}), α_2 β/(βV + n_{·|2}), ...
 - Normalizer z = s+r+q

28 z=s+r+q
 - If U<s: look up U on the line segment with tic-marks at α_1 β/(βV + n_{·|1}), α_2 β/(βV + n_{·|2}), ...
 - If s<U<s+r: look up U on the line segment for r
   - Only need to check t such that n_{t|d} > 0

29 z=s+r+q
 - If U<s: look up U on the line segment with tic-marks at α_1 β/(βV + n_{·|1}), α_2 β/(βV + n_{·|2}), ...
 - If s<U<s+r: look up U on the line segment for r
 - If s+r<U: look up U on the line segment for q
   - Only need to check t such that n_{w|t} > 0

30 z=s+r+q
 - s: only need to check occasionally (< 10% of the time)
 - r: only need to check t such that n_{t|d} > 0
 - q: only need to check t such that n_{w|t} > 0
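The three-bucket lookup can be sketched as follows. The sketch assumes the usual decomposition (α_t + n_{t|d})(β + n_{w|t})/(βV + n_{·|t}) = s_t + r_t + q_t with s_t = α_t β/(βV + n_{·|t}), r_t = n_{t|d} β/(βV + n_{·|t}) and q_t = (α_t + n_{t|d}) n_{w|t}/(βV + n_{·|t}); only the s term is spelled out on the slides, so the r and q formulas here are an assumption. The function names are illustrative, and the sparse counts are passed as dicts holding nonzero entries only.

```python
def buckets(alpha, beta, V, n_t, n_td, n_wt):
    """Per-topic masses of the s, r and q buckets.

    alpha: list of per-topic priors; n_t: total words per topic;
    n_td: {topic: count in current doc}, nonzero entries only;
    n_wt: {topic: count for current word}, nonzero entries only.
    """
    denom = [beta * V + n for n in n_t]
    s_terms = [a * beta / dn for a, dn in zip(alpha, denom)]
    r_terms = {t: c * beta / denom[t] for t, c in n_td.items()}
    q_terms = {t: (alpha[t] + n_td.get(t, 0)) * c / denom[t] for t, c in n_wt.items()}
    return s_terms, r_terms, q_terms

def sparse_lda_draw(alpha, beta, V, n_t, n_td, n_wt, u01):
    """Pick a topic from a uniform draw u01 in [0, 1) using the s/r/q buckets."""
    s_terms, r_terms, q_terms = buckets(alpha, beta, V, n_t, n_td, n_wt)
    s, r, q = sum(s_terms), sum(r_terms.values()), sum(q_terms.values())
    u = u01 * (s + r + q)          # rescale U by the normalizer s+r+q
    if u < s:                      # dense bucket: all topics
        items = list(enumerate(s_terms))
    elif u < s + r:                # only topics with n_td > 0
        u -= s
        items = list(r_terms.items())
    else:                          # only topics with n_wt > 0
        u -= s + r
        items = list(q_terms.items())
    acc = 0.0
    for t, mass in items:
        acc += mass
        if u < acc:
            return t
    # Guard against floating-point round-off at the very end of a bucket.
    return items[-1][0] if items else len(alpha) - 1
```

For clarity this sketch recomputes all three buckets on every call; the speedup in practice comes from caching s and r and updating them incrementally, so that most draws touch only the few topics with nonzero counts.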

31 z=s+r+q
 - s: only need to store (and maintain) total words per topic and the α's, β, V
 - r: only need to store n_{t|d} for the current d. Trick: count up n_{t|d} for d when you start working on d, and update incrementally
 - q: need to store n_{w|t} for each word, topic pair???

32 z=s+r+q
1. Precompute, for each t, [formula missing in source]
2. Quickly find t's such that n_{w|t} is large for w
Most (>90%) of the time and space is here
Need to store n_{w|t} for each word, topic pair???


33
1. Precompute, for each t, [formula missing in source]
2. Quickly find t's such that n_{w|t} is large for w
   - associate each w with an int array
     - no larger than frequency of w
     - no larger than #topics
   - encode (t,n) as a bit vector
     - n in the high-order bits
     - t in the low-order bits
   - keep ints sorted in descending order
Most (>90%) of the time and space is here
Need to store n_{w|t} for each word, topic pair???
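A small sketch of the packed (t,n) encoding may help. It assumes 10 low-order bits for the topic id (enough for 1024 topics); that width and the helper names are illustrative choices, not values from the slides. Because the count sits in the high-order bits, sorting the packed ints in descending order also sorts topics by count.

```python
TOPIC_BITS = 10                      # assumed width: supports up to 1024 topics
TOPIC_MASK = (1 << TOPIC_BITS) - 1

def encode(t, n):
    """Pack count n into the high-order bits and topic t into the low-order bits."""
    return (n << TOPIC_BITS) | t

def decode(x):
    """Return (topic, count) from a packed int."""
    return x & TOPIC_MASK, x >> TOPIC_BITS

def update(arr, t, delta):
    """Add delta to topic t's count in a descending-sorted packed array."""
    for i, x in enumerate(arr):
        if x & TOPIC_MASK == t:
            n = (x >> TOPIC_BITS) + delta
            if n > 0:
                arr[i] = encode(t, n)
            else:
                del arr[i]           # drop topics whose count reaches zero
            break
    else:
        if delta > 0:
            arr.append(encode(t, delta))
    arr.sort(reverse=True)           # largest counts first

word_topics = []
update(word_topics, 3, 2)
update(word_topics, 7, 5)
update(word_topics, 3, 1)
pairs = [decode(x) for x in word_topics]
```

Scanning the array from the front visits the topics with the largest n_{w|t} first, so a lookup in the q bucket usually stops after a few entries. A production version would likely move the changed entry in place rather than re-sorting the whole array.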


35 Other Fast Samplers for LDA

36 Alias tables
Basic problem: how can we sample from a biased coin quickly? The simple approach is O(K).
If the distribution changes slowly, maybe we can do some preprocessing and then sample multiple times.
Proof of concept: generate r ~ uniform and use a binary tree, e.g. r in (23/40, 7/10]: O(log2 K).

37 Alias tables
Basic problem: how can we sample from a biased die quickly? The simple approach is O(K).

38 Alias tables
Another idea: simulate the dart with two drawn values:
 - rx → int(u1*k)
 - ry → u2*p_max
Keep throwing till you hit a stripe.
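The dart-throwing idea is ordinary rejection sampling, and a minimal sketch looks like this. The function name and the use of `random.Random` are illustrative assumptions; `probs` is a list of k outcome probabilities and p_max is its largest entry.

```python
import random

def dart_sample(probs, rng):
    """Throw darts at a k-by-p_max rectangle until one lands inside a stripe."""
    k = len(probs)
    p_max = max(probs)
    while True:
        x = int(rng.random() * k)      # column: which outcome the dart is over
        y = rng.random() * p_max       # height of the dart within the rectangle
        if y < probs[x]:               # hit: the dart is inside column x's stripe
            return x
        # miss: throw again

rng = random.Random(0)
probs = [0.5, 0.3, 0.2]
draws = [dart_sample(probs, rng) for _ in range(20000)]
freqs = [draws.count(i) / len(draws) for i in range(len(probs))]
```

The expected number of throws per sample is k * p_max, so this is fast only when the distribution is close to uniform; the wasted area is what the alias table removes.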


39 Alias tables
An even more clever idea: minimize the brown space (where the dart "misses") by sizing the rectangle's height to the average probability, not the maximum probability, and cutting and pasting a bit.
You can always do this using only two colors in each column of the final alias table, and the dart never misses! (mathematically speaking)
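One common way to do the cutting and pasting is the small/large worklist construction often attributed to Walker and Vose; the slides do not say which construction they have in mind, so treat this as one possible sketch. Each column keeps its own outcome with probability `prob[i]` and otherwise returns `alias[i]`, which gives the two colors per column.

```python
import random

def build_alias(probs):
    """O(K) preprocessing: returns (prob, alias) tables of length K."""
    k = len(probs)
    scaled = [p * k for p in probs]          # rescale so the average height is 1
    prob = [1.0] * k
    alias = list(range(k))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s_idx = small.pop()
        l_idx = large.pop()
        prob[s_idx] = scaled[s_idx]          # column keeps its own mass...
        alias[s_idx] = l_idx                 # ...and is topped up from a large one
        scaled[l_idx] -= 1.0 - scaled[s_idx]
        (small if scaled[l_idx] < 1.0 else large).append(l_idx)
    return prob, alias                       # leftover columns keep prob 1.0

def alias_draw(prob, alias, rng):
    """O(1) sampling: pick a column, then one of its two colors."""
    i = int(rng.random() * len(prob))
    return i if rng.random() < prob[i] else alias[i]

probs = [0.5, 0.3, 0.1, 0.1]
prob, alias = build_alias(probs)
sample = alias_draw(prob, alias, random.Random(0))
```

Building the table costs O(K), so it pays off only when many draws are taken from the same, or a slowly changing, distribution.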

40 LDA with Alias Sampling [KDD 2014]
 - Sample Z's with an alias sampler
 - Don't update the sampler with each flip: correct for staleness with the Metropolis-Hastings algorithm
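The staleness correction can be illustrated with a generic independence Metropolis-Hastings step: propose from the stale distribution q, then accept with probability min(1, p(t')q(t) / (p(t)q(t'))), where p is the current (possibly unnormalized) target. This is a sketch of the general technique, not the exact update from the KDD 2014 paper; the names are hypothetical, and a plain weighted draw stands in for the alias-table proposal.

```python
import random

def mh_step(current, propose, q, p, rng):
    """One independence Metropolis-Hastings step with a stale proposal q."""
    cand = propose()
    ratio = (p(cand) * q(current)) / (p(current) * q(cand))
    return cand if rng.random() < min(1.0, ratio) else current

rng = random.Random(1)
stale = [0.3, 0.3, 0.4]          # distribution the sampler was built from
fresh = [1.0, 1.0, 2.0]          # current unnormalized target: [0.25, 0.25, 0.5]
topics = range(len(stale))

state, counts = 0, [0, 0, 0]
for _ in range(50000):
    state = mh_step(
        state,
        lambda: rng.choices(topics, weights=stale)[0],  # stand-in for an alias draw
        lambda t: stale[t],
        lambda t: fresh[t],
        rng,
    )
    counts[state] += 1
freqs = [c / sum(counts) for c in counts]
```

The closer the stale proposal is to the current target, the higher the acceptance rate, which is why the table only needs to be rebuilt occasionally.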


42 Yet More Fast Samplers for LDA

43 WWW

44 Fenwick Tree (1994)
Basic problem: how can we sample from a biased die quickly, and update quickly? The simple approach is O(K).
Maybe we can use a binary tree, e.g. r in (23/40, 7/10]: O(log2 K).

45 Data structures and algorithms
 - LSearch: linear search

46 Data structures and algorithms
 - BSearch: binary search; store cumulative probability
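The first two options can be compared side by side in a short sketch. The function names follow the slide labels but are otherwise illustrative; `u` is a uniform draw in [0, total mass).

```python
import bisect
from itertools import accumulate

def lsearch(probs, u):
    """Linear search: O(K) per draw, no preprocessing."""
    acc = 0.0
    for k, p in enumerate(probs):
        acc += p
        if u < acc:
            return k
    return len(probs) - 1              # guard against round-off

def build_cdf(probs):
    """Store cumulative probability: O(K) preprocessing."""
    return list(accumulate(probs))

def bsearch(cdf, u):
    """Binary search on the stored cumulative sums: O(log K) per draw."""
    return min(bisect.bisect_right(cdf, u), len(cdf) - 1)

probs = [0.25, 0.25, 0.5]
cdf = build_cdf(probs)
picks = [(lsearch(probs, u), bsearch(cdf, u)) for u in (0.0, 0.1, 0.25, 0.49, 0.5, 0.99)]
```

Binary search is faster per draw, but changing a single probability invalidates every cumulative sum after it, so an update costs O(K).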

47 Data structures and algorithms
 - Alias sampling

48 Data structures and algorithms
 - Fenwick tree

49 Data structures and algorithms
Sampler is:
 - Fenwick tree for βq: dense, changes slowly, re-used for each word in a document
 - Binary search for r: sparse, a different one is needed for each unique term in the doc
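A Fenwick tree supports both operations the dense part needs: changing one weight and finding the outcome that a uniform draw lands on, each in O(log K). The sketch below is the standard binary indexed tree layout, which may differ in detail from the tree variant used in the paper behind these slides; the class and method names are illustrative.

```python
class Fenwick:
    """Binary indexed tree over K non-negative weights."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)     # 1-based internal indexing
        for i, w in enumerate(weights):
            self.add(i, w)

    def add(self, i, delta):
        """Add delta to weight i in O(log K)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def total(self):
        """Sum of all weights in O(log K)."""
        i, s = self.n, 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def find(self, u):
        """Smallest index whose cumulative weight exceeds u, for 0 <= u < total()."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= u:
                pos = nxt                    # skip this whole block of weights
                u -= self.tree[nxt]
            step >>= 1
        return min(pos, self.n - 1)

tree = Fenwick([1, 2, 3, 4])
before = [tree.find(u) for u in (0.5, 1.0, 2.9, 3.0, 6.0, 9.99)]
tree.add(0, 5)                               # weights are now [6, 2, 3, 4]
after = [tree.find(u) for u in (5.5, 6.0)]
```

To draw a sample, pick u uniformly in [0, tree.total()) and call `tree.find(u)`.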

50 [Figure: Speedup vs std LDA sampler (1024 topics)]

51 [Figure: Speedup vs std LDA sampler (10k-50k topics)]

52 And Parallelism

53 Second idea: you can sample document-by-document or word-by-word, or use a MF-like approach to distributing the data.


55 Multi-core NOMAD method

56 LDA-LIKE MODELS FOR GRAPHS

57 Network Datasets
 - UBMCBlog
 - AGBlog
 - MSPBlog
 - Cora
 - Citeseer

58 Motivation
Social graphs seem to have
 - some aspects of randomness: small diameter, giant connected components, ...
 - some structure: homophily, scale-free degree dist?
How do you model it?

59 More terms
Stochastic block model, aka block-stochastic matrix:
 - Draw n_i nodes in block i
 - With probability p_ij, connect pairs (u,v) where u is in block i, v is in block j
 - Special, simple case: p_ii = q_i, and p_ij = s for all i ≠ j
Question: can you fit this model to a graph?
 - Find each p_ij and the latent node → block mapping
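The generative side of the model is short enough to write out. This sketch assumes an undirected graph without self-loops and takes the block sizes and the matrix p as inputs; the function name is illustrative. Fitting the model, that is, recovering p and the node → block mapping from the edges, is the hard direction and is not shown.

```python
import random

def sample_sbm(block_sizes, p, rng):
    """Draw a graph from a stochastic block model.

    block_sizes[i] = n_i nodes in block i; p[i][j] = edge probability between
    a node in block i and a node in block j (assumed symmetric).
    """
    blocks = [b for b, n in enumerate(block_sizes) for _ in range(n)]
    edges = []
    for u in range(len(blocks)):
        for v in range(u + 1, len(blocks)):
            if rng.random() < p[blocks[u]][blocks[v]]:
                edges.append((u, v))
    return blocks, edges

# Special, simple case: p_ii = q_i on the diagonal and a single value s elsewhere.
q, s = [1.0, 1.0], 0.0
p = [[q[i] if i == j else s for j in range(2)] for i in range(2)]
blocks, edges = sample_sbm([3, 2], p, random.Random(0))
```

With q_i = 1 and s = 0 the result is a set of disjoint cliques, one per block; intermediate values give noisy community structure.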

60 Not? [Figure: football network]

61 Not? [Figure: books network]

62 Stochastic Block models: assume 1) nodes within a block z and 2) edges between blocks z_p, z_q are exchangeable
[Figure: plate diagram with a, b, z_p, z_q, p, a_pq and plates N, N^2]

63 Another mixed membership block model

64 Another mixed membership block model
 - z = (z_i, z_j) is a pair of block ids
 - n_z = #pairs z
 - q_{z1,i} = #links to i from block z1
 - q_{z1,.} = #outlinks in block z1
 - δ = indicator for diagonal
 - M = #nodes

65 Experiments Balasubramanyan, Lin, Cohen, NIPS w/s
