PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks


1 PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks. Pramod Srinivasan, CS591txt - Text Mining Seminar, University of Illinois, Urbana-Champaign. April 8, 2016.

2 Outline: Motivation; Introduction (The Data Sparseness Problem); Related Work (Overview, Information Network Embedding); Predictive Text Embedding (The Text Representation, Heterogeneous Text Network Embedding, Bipartite Network Embedding, Text Embedding); Experiments (Statistics of the Dataset, Results, Results Analysis); Conclusion.

3-7 Motivation: learning a meaningful and effective representation of text is a critical precursor to several machine learning tasks. PTE's goals: capture semantic relatedness between words, and avoid issues such as data sparsity, polysemy, and synonymy.

9 Solving the Problem of Data Sparseness: Text Embedding. Represent words and documents in a low-dimensional space, so that words and documents with similar meanings are embedded close to each other.

11-15 Unsupervised Text Embedding: CBOW (Mikolov et al. 2013), Skip-Gram (Mikolov et al. 2013), Paragraph Vector (Le and Mikolov 2014). Pros: simple yet scalable models; insensitive to parameter choices; can leverage a large amount of unlabeled data, and the embeddings are general across tasks. Cons: fully unsupervised, not tuned for specific tasks.

16-17 Supervised Learning Models: (deep) neural networks such as Recurrent Neural Networks (Mikolov et al. 2010), Recursive Neural Networks (Socher et al. 2012), and Convolutional Neural Networks (Kim 2014). Pros: state-of-the-art performance on specific tasks. Cons: computationally expensive; require a large amount of labeled data and are hard to apply to unlabeled data; very sensitive to parameters and difficult to tune.

18-19 Information Network Embedding: embedding one instance of a mathematical structure within another instance. Words that are used together with many similar words are likely to have similar meanings.

20-23 The Ideal Embedding Model: a reduced parameter space; improved generalization; the ability to transfer or share knowledge between entities.

24-29 LINE: Intuition. (A sequence of figure slides illustrating the intuition behind LINE; the figures are not transcribed.)

30-34 Large-scale Information Network Embedding (LINE): extends the embedding idea to general information networks; more specifically, it maps the vertices of a graph to vectors. LINE preserves the first-order and second-order proximity between vertices. First-order proximity: the observed link between two vertices. Second-order proximity: the similarity between their neighborhood structures (see the toy sketch below).
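To make the two proximities concrete, here is a toy illustration (mine, not part of LINE itself): first-order proximity is just the observed edge weight, while a simple assumed proxy for second-order proximity is the cosine similarity between two vertices' rows of the adjacency matrix, i.e., between their neighborhoods.

```python
import numpy as np

# Toy graph: first-order proximity is the observed edge weight itself; as a
# simple stand-in for second-order proximity we compare neighborhood vectors
# (rows of the adjacency matrix) by cosine similarity.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

def second_order(a, b):
    na, nb = A[a], A[b]
    return na @ nb / (np.linalg.norm(na) * np.linalg.norm(nb))

print("first-order(0, 1):", A[0, 1])              # direct link -> 1.0
print("second-order(0, 1):", second_order(0, 1))  # shared neighbors -> 0.5
```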

36-37 Predictive Text Embedding (PTE): keep the advantages of unsupervised text embedding approaches while naturally utilizing the labeled data for specific tasks. How can unsupervised and supervised information be represented uniformly? With a heterogeneous text network capturing different levels of word co-occurrence: a word-word network, a word-document network, and a word-label network.

38-39 Converting Text Corpora; Partially Labeled Text Corpora. (Figure slides; the diagrams converting a partially labeled corpus into networks are not transcribed.)

40-46 Word-Word and Word-Document Networks: both encode unsupervised information. The word-document network captures the document-level co-occurrences exploited by topic models such as LDA; the word-word network captures the local context-level co-occurrences exploited by Skip-Gram. The weight between a word and a document is the word's frequency in that document; the weight between two words is the number of times they co-occur within a given window size.

47-50 Word-Label Network: whereas the word-word and word-document networks encode unsupervised information, the word-label network encodes the supervised information.

51-54 Heterogeneous Text Network: the union of three bipartite networks, the word-word (word-context), word-document, and word-label networks. It encodes different levels of word co-occurrence and contains both supervised and unsupervised information. By embedding the heterogeneous text network we obtain robust word embeddings optimized for a specific task (a construction sketch follows).
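As a concrete sketch of how such a network could be built (an illustration under assumptions, not the authors' code), the weight definitions above translate directly into three counters, one per bipartite network; for word-label weights we use the total count of a word across documents carrying the label, following the convention of the PTE paper.

```python
from collections import Counter

# Build the three bipartite networks of the heterogeneous text network from a
# tiny, partially labeled toy corpus. Weights follow the definitions above:
# word-word = co-occurrences within a window, word-document = term frequency,
# word-label = word count summed over documents with that label (assumed).
def build_networks(docs, labels, window=2):
    ww, wd, wl = Counter(), Counter(), Counter()
    for d, tokens in enumerate(docs):
        for i, w in enumerate(tokens):
            wd[(w, d)] += 1                           # word-document frequency
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                ww[(w, tokens[j])] += 1               # word-word within window
                ww[(tokens[j], w)] += 1               # keep edges symmetric
            if labels[d] is not None:                 # None = unlabeled document
                wl[(w, labels[d])] += 1               # supervised word-label edge
    return ww, wd, wl

docs = [["deep", "learning", "for", "text"], ["text", "embedding", "models"]]
ww, wd, wl = build_networks(docs, labels=["ml", None])
print(ww[("deep", "learning")], wd[("text", 1)], wl[("text", "ml")])  # 1 1 1
```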

55-57 Bipartite Network Embedding: what is the ideal proximity measure? Hint: it's either first-order or second-order. (PTE adopts second-order proximity, defined next.)

58 Bipartite Network Embedding: Second-Order Proximity. For each edge (v_i, v_j), define the conditional probability

$$p_2(v_j \mid v_i) = \frac{\exp(\vec{u}_j^{\top} \vec{u}_i)}{\sum_{k=1}^{|V|} \exp(\vec{u}_k^{\top} \vec{u}_i)}, \qquad (1)$$

and minimize the weighted negative log-likelihood

$$O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i). \qquad (2)$$
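A minimal numpy sketch of Eqs. (1)-(2), with toy sizes and random data (all names and dimensions here are illustrative):

```python
import numpy as np

# Second-order proximity on a bipartite network: U holds the embeddings of the
# conditioning side, V the embeddings of the side being predicted; `edges`
# lists (i, j, w_ij) triples.
rng = np.random.default_rng(0)
U = rng.normal(size=(4, 8))        # e.g. 4 source vertices, 8-dim embeddings
V = rng.normal(size=(3, 8))        # e.g. 3 target vertices
edges = [(0, 1, 2.0), (2, 0, 1.0), (3, 2, 5.0)]

def p2(j, i):
    """p_2(v_j | v_i): softmax over all target vertices, Eq. (1)."""
    scores = V @ U[i]
    scores -= scores.max()                       # numerical stability
    return np.exp(scores[j]) / np.exp(scores).sum()

O2 = -sum(w * np.log(p2(j, i)) for i, j, w in edges)   # Eq. (2)
print(O2)
```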

59-61 Heterogeneous Text Network Embedding: jointly embed the three bipartite networks by minimizing the objective function

$$O_{pte} = O_{ww} + O_{wd} + O_{wl}, \qquad (3)$$

where

$$O_{ww} = -\sum_{(i,j) \in E_{ww}} w_{ij} \log p(v_i \mid v_j), \qquad (4)$$

$$O_{wd} = -\sum_{(i,j) \in E_{wd}} w_{ij} \log p(v_i \mid d_j), \qquad (5)$$

$$O_{wl} = -\sum_{(i,j) \in E_{wl}} w_{ij} \log p(v_i \mid l_j). \qquad (6)$$
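In code, Eqs. (3)-(6) amount to summing the same bipartite objective three times, with the word embeddings shared across all three terms; that sharing is what couples the networks. A self-contained sketch with toy placeholder data:

```python
import numpy as np

# O_pte = O_ww + O_wd + O_wl over three bipartite networks that share the
# word embeddings W; D and L are document and label embeddings. All sizes,
# embeddings and edge lists below are toy placeholders.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))        # word embeddings (shared by all terms)
D = rng.normal(size=(2, 8))        # document embeddings
L = rng.normal(size=(3, 8))        # label embeddings

def bipartite_objective(word_emb, ctx_emb, edges):
    """-sum_(i,j) w_ij log p(v_i | c_j), softmax over all words: Eqs. (4)-(6)."""
    total = 0.0
    for i, j, w in edges:          # i indexes a word, j a context vertex
        scores = word_emb @ ctx_emb[j]
        scores -= scores.max()     # numerical stability
        total -= w * (scores[i] - np.log(np.exp(scores).sum()))
    return total

O_pte = (bipartite_objective(W, W, [(0, 1, 2.0), (1, 3, 1.0)])   # O_ww
         + bipartite_objective(W, D, [(2, 0, 3.0)])              # O_wd
         + bipartite_objective(W, L, [(4, 1, 1.0)]))             # O_wl
print(O_pte)
```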

62 Optimization: two ways, depending on when the labeled data (the word-label network) is used. Joint training: train on the unlabeled and the labeled data simultaneously. Pre-training + fine-tuning: first jointly train on the G_ww and G_wd networks, then fine-tune the word embeddings with the word-label network.
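The two schedules differ only in which edge lists feed the sampler and when. Below is an illustrative sketch (assumptions throughout, not the authors' implementation): each step samples one edge from one network and applies a sigmoid update with a few negative samples, a simplification of the negative-sampling update PTE actually uses; the real algorithm also samples edges proportionally to their weights rather than uniformly.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

W = rng.normal(0, 0.1, size=(5, 8))    # word embeddings (shared)
D = rng.normal(0, 0.1, size=(2, 8))    # document embeddings
L = rng.normal(0, 0.1, size=(3, 8))    # label embeddings
E_ww = [(0, 1, 2.0), (1, 3, 1.0)]      # toy edge lists: (word, ctx, weight)
E_wd = [(2, 0, 3.0)]
E_wl = [(4, 1, 1.0)]

def step(word_emb, ctx_emb, edges, lr=0.025, k=2):
    """One SGD step on a sampled edge, with k uniform negative samples."""
    i, j, _ = edges[rng.integers(len(edges))]
    pairs = [(i, 1.0)] + [(rng.integers(len(word_emb)), 0.0) for _ in range(k)]
    for word, target in pairs:
        g = lr * (sigmoid(word_emb[word] @ ctx_emb[j]) - target)
        w_old = word_emb[word].copy()  # keep pre-update value for the ctx step
        word_emb[word] -= g * ctx_emb[j]
        ctx_emb[j] -= g * w_old

# Joint training: alternate among all three networks from the start.
for t in range(3000):
    step(*[(W, W, E_ww), (W, D, E_wd), (W, L, E_wl)][t % 3])
# Pre-training + fine-tuning would instead loop over (G_ww, G_wd) first
# and only afterwards take steps on the word-label network G_wl.
```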

63 Learning Word Representations: the heterogeneous text network embedding yields robust word embeddings optimized for specific tasks, capturing different levels of word co-occurrence and encoding both supervised and unsupervised data. Given an arbitrary text piece d = w_1 w_2 ... w_n, with embedding u_i for each word w_i, the vector representation of d is the average

$$\vec{d} = \frac{1}{n} \sum_{i=1}^{n} \vec{u}_i. \qquad (7)$$
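Eq. (7) in code, as a minimal sketch (the lookup table is a toy stand-in for learned PTE embeddings; skipping out-of-vocabulary tokens is one common choice, assumed here):

```python
import numpy as np

# The embedding of a text piece is the average of its word embeddings, Eq. (7).
word_vecs = {"predictive": np.array([0.1, 0.9]),
             "text":       np.array([0.4, 0.2]),
             "embedding":  np.array([0.7, 0.5])}

def embed_text(tokens, dim=2):
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]  # skip OOV tokens
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(embed_text(["predictive", "text", "embedding", "rocks"]))
```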

65-68 Evaluating the Effectiveness of PTE. Task: text classification, with the learned embeddings as features and logistic regression as the classifier. Compared algorithms: BOW, the classical bag-of-words representation; unsupervised text embeddings, namely Skip-Gram, Paragraph Vector (PV), and LINE applied to text networks; and predictive text embeddings that incorporate labels, i.e., supervised learning, namely CNN and PTE. A sketch of the pipeline follows.
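The evaluation protocol is straightforward to reproduce in outline: document embeddings become fixed feature vectors for a logistic-regression classifier, scored with micro- and macro-F1. A sketch using random stand-ins for the averaged embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Random placeholders for averaged document embeddings and binary labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 100)), rng.integers(0, 2, size=200)
X_test, y_test = rng.normal(size=(50, 100)), rng.integers(0, 2, size=50)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
print("Micro-F1:", f1_score(y_test, pred, average="micro"))
print("Macro-F1:", f1_score(y_test, pred, average="macro"))
```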

69 Datasets. Long documents: 20NG, Wiki, IMDB, and the RCV1 subsets Corporate, Economics, Government, and Market. Short documents: DBLP, MR, and Twitter.

         20NG    Wiki        IMDB    Corporate  Economics  Government  Market   DBLP    MR      Twitter
Train    11,314  1,911,617*  25,000  245,650    77,242     138,990     132,040  61,479  7,108   800,000
Test     7,532   21,000      25,000  122,827    38,623     69,496      66,020   20,000  3,554   400,000
|V|      89,039  913,881     71,381  141,740    65,254     139,960     64,049   22,270  17,376  405,994

*In the Wiki data set, only 42,000 documents are labeled. (The average document lengths and numbers of classes were lost in transcription.)

70-72 Results of Text Classification on Long Documents: Unsupervised. The table reports Micro-F1 and Macro-F1 on 20NG, Wikipedia, and IMDB for BOW, Skip-gram, PVDBOW, PVDM, LINE(G_ww), LINE(G_wd), and LINE(G_ww + G_wd); the numeric scores were lost in transcription. Key observations: for local context-level word co-occurrences, LINE(G_ww) > Skip-gram; for document-level word co-occurrences, LINE(G_wd) > PV.

73-77 Results on Long Documents: Predictive. The table reports Micro-F1 and Macro-F1 on 20NG, Wikipedia, and IMDB for BOW, CNN, CNN(pretrain), PTE(G_wl), PTE(G_ww + G_wl), PTE(G_wd + G_wl), PTE(pretrain), and PTE(joint); the numeric scores were lost in transcription. Key observations: PTE(joint) > PTE(pretrain); PTE(joint) > PTE(G_wl); PTE(joint) > CNN/CNN(pretrain).

78-80 Results on Short Documents: Unsupervised. The table reports Micro-F1 and Macro-F1 on DBLP, MR, and Twitter for the same unsupervised methods; the numeric scores were lost in transcription. Key observations: for local context-level word co-occurrences, LINE(G_ww) > Skip-gram; for document-level word co-occurrences, LINE(G_wd) > PV; and LINE(G_ww + G_wd) > LINE(G_ww) > LINE(G_wd).

81-86 Results on Short Documents: Predictive. The table reports Micro-F1 and Macro-F1 on DBLP, MR, and Twitter for the same predictive methods; the numeric scores were lost in transcription. Key observations: PTE(joint) > PTE(pretrain); PTE(joint) > PTE(G_wl); PTE(joint) > CNN/CNN(pretrain).

88-90 Analysis: Unsupervised Embedding. Long documents: document-level word co-occurrences are more useful than local context-level word co-occurrences, and no improvement is observed when the two are combined. Short documents: local context-level word co-occurrences are more useful than document-level word co-occurrences, and combining the two further improves the embedding.

91-92 Summary: Predictive Text Embedding adapts the advantages of unsupervised text embedding approaches while naturally incorporating labeled data, encoding unsupervised and supervised information through a large-scale heterogeneous text network. It outperforms or is comparable to sophisticated methods such as CNN: it outperforms CNN on long documents and is comparable to CNN on short documents.

93 Takeaway: given a large collection of text with both unlabeled and labeled information, PTE learns low-dimensional representations of words by embedding the heterogeneous text network constructed from the collection into a low-dimensional vector space.

94 Thank You!
