Multi-label Classification. Jingzhou Liu Dec

1 Multi-label Classification Jingzhou Liu Dec

2 Introduction Multi-class problem: training data $(x_i, y_i)_{i=1}^n$, $x_i \in X \subseteq \mathbb{R}^D$, $y_i \in Y = \{1, 2, \dots, L\}$. Learn a mapping $f: X \to Y$; each instance $x_i$ is associated with a single relevant label $y_i$. Multi-label problem: training data $(x_i, y_i)_{i=1}^n$, $x_i \in X \subseteq \mathbb{R}^D$, $y_i \in \{0,1\}^L$. Learn a mapping $g: X \to \{0,1\}^L$; each instance $x_i$ is associated with a set of relevant labels, denoted by the label vector $y_i$.

3 Introduction Applications: text categorization, image/video annotation, query/keyword suggestion, recommender systems.

4 Overview Introduction Baseline Methods Key Challenges Embedding-based Methods Tree-based Methods LETOR method Conclusion

5 1-vs-All Train a binary classifier for each label. For label j, construct training data $D_j = \{(x_i, y_{ij}) : i = 1..n\}$ and learn a binary classifier $g_j: X \to \mathbb{R}$. For an unseen instance x, predict its associated label set Y by running all L individual binary classifiers: $Y = \{j : g_j(x) > 0\}$. Training: $O(L \cdot F_C(n, D))$; testing: $O(L \cdot H_C(D))$, where $F_C$ and $H_C$ denote the training and testing costs of one binary classifier.
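A minimal sketch of this recipe, assuming X is an (n, D) feature matrix, Y is an (n, L) binary indicator matrix, and logistic regression is used as the binary base learner (the choice of base learner is an assumption, not from the slides):

```python
# 1-vs-All sketch: one independent binary classifier per label.
# Assumes every label column of Y has both positive and negative examples.
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, Y):
    """Train g_j : X -> R for each label j (the D_j construction above)."""
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
            for j in range(Y.shape[1])]

def predict_one_vs_all(classifiers, x):
    """Predicted label set Y = {j : g_j(x) > 0}."""
    x = x.reshape(1, -1)
    return [j for j, g in enumerate(classifiers)
            if g.decision_function(x)[0] > 0]
```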

6 1-vs-1 Train a binary classifier for each pair of labels. For each pair of labels (j, k), construct training data $D_{jk} = \{(x_i, \mathbb{1}[y_{ij} > y_{ik}]) : y_{ij} \ne y_{ik}, i = 1..n\}$ and learn a binary classifier $g_{jk}: X \to \mathbb{R}$. For an unseen instance x, take the overall votes on each label: $\varphi(x, j) = \sum_{k < j} \mathbb{1}[g_{kj}(x) \le 0] + \sum_{k > j} \mathbb{1}[g_{jk}(x) > 0]$. Training: $O(L^2 \cdot F_C(n, D))$; testing: $O(L^2 \cdot H_C(D))$.
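A corresponding 1-vs-1 sketch under the same assumed data shapes; each pairwise classifier is trained only on instances where the two labels disagree, and votes are accumulated at test time:

```python
# 1-vs-1 sketch: one classifier per label pair (j, k), trained on the
# instances where y_ij != y_ik, with majority voting at prediction time.
from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_one(X, Y):
    models = {}
    for j, k in combinations(range(Y.shape[1]), 2):
        mask = Y[:, j] != Y[:, k]            # instances where the pair disagrees
        if len(np.unique(Y[mask, j])) == 2:  # need both sides of the contest
            models[(j, k)] = LogisticRegression(max_iter=1000).fit(
                X[mask], Y[mask, j])
    return models

def vote(models, x, L):
    """phi(x, j): number of pairwise contests won by label j."""
    x = x.reshape(1, -1)
    votes = np.zeros(L, dtype=int)
    for (j, k), g in models.items():
        winner = j if g.decision_function(x)[0] > 0 else k
        votes[winner] += 1
    return votes
```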

7 ML-kNN For an unseen instance x, let N(x) denote its k nearest neighbors, and calculate $C_j = \sum_{x' \in N(x)} \mathbb{1}[y'_j = 1]$. Predicted label set $Y = \{j : P(H_j \mid C_j) / P(\neg H_j \mid C_j) > 1\}$, where $H_j$ is the event that x has label j. Training: $O(n^2 D + Lnk)$; testing: $O(nD + Lk)$.
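A simplified sketch of the neighbor-counting core above; the full ML-kNN method replaces the fixed majority threshold used here with the posterior ratio $P(H_j \mid C_j)/P(\neg H_j \mid C_j)$ estimated from training data:

```python
# Simplified ML-kNN-style prediction: compute C_j, the number of the k
# nearest neighbors of x carrying label j, and keep labels carried by a
# majority of neighbors (a stand-in for the Bayesian posterior-ratio test).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mlknn_predict(X_train, Y_train, x, k=10):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    counts = Y_train[idx[0]].sum(axis=0)     # C_j for each label j
    return np.where(counts > k / 2)[0]       # predicted label set
```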

8 Decision Tree Starting from the root, identify the feature and the corresponding splitting value that maximize the information gain: $IG(T, l, v) = H(T) - \frac{|T_+|}{|T|} H(T_+) - \frac{|T_-|}{|T|} H(T_-)$, where $T_+ = \{(x_i, y_i) : x_i^{(l)} > v\}$ and $T_- = \{(x_i, y_i) : x_i^{(l)} \le v\}$. For an unseen instance x, traverse the tree until reaching a leaf node, and predict the labels that are in the majority among instances in that leaf. Training: $O(nDLh)$; testing: $O(h + L_{leaf} n_{leaf})$.
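A sketch of the information-gain computation for one candidate split (l, v). Taking H(T) as the sum of per-label Bernoulli entropies is one common multi-label choice, and an assumption here; the slide does not fix H:

```python
# Information gain of splitting node data (X, Y) on feature l at value v.
# H(T) is taken as the sum of per-label Bernoulli entropies.
import numpy as np

def label_entropy(Y):
    p = np.clip(Y.mean(axis=0), 1e-12, 1 - 1e-12)  # per-label frequency
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p)).sum()

def information_gain(X, Y, l, v):
    plus = X[:, l] > v                    # membership in T_+ vs T_-
    if plus.all() or (~plus).all():
        return 0.0                        # degenerate split
    return (label_entropy(Y)
            - plus.mean() * label_entropy(Y[plus])
            - (~plus).mean() * label_entropy(Y[~plus]))
```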

9 Overview Introduction Baseline Methods Key Challenges Embedding-based Methods Tree-based Methods LETOR method Conclusion

10 Key Challenges Consider the output space: $2^L$! If we treat labels separately, we lose correlations among labels. 1-vs-All, ML-kNN, Decision Tree: no correlations; 1-vs-1: pair-wise correlations only. It is crucial to exploit label-correlation information.

11 Key Challenges In practice, n and L are large (on the order of millions) and D is usually large but sparse. We need testing time to be sublinear in L.
Method: Training / Testing
1-vs-All: $O(L \cdot F_C(n, D))$ / $O(L \cdot H_C(D))$
1-vs-1: $O(L^2 \cdot F_C(n, D))$ / $O(L^2 \cdot H_C(D))$
ML-kNN: $O(n^2 D + Lnk)$ / $O(nD + Lk)$
Decision Tree: $O(nDLh)$ / $O(h + L_{leaf} n_{leaf})$

12 Overview Introduction Baseline Methods Key Challenges Embedding-based Methods Tree-based Methods LETOR method Conclusion

13 Embedding-based Methods Reduce label and/or feature dimensions. Given a training example $(x_i, y_i)$, compress $y_i$ to an $\hat{L}$-dimensional space: $z_i = U y_i$, $U \in \mathbb{R}^{\hat{L} \times L}$ (e.g. via SVD). Learn $V \in \mathbb{R}^{\hat{L} \times D}$ s.t. $V x_i \approx z_i = U y_i$. For an unseen instance x, recover the label vector $y = U^\dagger(V x)$, where $U^\dagger(\cdot)$ is a decompression function (e.g. matrix multiplication, or one that involves some learning).
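A minimal sketch of this compress/regress/decompress pipeline, using an SVD of the label matrix for U and ordinary least squares for V; this is just one possible instantiation of the scheme, with data shapes assumed as before:

```python
# Embedding sketch: z_i = U y_i with U from a rank-Lhat SVD of Y, a linear
# map V with V x_i ~ z_i, and decompression y ~ U^T (V x) at test time
# (U has orthonormal rows, so U^T acts as its pseudo-inverse).
import numpy as np

def fit_embedding(X, Y, L_hat):
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    U = Vt[:L_hat]                              # (L_hat, L) compression matrix
    Z = Y @ U.T                                 # compressed labels, (n, L_hat)
    V = np.linalg.lstsq(X, Z, rcond=None)[0]    # (D, L_hat): X @ V ~ Z
    return U, V

def predict(U, V, x, threshold=0.5):
    scores = (x @ V) @ U                        # decompress to label space
    return np.where(scores > threshold)[0]
```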

14 Low-rank Empirical Risk Minimization for Multi-label Learning (LEML) Yu et al. ICML 2014 Directly optimize for $y = U^\dagger(V x)$. Prediction is parameterized as $f(x; Z) = Z^T x$, where $Z \in \mathbb{R}^{D \times L}$, and the loss function is decomposable: $\ell(y, f(x; Z)) = \sum_{j=1}^L \ell(y_j, z_j^T x)$, where $z_j$ is the j-th column of Z.

15 Low-rank Empirical Risk Minimization for Multi-label Learning (LEML) Yu et al. ICML 2014 To exploit label correlations, assume Z is controlled by only a small number of latent factors: $\hat{Z} = \arg\min_Z J(Z) = \sum_{i=1}^n \sum_{j=1}^L \ell(y_{ij}, f_j(x_i; Z)) + \lambda \cdot r(Z)$ s.t. $\mathrm{rank}(Z) \le k$, where $r(\cdot)$ is a regularizer.

16 Low-rank Empirical Risk Minimization for Multi-label Learning (LEML) Yu et al. ICML 2014 Decompose $Z = W H^T$, where $W \in \mathbb{R}^{D \times k}$, $H \in \mathbb{R}^{L \times k}$, and $r(Z) = \frac{1}{2}(\|W\|_F^2 + \|H\|_F^2)$: $\arg\min_{W,H} J(W, H) = \sum_{i=1}^n \sum_{j=1}^L \ell(y_{ij}, x_i^T W h_j) + \frac{\lambda}{2}(\|W\|_F^2 + \|H\|_F^2)$. Alternately solve $\min_H J(W^{fixed}, H)$ and $\min_W J(W, H^{fixed})$.
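A sketch of this alternating scheme for the squared-loss case, which is a simplification: LEML handles general decomposable losses and missing labels. The H-subproblem has a ridge-regression closed form; the W-subproblem is handled here with a few gradient steps:

```python
# Alternating minimization sketch for squared loss:
# J(W, H) = ||Y - X W H^T||_F^2 + lam (||W||_F^2 + ||H||_F^2).
import numpy as np

def leml_squared_loss(X, Y, k, lam=0.1, outer=20, inner=10, lr=1e-3):
    rng = np.random.default_rng(0)
    D, L = X.shape[1], Y.shape[1]
    W = rng.normal(scale=0.01, size=(D, k))
    H = rng.normal(scale=0.01, size=(L, k))
    for _ in range(outer):
        # Fix W: closed-form ridge solution for H.
        A = X @ W                                             # (n, k)
        H = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ Y).T
        # Fix H: a few gradient steps on W.
        for _ in range(inner):
            R = X @ W @ H.T - Y                               # residual
            W -= lr * 2 * (X.T @ R @ H + lam * W)
    return W, H          # prediction: scores = x @ W @ H.T
```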

17-18 Low-rank Empirical Risk Minimization for Multi-label Learning (LEML) Yu et al. ICML 2014 [figure-only slides]

19 Low-rank Empirical Risk Minimization for Multi-label Learning (LEML) Yu et al. ICML 2014 Exploits label correlations through a low-rank assumption. A general ERM framework. Fairly scalable. However, prediction $f(x; Z) = H W^T x$ still has $O(k(L + D))$ complexity, and information is lost during compression/decompression.

20 Sparse Local Embedding for Extreme Classification (SLEEC) Bhatia et al. NIPS 2015 Project feature and label vectors to the same $\hat{L}$-dimensional space: $x_i \to V x_i$, $V \in \mathbb{R}^{\hat{L} \times D}$, and $y_i \to z_i$, $z_i \in \mathbb{R}^{\hat{L}}$, s.t. $V x_i \approx z_i$ and $z_i$ preserves local properties of $y_i$. At test time, for an unseen instance x, perform kNN search in the $\hat{L}$-dimensional space with $V x$. Testing can be sped up by clustering the training examples.

21 Sparse Local Embedding for Extreme Classification (SLEEC) Bhatia et al. NIPS 2015 Instead of using a global projection that tries to preserve distances between all pairs of label vectors, preserve only those between nearest neighbors. Let Ω denote the set of nearest-neighbor pairs: $(i, j) \in \Omega \iff j \in N(i)$. $\min_Z \|P_\Omega(Y^T Y) - P_\Omega(Z^T Z)\|_F^2 + \lambda \|Z\|_1$, and $\min_V \|P_\Omega(Y^T Y) - P_\Omega(X^T V^T V X)\|_F^2 + \lambda \|V\|_F^2 + \mu \|V X\|_1$.

22 Sparse Local Embedding for Extreme Classification (SLEEC) Bhatia et al. NIPS 2015 Divide the optimization into two phases. First, $\min_Z \|P_\Omega(Y^T Y) - P_\Omega(Z^T Z)\|_F^2 = \min_{M \succeq 0,\ \mathrm{rank}(M) \le \hat{L}} \|P_\Omega(Y^T Y) - P_\Omega(M)\|_F^2$. Then, $\min_V \|Z - V X\|_F^2 + \lambda \|V\|_F^2 + \mu \|V X\|_1$.
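A simplified sketch of phase one in the singular value projection (SVP) spirit: gradient steps on the observed neighbor pairs, followed by projection onto rank-$\hat{L}$ PSD matrices. The paper's actual solver includes further refinements; phase two would then fit V by regressing Z on X with the additional sparsity term.

```python
# SVP-style sketch for SLEEC phase 1: find M ~ Z^T Z of rank L_hat matching
# the label Gram matrix on nearest-neighbor pairs Omega, then factor out Z.
import numpy as np

def svp_embed(Y, Omega, L_hat, eta=0.5, iters=50):
    n = Y.shape[0]
    G = Y @ Y.T                                   # Gram matrix of label vectors
    mask = np.zeros((n, n))
    rows, cols = zip(*Omega)                      # Omega: list of (i, j) pairs
    mask[list(rows), list(cols)] = 1.0
    mask = np.maximum(mask, mask.T)               # symmetrize neighbor pairs
    M = np.zeros((n, n))
    for _ in range(iters):
        M = M + eta * mask * (G - M)              # step on observed entries
        vals, vecs = np.linalg.eigh(M)            # rank-L_hat PSD projection
        top = np.argsort(vals)[-L_hat:]
        vals, vecs = np.clip(vals[top], 0, None), vecs[:, top]
        M = (vecs * vals) @ vecs.T
    return (vecs * np.sqrt(vals)).T               # Z, (L_hat, n): M = Z^T Z
```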

23-24 Sparse Local Embedding for Extreme Classification (SLEEC) Bhatia et al. NIPS 2015 [figure-only slides]

25 Sparse Local Embedding for Extreme Classification (SLEEC) Bhatia et al. NIPS 2015 Preserves only local distances of label vectors, through a nonlinear mapping. Uses a kNN classifier and clustering at prediction time, avoiding a decompression step.

26 Overview Introduction Baseline Methods Key Challenges Embedding-based Methods Tree-based Methods LETOR method Conclusion

27 Tree-based Methods Learn a hierarchy over the feature space, and assign a subset of labels to each leaf node. At test time, an unseen instance traverses from the root to its corresponding leaf, and prediction is restricted to the labels in that leaf. These methods enjoy efficiency in both training and testing when the tree is balanced and the number of instances in each leaf is relatively small.

28 Multi-Label Random Forests (MLRF) Agrawal et al. WWW 2013 Learn an ensemble of decision trees. At each split, minimize the Gini index over a set of randomly selected features. At test time, label distributions are aggregated over the leaf nodes of each tree.

29 Label Partitioning for Sublinear Ranking (LPSR) Weston et al. ICML 2013 Tries to address the slow-prediction problem. Suppose we have already trained a base scorer f(x, y) for an instance x and a single label y, and write $f(x) = (f(x, 1), \dots, f(x, L))^T$. Two components: an input partitioner and a label assignment.

30 Label Partitioning for Sublinear Ranking (LPSR) Weston et al. ICML 2013 Input partitioner, over the feature space: examples for which the base scorer performs well should be prioritized. Weighted hierarchical k-means: $\min \sum_{i=1}^n P(f(x_i), y_i)\, \|x_i - c_{l_i}\|^2$, where $l_i$ is the cluster assignment of $x_i$, $c_j$ is the centroid of the j-th cluster, and $P(\cdot, \cdot)$ is a metric to maximize (e.g. precision@k).
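A sketch of one level of this partitioner as weighted k-means, where each instance's weight is its base-scorer quality $P(f(x_i), y_i)$, e.g. precision@k; hierarchical partitioning would apply this recursively. Weights computed elsewhere are assumed given:

```python
# Weighted k-means sketch: instances on which the base scorer already does
# well (high weight) pull the centroids more strongly.
import numpy as np

def weighted_kmeans(X, weights, n_clusters, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), n_clusters, replace=False)].astype(float)
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)  # squared dists
        assign = d2.argmin(axis=1)                            # cluster l_i
        for j in range(n_clusters):
            m = assign == j
            if m.any():
                w = weights[m][:, None]
                C[j] = (w * X[m]).sum(axis=0) / w.sum()       # weighted centroid
    return assign, C
```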

31 Label Partitioning for Sublinear Ranking (LPSR) Weston et al. ICML 2013 Label assignment: assign a set of labels $\alpha \in \{0,1\}^L$ to each leaf s.t. precision in each leaf is maximized, through a relaxed integer program; this directly optimizes the measure of interest. At test time, rank the labels assigned to the leaf according to the base scorer. Base scorer training: $O(L \cdot F_C(n, D))$; testing: $O(L_{leaf} \cdot H_C(D))$.

32 Fast extreme Multi-label Learning (FastXML) Prabhu & Varma KDD 2014 Directly optimize the measure of interest (nDCG). Jointly learn the input partitioner (a hyperplane) and the label assignment (a ranking of labels). Learn an ensemble of trees instead of base scorers. Aggregate label distributions over all leaf nodes.

33 Fast extreme Multi-label Learning (FastXML) Prabhu & Varma KDD 2014 $\min_{w, \delta, r^+, r^-} \|w\|_1 + C_\delta \sum_{i=1}^n \log(1 + e^{-\delta_i w^T x_i}) - C_r \sum_{i=1}^n \frac{1}{2}(1 + \delta_i)\, L_{nDCG}(r^+, y_i) - C_r \sum_{i=1}^n \frac{1}{2}(1 - \delta_i)\, L_{nDCG}(r^-, y_i)$. Here $w \in \mathbb{R}^D$ is the hyperplane cutting the current feature space, $\delta_i \in \{+1, -1\}$ indicates which half each instance is assigned to, and $r^+, r^-$ are rankings of labels assigned to each half.

34 Fast extreme Multi-label Learning (FastXML) Prabhu & Varma KDD 2014 (Same objective as slide 33.) If w and δ are fixed, optimizing for $r^+$ and $r^-$ costs $O(n\hat{L} + \hat{L} \log \hat{L})$. If w and $r^\pm$ are fixed, optimizing for δ costs $O(n\hat{L})$.

35 Fast extreme Multi-label Learning (FastXML) Prabhu & Varma KDD 2014 (Same objective as slide 33.) If $r^\pm$ and δ are fixed, optimizing for w is an $\ell_1$-regularized logistic regression, the most costly step. So fix w and alternately update δ and $r^\pm$, performing an update on w only when updating δ and $r^\pm$ no longer decreases the objective.
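A sketch of the cheap ranking step above: with w and δ fixed, the ranking for each side is obtained by sorting labels by aggregated, per-instance normalized relevance of the instances assigned to that side. This is a simplified stand-in for the paper's nDCG-weighted aggregation:

```python
# One alternating step in FastXML-style node optimization: given the partition
# delta in {+1, -1}^n, rank labels for one side by summed normalized relevance.
import numpy as np

def side_ranking(Y, delta, side=+1):
    rows = Y[delta == side]
    # normalize each label vector so heavily-labeled instances don't dominate
    norm = rows / np.maximum(rows.sum(axis=1, keepdims=True), 1)
    return np.argsort(-norm.sum(axis=0))       # best-first label ranking r
```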

36-37 Fast extreme Multi-label Learning (FastXML) Prabhu & Varma KDD 2014 [figure-only slides]

38 Fast extreme Multi-label Learning (FastXML) Prabhu & Varma KDD 2014 Exploits label correlations by directly optimizing a rank-sensitive loss function (nDCG). A linear separator partitions each node of the tree. Jointly learns the partition and the label assignment. Sublinear in L at prediction time.

39 Embedding-based vs Tree-based Embedding-based: simple, with strong theoretical foundations, and easy to handle label correlations; but loses information during compression/decompression and is slow at prediction. Tree-based: efficient; but large model size and not theoretically well supported.

40 Overview Introduction Baseline Methods Key Challenges Embedding-based Methods Tree-based Methods LETOR Method Conclusion

41 Multilabel classification with meta-level features in a learning-to-rank framework Yang & Gopal Mach Learn 2012 The methods discussed above focus on low-level features that do not characterize instance-label relationships; low-level features may not be expressive enough for learning the instance-label mapping. Construct meta-level features based on both the original instance features and the labeled instances, and reformulate the problem as a standard learning-to-rank problem.

42 Multilabel classification with meta-level features in a learning-to-rank framework Yang & Gopal Mach Learn 2012 Meta-level features: $\varphi(x, y)$, $x \in \mathbb{R}^D$, $y \in \{1, \dots, L\}$. $\varphi(x, y)$ should carry information about the relation of x to y, with discriminative power. Obtain $\varphi(x, y)$ directly from a given instance and a set of labeled instances: $\varphi_{L_2}(x, y) = (d_{L_2}(x, x_{(1)}), \dots, d_{L_2}(x, x_{(k)}))$, where $x_{(1)}, \dots, x_{(k)}$ are the k nearest neighbors of x in $L_2$ distance among the instances labeled with y. Similar features $\varphi_{dot}(x, y)$ and $\varphi_{cos}(x, y)$ can be constructed with other similarity measures.

43 Multilabel classification with meta-level features in a learning-to-rank framework Yang & Gopal Mach Learn 2012 Also include distances to centroids: $\varphi_{cen}(x, y) = (d_{L_2}(x, x_c), d_{cos}(x, x_c))$, where $x_c$ is the centroid of all instances with label y. Define $\varphi(x, y) = (\varphi_{L_2}(x, y), \varphi_{dot}(x, y), \varphi_{cos}(x, y), \varphi_{cen}(x, y))$ and plug it into any learning-to-rank algorithm: SVM-Rank, SVM-MAP, ListNet, etc.
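A sketch of this feature construction for one (x, y) pair, covering the $L_2$-distance block and the centroid block; the dot-product and cosine blocks would be built the same way with their respective similarity measures:

```python
# Meta-level features for a candidate label: distances from x to its k nearest
# neighbors among instances carrying that label, plus distance to the label's
# centroid. Assumes the label has at least k positive training instances.
import numpy as np

def meta_features(x, X_train, Y_train, label, k=5):
    pos = X_train[Y_train[:, label] == 1]       # instances labeled with y
    d = np.linalg.norm(pos - x, axis=1)         # L2 distances
    phi_knn = np.sort(d)[:k]                    # phi_L2: k smallest distances
    centroid = pos.mean(axis=0)
    phi_cen = np.array([np.linalg.norm(x - centroid)])
    return np.concatenate([phi_knn, phi_cen])   # feed into SVM-Rank etc.
```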

44 Multilabel classification with meta-level features in a learning-to-rank framework Yang & Gopal Mach Learn 2012 [figure-only slide]

45 Multilabel classification with meta-level features in a learning-to-rank framework Yang & Gopal Mach Learn 2012 Training: nL pairs of meta-level features. Testing: features would need to be constructed for all L labels; instead, restrict the label space to $\hat{L}$ candidates generated by other systems and re-rank these candidates.

46 Overview Introduction Baseline Methods Key Challenges Embedding-based Methods Tree-based Methods LETOR method Conclusion

47 Conclusion Multi-label classification is fundamentally different from multi-class classification. It is crucial for the success of multi-label classification methods to exploit label correlations. Most baseline methods do not scale to large problems in practice. Embedding-based and tree-based methods try to tackle these challenges from different angles, e.g. SLEEC and FastXML.

48 Label Partitioning for Sublinear Ranking (LPSR) Weston et al. ICML 2013 Label assignment as a (relaxed) integer program: $\max_\alpha \sum_i \frac{1}{|y_i|} \sum_{j \in y_i} \alpha_j\, \Phi(R_{i,j})$ s.t. $\alpha \in \{0,1\}^L$, where $R_{i,j}$ is the rank of label j for instance i according to the base scorer and $\Phi(\cdot)$ is a rank-discount function.
