Semi-Supervised SVMs for Classification with Unknown Class Proportions and a Small Labeled Dataset


S. Sathiya Keerthi, Yahoo! Labs, Santa Clara, California, U.S.A. (selvarak@yahoo-inc.com)
Bigyan Bhar, Indian Institute of Science, Bangalore, India (bigyan@csa.iisc.ernet.in)
Shirish Shevade, Indian Institute of Science, Bangalore, India (shirish@csa.iisc.ernet.in)
S. Sundararajan, Yahoo! Labs, Bangalore, India (ssrajan@yahoo-inc.com)

ABSTRACT

In the design of practical web page classification systems one often encounters a situation in which the labeled training set is created by choosing some examples from each class; but the class proportions in this set are not the same as those in the test distribution to which the classifier will actually be applied. The problem is made worse when the amount of training data is also small. In this paper we explore and adapt binary SVM methods that make use of unlabeled data from the test distribution, viz., Transductive SVMs (TSVMs) and expectation regularization/constraint (ER/EC) methods, to deal with this situation. We empirically show that, when the labeled training data is small, the TSVM designed using the class ratio tuned by minimizing the loss on the labeled set yields the best performance; its performance is good even when the deviation between the class ratios of the labeled training set and the test set is quite large. When the labeled training data is sufficiently large, an unsupervised Gaussian mixture model can be used to get a very good estimate of the class ratio in the test set; when this estimate is used, both TSVM and EC/ER give their best possible performance, with TSVM coming out superior. The ideas in the paper can be easily extended to multi-class SVMs and MaxEnt models.

Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Design Methodology - Classifier design and evaluation

General Terms: Algorithms, Experimentation

Keywords: Transductive and semi-supervised learning, classification, support vector machines

1. INTRODUCTION

The problem of classifying web pages into a given set of classes arises frequently in web applications. Linear classifiers such as support vector machines (SVMs) and maximum entropy (MaxEnt) models employing a rich feature representation (e.g., bag-of-words) are used successfully for this purpose. When labeling is expensive one is forced to work with a small set of labeled examples for training. In such cases a linear classifier trained using only labeled data does not give good performance. Performance can be significantly boosted by employing semi-supervised methods that make effective use of unlabeled data, which is usually available in plenty. The transductive support vector machine (TSVM) [10] and MaxEnt models trained with expectation regularization (ER) [12] are good examples of such methods. These semi-supervised methods assume knowledge of auxiliary information about the underlying data, such as class proportions. The performance of semi-supervised methods such as ER and TSVM is sensitive to the class proportion values used to find their solution; see section 4.2 for a detailed empirical analysis.
In many practical situations, class proportions are unknown. Often it also happens that the class proportions in the labeled training set are different from those in the actual distribution to which the classifier will be applied. A few examples of such scenarios are as follows.

- There are tens of thousands of city aggregator sites covering categories such as dining, attractions, night life, etc. The content and vocabulary across these sites are very similar, but in each site the proportion of examples belonging to the various classes can be quite different. So, building a separate classifier for each site is appropriate. Let us say that we are given a few global examples of dining and non-dining pages. The aim is to develop a classifier for each site to separate dining pages from the others, making use of unlabeled web pages in that site.

- Frequently, in classifier design, one would like to reuse labeled data from a previous run to develop a classifier for a new distribution (e.g., a new stream of news pages); the old labeled set is still good as the union of sets of samples from the individual classes, but the class proportions are different.

- A user shows a few relevant and irrelevant example

web pages and asks for relevant pages to be found from a fresh collection of web pages.

The main aim of this paper is to address the problem of mismatched class proportions mentioned above. Let x denote a generic example web page and c denote a class. From a probabilistic viewpoint our problem consists of dealing with a situation in which: (i) p(x|c) is the same in the labeled data and the actual data of interest, for all c; but (ii) {p(c)} is very different in the two sets. Coupled with the labeled data being small, the problem becomes difficult. Note that, even if the labeled data is not small, extending methods such as TSVM and ER to deal with mismatched class proportions requires careful thought, because the labeled data does not carry information on the actual class proportions and unlabeled data has to be suitably brought in to estimate them.

There have been some previous efforts to address this problem for TSVMs [7, 14]. But these methods have some basic issues: they make ad hoc changes to the training process in an active self-learning mode and are not cleanly tied to a well-defined problem formulation; they deviate significantly from the TSVM formulation and hence perform quite a bit worse than TSVMs when the labeled data and the actual distribution are indeed well matched in terms of class proportions; and they have not been demonstrated in difficult situations involving large distortions in class proportions.

In this paper we empirically analyze various issues related to the problem of mismatched class proportions, propose ways of estimating actual class proportions and demonstrate their usefulness. We only take up binary classification problems in this paper, but the ideas are quite general and have the potential for extension to multi-class settings. Since binary classification involves only two classes (positive and negative), the class proportion can be represented using a single quantity, f, the fraction of positive examples.

1.1 Contributions of the paper

In a nutshell, the following are our main contributions, listed in order of treatment in the paper.

1. The ER method is introduced in the context of SVMs. In addition we introduce two related methods based on expectation constraints (EC) and simple threshold adjustment (SVMTh).

2. We empirically analyze all the methods (TSVM, ER, EC and SVMTh) in terms of their learning curves and their sensitivity to incorrect specification of f.

3. If the labeled data has a distorted class proportion (even a severe one), but the actual f happens to be known, we empirically show that the methods are pretty much unaffected.

4. To handle the crucial case in which the actual f is unknown, we propose and empirically analyze two methods for estimating the actual f. The first estimation method is based on finding the f having the least labeled loss along a trajectory of solutions given by a method as f is varied. The second estimation method is based on fitting a mixture of two Gaussians to the output of the SVM trained using only labeled data. The first estimation method works well with TSVM and is suited to the situation where the labeled data is small, while the second estimation method is powerful and works well with all semi-supervised methods when the labeled data is large. Overall, TSVM combined with the two estimation methods (suitably switched depending on the amount of labeled data available) is the top performing method.

2. DESCRIPTION OF METHODS

In this section we describe all the methods that will be analyzed in this paper. These methods are meant for binary classification.
Labeled data consists of l examples {(x_i, y_i)}_{i=1}^{l}, where the input patterns x_i ∈ R^d are feature vectors representing web pages and the labels y_i ∈ {+1, −1}. Semi-supervised/transductive methods make use of unlabeled examples in addition; unlabeled data consists of u examples {x_i}_{i=l+1}^{l+u}. All the methods develop a linear classification function w^T x, with {x : w^T x > 0} denoting the positive class region. Web page classification problems involve a large feature space (d is large); the input patterns are sparse, i.e., the fraction of non-zero elements in one x_i is small. Unlabeled data is always a random sample picked from the actual distribution to which the classifier will be applied; labeled data, on the other hand, may have a class proportion which is different from that in the actual distribution.

2.1 Supervised SVMs

Supervised SVMs make use of labeled data and optimize the regularized large margin loss function:

    min_w  T_sup = (λ/2) ||w||^2 + (1/l) Σ_{i=1}^{l} L(y_i, o_i)        (1)

where o_i = w^T x_i. An example of the large margin loss function is the squared hinge loss, L(y, o) = max(0, 1 − yo)^2. Alternatively one can use the hinge loss: L(y, o) = max(0, 1 − yo). All the experiments in this paper are done with the squared hinge loss. Irrespective of which loss is used, there exist very efficient numerical methods [11, 13] for solving (1); the running time of these methods is linear in the number of examples. On web page and text classification tasks the performance of the SVM classifier is quite steady over a large range of values of the regularization coefficient λ; a single fixed value of λ is used in all the experiments of the paper.
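As an illustration, the following is a minimal sketch of solving (1) with an off-the-shelf solver; it assumes scikit-learn (not something the paper uses) and the correspondence C = 1/(λl) between (1) and the un-averaged squared-hinge objective that LinearSVC minimizes. The helper name train_lsvm is our own.

    # Sketch of the supervised SVM (1) via scikit-learn's LinearSVC, whose
    # objective (1/2)||w||^2 + C * sum_i max(0, 1 - y_i o_i)^2 matches (1)
    # when C = 1 / (lambda * l).
    import numpy as np
    from sklearn.svm import LinearSVC

    def train_lsvm(X, y, lam=1.0):
        C = 1.0 / (lam * X.shape[0])
        clf = LinearSVC(loss="squared_hinge", C=C, fit_intercept=False)
        clf.fit(X, y)
        return clf.coef_.ravel()   # w; scores are o = X @ w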

2.2 Transductive SVMs

Transductive/Semi-Supervised SVMs (TSVMs) make effective use of unlabeled data to enhance the performance of SVMs. These methods perform very well on web page and text classification problems. Even with just a few labeled examples they can combine the labeled data with a large unlabeled set to attain a performance equal to that of a supervised SVM designed using a large labeled set. Through unlabeled examples, many features which are completely absent in the labeled set end up getting very good weights. TSVMs are based on the cluster (or low density separation) assumption, which states that the decision boundary should not cross high density regions, but instead lie in low density regions. Joachims [10] gave the first effective TSVM formulation. (Good software for supervised and transductive SVMs can be found at vikas.sindhwani.org/svmlin.html.) A key ingredient of this formulation is that it assumes that f, the fraction of positive examples in the actual distribution, is known, and that the corresponding constraint is enforced in the formulation. (We will refer to this constraint as the fraction constraint. Also, when applied in a discrete setting, e.g., the fraction constraint in (2), it is assumed that appropriate rounding off is allowed.) Joachims [10] solved the following problem in which, apart from w, the set of labels of the unlabeled examples, {y_i}, are also variables:

    min_{w, {y_i}}  T_sup + (1/u) Σ_{i=l+1}^{l+u} L(y_i, o_i)
    s.t.  (1/u) Σ_{i=l+1}^{l+u} [y_i == 1] = f        (2)

where [z] is 1 if z is true and 0 otherwise. Unlike (1), (2) is not a convex minimization problem. In general, it is hard to solve. It has been pointed out [5] that the solution of (2) can get stuck in poor local minima. Fortunately, this is not the case in linear classification problems involving a large feature space, such as web page and text classification. TSVMs are routinely used to solve applied problems [4, 3]. Alternating optimization iterations (fix w and optimize {y_i}, then fix {y_i} and optimize w) are usually used to solve (2). There also exist variations of the TSVM algorithm. For example, the discrete variables {y_i} can be eliminated from (2) to get the following equivalent problem:

    min_w  T_sup + (1/u) Σ_{i=l+1}^{l+u} min{L(1, o_i), L(−1, o_i)}
    s.t.  (1/u) Σ_{i=l+1}^{l+u} [o_i > 0] = f        (3)

Sigmoid smoothing or other methods can be applied to write [o_i > 0] in a differentiable form. Then the solution can be approached via gradient based optimization of w. See [6, 13, 5] for methods of this kind. Most often this approach yields a performance that is slightly better than (2).
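To make the alternating scheme concrete, here is a simplified sketch of one way to alternate between the two steps of (2); it reuses the hypothetical train_lsvm helper from section 2.1 and omits details that matter in practice, such as annealing the weight on the unlabeled loss and separately normalizing the labeled and unlabeled loss terms. With w fixed, the labels that minimize the unlabeled loss under the fraction constraint are obtained by making the top f*u scores positive, since L(1, o) − L(−1, o) decreases monotonically in o.

    # Simplified alternating optimization for (2).
    import numpy as np

    def tsvm_train(Xl, yl, Xu, f, lam=1.0, n_iters=10):
        w = train_lsvm(Xl, yl, lam)
        k = int(round(f * Xu.shape[0]))           # fraction constraint
        for _ in range(n_iters):
            yu = -np.ones(Xu.shape[0])
            yu[np.argsort(-(Xu @ w))[:k]] = 1.0   # top-k scores -> positive
            w = train_lsvm(np.vstack([Xl, Xu]), np.hstack([yl, yu]), lam)
        return w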
2.3 Expectation Methods

Mann and McCallum [12] proposed the Expectation Regularization (ER) method in the context of MaxEnt models. The idea is to use expectation terms related to some domain knowledge to influence the training process. This can easily be done with SVM models too, as we do here. In our case the domain knowledge of interest is the fact that the fraction of positively classified examples in unlabeled data equals f. This fraction constraint can be used to influence the solution by including an additional regularization term in the objective function:

    min_w  T_sup + (ρ/2) ( (1/u) Σ_{i=l+1}^{l+u} [o_i > 0] − f )^2        (4)

As mentioned earlier with respect to (3), sigmoid smoothing can be used to make the expectation regularization term differentiable, so that gradient based numerical optimization techniques can be employed. In our implementation of ER we use ρ = 5 as the default value. Later we will also study the effect of varying ρ.

Note that in (4) the expectation constraint on the fraction of positive examples is only approximately enforced, since it is included only as a regularizer term. If the domain knowledge says that the fraction constraint holds with certainty, then it may be better to enforce the constraint rather than add a regularizer. This leads to a new method which we call the Expectation Constraints (EC) method. The optimization problem to be solved is:

    min_w  T_sup   s.t.   (1/u) Σ_{i=l+1}^{l+u} [o_i > 0] = f        (5)

Again, sigmoid smoothing can be used to write the fraction constraint function in a differentiable form. A suitable numerical method that can deal with equality constraints can be used to solve for w. In our implementation we use the Augmented Lagrangian method [1].

Let us introduce another expectation based baseline method that has been ignored in the literature. The method consists of taking the supervised SVM solution w (i.e., the solution of (1)) and adding a threshold θ to the scoring function so that the fraction of unlabeled examples that satisfy w^T x + θ > 0 equals f. The new classifier boundary is defined by w^T x + θ = 0. We will refer to this method as SVMTh.
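SVMTh reduces to a one-line post-processing step; a minimal sketch, with the same hypothetical helpers as above:

    # SVMTh sketch: choose theta so that a fraction f of the unlabeled
    # scores satisfy w^T x + theta > 0, i.e., theta is minus the
    # (1 - f)-quantile of the unlabeled scores.
    import numpy as np

    def svmth(w, Xu, f):
        theta = -np.quantile(Xu @ w, 1.0 - f)
        return theta   # predict positive when x @ w + theta > 0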

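Returning to (4), the following sketch shows one possible sigmoid-smoothed implementation of the ER objective, optimized with L-BFGS from SciPy; the smoothing width tau is our own illustrative parameter, not something specified by the method, and the gradient is derived directly from the smoothed objective.

    # Sketch of ER (4) with sigmoid smoothing of [o_i > 0].
    import numpy as np
    from scipy.optimize import minimize

    def er_train(Xl, yl, Xu, f, lam=1.0, rho=5.0, tau=1.0):
        l, d = Xl.shape
        u = Xu.shape[0]

        def obj(w):
            m = np.maximum(0.0, 1.0 - yl * (Xl @ w))   # squared-hinge margins
            s = 1.0 / (1.0 + np.exp(-(Xu @ w) / tau))  # smoothed [o > 0]
            gap = s.mean() - f
            loss = 0.5 * lam * w @ w + np.mean(m ** 2) + 0.5 * rho * gap ** 2
            grad = lam * w - (2.0 / l) * Xl.T @ (m * yl)
            grad += rho * gap * Xu.T @ (s * (1.0 - s)) / (u * tau)
            return loss, grad

        res = minimize(obj, np.zeros(d), jac=True, method="L-BFGS-B")
        return res.x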
2.4 Relations between the methods

It is easy to see that the ER and EC methods are closely related. In particular, when the parameter ρ in (4) is made very large, ER and EC are expected to show very similar behavior. SVMTh is quite different from EC although both methods enforce the fraction constraint; while EC tries to balance the minimization of T_sup with the fraction constraint, SVMTh simply adjusts the threshold to satisfy the fraction constraint without worrying about its effect on T_sup. If we compare (3) and (5) we see that TSVM has an extra term (the loss on unlabeled data). This term turns out to be important, as it helps TSVM place the classifier boundary more precisely by making it pass through low density regions of the unlabeled data; as we will see later, the performance improvement due to it is large, especially when the size of the labeled data is small. For MaxEnt models, Grandvalet and Bengio [9] suggested including an unlabeled loss term, called entropy regularization, that is similar in spirit to the unlabeled loss term in (3). However, their formulation does not include the fraction constraint. Mann and McCallum [12] conduct experiments that clearly point out that without this constraint the method does not do well. They also show that, if the constraint is included (even as a regularizer term), then the performance is greatly enhanced.

3. DATASETS AND NOTATIONS

All the empirical analyses in this paper are done using the following five text classification datasets: gcat, ccat, aut-avn, real-sim and earn. The first four datasets are the same as the ones used in [13]; earn is a binary dataset created from the Reuters-21578 dataset (testcollections/reuters21578/) by taking the examples in the earn class as positive and the remaining examples as negative. Key properties of the datasets are given in Table 1. Each dataset is divided into three parts: labeled data, unlabeled data and test data. Unlabeled data and test data have the same class proportion; let f_actual denote the fraction of positive examples in the test/unlabeled data. Labeled data can have a different class proportion depending on the experiment; note that the main aim of the paper is to find ways of dealing with a mismatch in class proportion between the labeled data and the actual distribution. Let f_lab denote the fraction of positive examples in the labeled data.

Table 1: Properties of the datasets: n (number of examples), d (data dimensionality) and f_actual (positive class ratio) for gcat, ccat, aut-avn, real-sim and earn.

Always, f_actual is set to the fraction of positive examples in the original, full dataset. Given f_lab and the number of labeled examples, n_lab, we create a random division of the full dataset into test, labeled and unlabeled data as follows. First, 5% of the examples are randomly chosen in a class-stratified fashion and kept as testing data. Let R denote the set of remaining examples. We form labeled data by choosing ⌈n_lab f_lab⌉ positive examples from R and n_lab − ⌈n_lab f_lab⌉ negative examples from R. Of the remaining examples, unlabeled data is randomly chosen to be the largest set obeying f_actual. Since the dataset sizes are large and n_lab is small, the size of the unlabeled data is large; it is several times larger than n_lab.

Performance of the methods will be measured in terms of the F score, F = 2PR/(P + R), the harmonic mean of the precision P and recall R of the positive class.

To ease understanding, we use consistent notations in all the figures. The following colors will be used to denote the various methods: Black dotted: LSVM, the SVM trained with labeled examples only; Red dotted: FullSVM, the SVM trained using all unlabeled examples taking their labels as known; Blue: TSVM, the Transductive SVM; Black: SVMTh, LSVM with threshold adjustment based on f; Green: ER, Expectation Regularization; and Red: EC, Expectation Constraints. In most figures f or f_lab is the horizontal axis. In such figures f_actual is shown as a vertical dotted line. In most figures the plots are averages over four random splits of a dataset into labeled, unlabeled and test data. Exceptions are Figures 3, 8, 9, 10 and 11, where just one random split is used. In the figures, LabSize and LabelFrac denote, respectively, the size of the labeled data, n_lab, and the fraction of positive examples in the labeled data, f_lab.

4. BASIC EXPERIMENTS

Before we go on to deal with differences in class proportions between labeled data and unlabeled/test data, it is useful to give some basic experimental results that help in understanding the methods.

4.1 Learning Curves

Figure 1 shows the learning curves (variation of performance with n_lab) for various methods on four datasets. All methods use the value f = f_actual.

Figure 1: Learning curves (F score versus LabSize) for gcat, ccat, aut-avn and real-sim. All methods use f = f_actual.

TSVM gives a much stronger performance than the other methods at small values of n_lab. Even at higher values of n_lab it is better than or competitive with the others. Whenever TSVM does significantly better than SVMTh it is a clear indication of the important role played by the unlabeled loss term in (2). The magnitude of the performance improvement given by TSVM over the others depends on the dataset. On gcat and aut-avn TSVM is very much superior; on ccat it is quite a bit better than the others; and on real-sim its performance matches that of SVMTh. After TSVM, SVMTh is the next best performer, followed by EC, ER and LSVM, in that order.

4.2 Sensitivity Analysis

Since our aim is to deal with situations in which f_actual is unknown, it is useful to understand how the performance of each method is affected when f (the value of the positive class fraction used in (1)-(5)) is changed away from f_actual. Figure 2 describes this sensitivity for n_lab = 9 and n_lab = 44. We chose these n_lab values as they represent the rising and stable parts of the learning curves for most (method, dataset) combinations. Let us first make some observations for n_lab = 9.
All methods suffer when f is too much lower/higher than f_actual. If f is sufficiently far from f_actual then the performance can degrade to be even worse than that of LSVM. The fall in performance on the lower side is sharper, since recall is badly affected when f is decreased from f_actual and this has an adverse effect on the F score. TSVM and SVMTh attain their peak performance at f values that are close to f_actual. On the other hand, ER and EC attain their best performance at f values that are quite shifted away from f_actual. Consistently, this shifting away from f_actual is such that the fraction of the minority class is larger than its actual value in the unlabeled/test distribution; in the top row of Figure 2, note the left shift of the peak for aut-avn and the right shifts for all the other datasets.

Figure 2: Sensitivity of the methods with respect to f, on gcat, ccat, aut-avn and real-sim. The top row is for n_lab = 9 and the bottom row is for n_lab = 44.

This is due to two properties associated with LSVM, of which EC and ER are modifications. First, when n_lab is small, the direction of the LSVM classifier makes a large angle with that of the ideal SVM classifier built using a large labeled training set. So the LSVM classifier boundary cuts through dense regions of the test set. Second, this cutting is more severe on the minority class examples due to their lesser representation. (This can be explained as follows. Take the case where the positive class is the minority class. Since the positive class has a smaller number of loss terms (see (1)), the classifier boundary of LSVM is placed closer to the labeled examples of the positive class in order to reduce the total loss on the labeled examples of the negative class. Therefore the fraction of examples classified by LSVM as positive is less than f_actual.) Thus, LSVM (which optimizes T_sup) ends up choosing a classifier boundary placement in which many test examples of the minority class are pushed to the wrong side. See the bottom left plot of Figure 8 (which comes later in the paper) to confirm this for the gcat dataset. Therefore, to pull the classifier towards satisfying the actual fraction and hence do well on the F score, the EC and ER methods need to use an f value (see (4) and (5)) that is higher (lower) than f_actual when the minority class is the positive (negative) class. The f value that is needed by ER to get the best F score on the test set is farther from f_actual than the value needed by EC, because ER only loosely enforces the constraint. SVMTh behaves differently because it doesn't try to balance T_sup optimization against the fraction constraint; it simply enforces the fraction constraint in post-processing without worrying about what happens to T_sup in that step.

When n_lab is large the behavior is different. The bottom row of Figure 2 gives the sensitivity for LabSize=44. LSVM gives a very good performance since the labeled data set is sufficiently large. The performance of ER is close to that of LSVM (see the green continuous and black dotted lines in the bottom row of Figure 2) because the ρ value (5) used in (4) is not strong enough to exert the fraction constraint. TSVM, SVMTh and EC suffer when f is moved well away from f_actual because they enforce the fraction constraint.

The observations made above imply that, for EC and ER, it is a good idea to employ an f value that is different from f_actual when n_lab is small. This observation is missing in previous works on ER [12]. If the class proportions in the labeled, unlabeled and test data are all matching and n_lab is not too small, then it is a good idea to use cross validation to choose f to optimize performance. If n_lab is not large, the Leave-One-Out method can be used for cross validation. However, we do not go ahead and demonstrate this in this paper, since the main focus of the paper is to suggest ways of handling a mismatch between the class proportions in the labeled data and the actual distribution of interest.

Figure 3 gives the sensitivity plots for the earn dataset. The experimental setup is close to that used by Zhang and Oles [15], who used the setup to point out (wrongly) that TSVM does not work well.
Zhang and Oles [15] possibly optimized the TSVM objective function in (2) without the fraction constraint. Figure 3 explains what can go wrong if that is done. From the right side plot of Figure 3 it is clear that the least value of the objective function occurs at an end value of f, where the performance is very poor. This clearly shows the important role played by the fraction constraint in TSVM. Recall a similar comment that we made earlier, at the end of subsection 2.4, with respect to MaxEnt models and entropy regularization.

5. DISTORTION IN CLASS PROPORTION

The results of section 4 concern the case in which the labeled data and the test/unlabeled data have the same class proportion. We now consider situations in which the class proportions in the two sets are different. Sometimes it happens that, even though there is a difference in those class proportions, f_actual may actually be known. Such knowledge helps the methods to achieve a good performance. In subsection 5.1 we take up this case. In subsection 5.2 we deal with the more difficult case in which f_actual is unknown.

Figure 3: Earn dataset: sensitivities of the methods with respect to f with just 2 labeled examples. The left plot shows the performance of the various methods. The right plot shows TSVM performance (blue) and the TSVM objective function in equation (2) (magenta); both are normalized to a max value of 1 to show them on the same plot. When f_actual is known, TSVM gives an F score of 0.93.

Figure 4: Demonstration that the cost adjusted labeled loss does not change performance significantly (gcat, n_lab = 9).

5.1 Case 1: f_actual is known

Since we know f_actual as well as the numbers of positive and negative labeled examples in the labeled set, it is appropriate to modify the SVM objective function T_sup so that the labeled loss term can be viewed as if it were the mean loss of labeled data picked (with replacement) from the actual distribution of interest. To do this we change T_sup to T_sup^Adj by introducing γ, a relative weighting parameter for the positive labeled examples. T_sup^Adj is given by

    T_sup^Adj = (λ/2) ||w||^2 + (1/N) ( γ Σ_{i ≤ l, y_i = 1} L(y_i, o_i) + Σ_{i ≤ l, y_i = −1} L(y_i, o_i) )        (6)

where γ is chosen so that γ f_lab / (γ f_lab + (1 − f_lab)) = f_actual, with f_lab = n_p/(n_p + n_n), n_p and n_n being the numbers of positive and negative examples in the labeled data, and N = n_p γ + n_n is the normalizer chosen to make sure that the labeled loss term is the mean loss. Note that when f_lab < f_actual, γ is bigger than 1, and so the weight on the positive labeled examples is increased. When f_lab > f_actual, γ is smaller than 1.
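For concreteness, the weighting in (6) can be computed as in the small sketch below; the resulting γ could equivalently be passed to a solver's per-class weights (e.g., class_weight in scikit-learn, which the paper does not mention).

    # Compute gamma and the normalizer N of (6) from the labeled counts
    # n_p, n_n and the known f_actual, by solving
    #   gamma * f_lab / (gamma * f_lab + (1 - f_lab)) = f_actual.
    def cost_adjust(n_p, n_n, f_actual):
        f_lab = n_p / (n_p + n_n)
        gamma = f_actual * (1.0 - f_lab) / ((1.0 - f_actual) * f_lab)
        return gamma, n_p * gamma + n_n   # (gamma, N)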
All the methods can be appropriately modified to use T_sup^Adj instead of T_sup. However, we found that the performance of the methods is little affected when T_sup^Adj is used instead of T_sup. Figure 4 compares the performance for one dataset, gcat. Similar behavior is seen on the other datasets.

Figure 5: Variation of the performance of the methods with f_lab (LabelFrac) when f_actual is known and the methods use f = f_actual (gcat, ccat, aut-avn and real-sim).

Figure 5 compares all the methods on four datasets when f = f_actual is used by the methods and f_lab is varied over a range of values from 0 to 1. Interestingly, TSVM, SVMTh and EC are only mildly affected by changes in f_lab. This is very much due to f_actual being known and used well by the methods via the fraction constraint with f = f_actual. On the other hand, LSVM is badly affected when f_lab is moved away from f_actual. ER suffers when f_lab < f_actual. The penalty weight ρ used in (4) plays a key role in ER. Figure 6 shows the performance variation for a range of ρ values. When ρ is small the performance of ER is close to that of LSVM. For large ρ its performance is close to that of EC. Both these observations are along expected lines. For different values of f_lab the optimal ρ is different.

5.2 Case 2: f_actual is unknown

We propose and explore some methods for estimating f_actual in the presence of large variations in f_lab. Note that, even when the labeled data is large, it cannot be used alone to get an estimate of f_actual, since the class proportions in that set are distorted. We explore three methods for estimating f_actual. Let us refer to an estimate of f_actual as f_est.

Estimation Method 1. Simply take f_est to be the fraction of positive examples in the labeled data. This is just a baseline method. The next two methods make use of unlabeled data together with labeled data.

Estimation Method 2. Given the labeled data we can solve (1) and obtain the LSVM solution w. From the good performance of SVMTh over a large range of f_lab values in Figure 5, we know that the classifier direction corresponding to w is good.

Figure 6: Variation of ER performance with f_lab (LabelFrac) for various ρ values, on gcat, ccat, aut-avn and real-sim at n_lab = 9. Dashdot: small ρ (gives performance close to LSVM); Continuous: ρ = 5 (same as in Figure 5); Dotted and Dashed: larger ρ values. ER with large ρ yields a performance close to that of EC.

Figure 7: Estimation Method 2. Mean (with error bars) of f_est versus n_lab for gcat. f_actual = 0.3 for gcat.

It is therefore appropriate to model the classifier scoring function o = w^T x applied to the unlabeled data, {o_i = w^T x_i}, as a mixture of two Gaussians:

    p(o_i) = β_1 g(o_i; µ_1, σ_1) + β_2 g(o_i; µ_2, σ_2)        (7)

where g(·; µ_k, σ_k) is the univariate Gaussian density function with mean µ_k and standard deviation σ_k, and β_1, β_2 are (non-negative) mixing proportion values satisfying β_1 + β_2 = 1. By fitting this model to the data {o_i} to maximize the likelihood, we can get the parameter values β_1, β_2, µ_1, µ_2, σ_1 and σ_2. The Gaussian with the larger µ_k represents the positive class; so the corresponding β_k can be taken as f_est, an estimate of the positive class fraction.

Figure 8: Understanding Estimation Method 2. SVM-N denotes the LSVM output for N labeled examples and FullSVM is the output of the SVM corresponding to a very large labeled set.

Figure 8 helps us understand the usefulness of this method. When the number of labeled examples is small, w, the solution of (1), makes a large angle with the solution associated with the ideal SVM classifier built using a large labeled training set having the actual class proportion; so the {o_i = w^T x_i} distribution is not a clean mixture of two Gaussians, and the estimate f_est is also poor. On the other hand, when the number of labeled examples is large (closer to the higher side of the learning curve), this estimation method performs much better. Figure 7 plots the mean (with error bars) of f_est (computed using random splits of gcat) as n_lab is varied. Such a plot can help one decide when this estimation method is useful.
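A minimal sketch of Estimation Method 2, assuming scikit-learn's GaussianMixture for the one-dimensional fit of (7):

    # Fit the two-component mixture (7) to LSVM scores on unlabeled data;
    # the mixing weight of the component with the larger mean is f_est.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def estimate_f_gmm(w, Xu):
        scores = (Xu @ w).reshape(-1, 1)
        gmm = GaussianMixture(n_components=2).fit(scores)
        return gmm.weights_[np.argmax(gmm.means_.ravel())]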
Estimation Method 3. In this method we explore whether some property associated with the solution of a method can be used to find f_est. Let us take TSVM. Figure 9 gives, for gcat and ccat, the variation of the TSVM objective function (see (2)), the Labeled Loss (the second term of T_sup in (1)), as well as the F score, with respect to the f used in (2). The behavior on the other datasets is similar. Clearly the TSVM objective function is not useful for tuning f. Gärtner et al. [8] suggest minimizing the TSVM objective function to tune class proportions, but Figure 9 clearly points out that it is not the right thing to do. The plots also show the goodness of minimizing the Labeled Loss as a way of obtaining f_est. (To avoid confusion, we note that (2) is always the optimization problem that is solved to get the TSVM solution; the minimization of the Labeled Loss that we are doing here is for tuning at a higher level, treating f as a hyperparameter.) A rough explanation for this goodness is as follows. Consider (2), the optimization problem associated with TSVM. End values of f (near 0 and 1) put heavy pressure on placing all examples on one side of the classifier boundary. With such pressure, the Labeled Loss becomes large for such f values. Near f = f_actual, minimizing the Labeled Loss, minimizing the Unlabeled Loss and satisfying the fraction constraint are all consistent, so the Labeled Loss achieves small values there.

We also tried the same estimation method (i.e., minimize the labeled loss along the trajectory of solutions defined by varying f) on the expectation methods. But it did not work well. Figure 10 gives, for EC on gcat and ccat, the variation of the labeled loss as well as the F score as a function of f. The minimizer of the labeled loss does not coincide well with the maximizer of the F score. Similar behavior is observed for SVMTh and ER.
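A sketch of Estimation Method 3 as applied to TSVM, using the hypothetical tsvm_train helper from section 2.2; the grid over f and its resolution are our own illustrative choices.

    # Sweep f over a grid, train a TSVM per value, and return the f whose
    # solution has the least labeled loss (second term of T_sup in (1)).
    import numpy as np

    def estimate_f_labloss(Xl, yl, Xu, grid=None):
        if grid is None:
            grid = np.linspace(0.05, 0.95, 19)
        def labeled_loss(w):
            m = np.maximum(0.0, 1.0 - yl * (Xl @ w))
            return np.mean(m ** 2)
        losses = [labeled_loss(tsvm_train(Xl, yl, Xu, f)) for f in grid]
        return grid[int(np.argmin(losses))]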

Figure 9: Variation of performance (blue), the TSVM objective function in (2) (magenta) and the Labeled Loss (the second term of T_sup in (1)) (cyan) with respect to f, for gcat and ccat. The minimizer of the Labeled Loss is marked by a black circle. The rows correspond to different fraction distortions; the per-panel titles give the corresponding f_lab values (gcat: 0.19 to 0.59; ccat: 0.28 to 0.68). The middle row corresponds to f_lab = f_actual. The bottom two rows correspond to f_lab = 0.8 f_actual and f_lab = 0.6 f_actual. The top two rows correspond to (1 − f_lab) = 0.8 (1 − f_actual) and (1 − f_lab) = 0.6 (1 − f_actual).

Figure 10: EC: Variation of performance (red) and the Labeled Loss (the second term of T_sup in (1)) (cyan) with respect to f, for gcat and ccat. The minimizer of the Labeled Loss is marked by a black circle. The rows correspond to different fraction distortions, in the same pattern as in Figure 9.

For n_lab = 9, Figure 11 shows the performance of Estimation Method 3 with EC and SVMTh.

Figure 12 compares the three estimation methods as applied to TSVM. For n_lab = 9, Estimation Method 2 doesn't do even as well as Estimation Method 1. Estimation Method 3 is very good over a large range of f_lab values containing f_actual. It suffers only at extreme f_lab values. When n_lab is large, e.g., 44, Estimation Method 2 performs very well. It is appropriate to switch between the two estimation methods depending on how large n_lab is. Going by what we saw in Figure 7, we can use the steadiness of the f_est given by Estimation Method 2 over a range of n_lab values to decide when to switch to it. Of course, n_lab has to be reasonably big to notice a clear steady pattern. Until that size is reached, it is best to use Estimation Method 3. When n_lab is large, LSVM itself does quite well; since it is not dependent on f, it becomes the preferred method for that case. Therefore, the value of a fraction estimation method should be determined by how well it performs when the labeled data is small. From that viewpoint, Estimation Method 3 (as applied to TSVM) is the most valuable. Of all (Method, Estimation Method) combinations, (TSVM, Estimation Method 3) is the one that takes the most computing time; even for that combination, the computation time for one run on any of our five datasets is less than two minutes on a 3 GHz machine.

6. RELATED WORK

Since the first crucial paper of Joachims [10], several works (for a sample see [5, 7, 14, 6, 3, 15]) have been published extending and exploring various aspects of the first TSVM model. Of these, the works of Chen, Wang and Dong [7] and Wang and Huang [14] are the most relevant to our work, since they address the problem of mismatch in class proportions. The main drawback of these methods is that they make ad hoc labelings of the unlabeled examples during the training process in an active self-learning mode and are not cleanly tied to a well-defined problem formulation. Due to this, they deviate significantly from the TSVM formulation and hence perform quite a bit worse than TSVMs in the normal situation in which the labeled data and the actual distribution are indeed well matched in terms of class proportions. Also, the methods have been demonstrated only in situations where LSVM itself gives decent performance, which is not the case when there are large distortions in class proportions.

Mann and McCallum [12] empirically analyze ER in binary, multi-class and structured output settings in the context of MaxEnt models. ER's working and performance in SVMs and MaxEnt models are similar. Mann and McCallum [12] do a brief study of the sensitivity of ER's performance with respect to class proportions. But they only focus on the case of distortion towards uniform class proportions; in the binary case this corresponds to changing f from f_actual to 0.5. Unfortunately their study is done only on one multi-class dataset. As we saw in section 4.2, ER shows interesting behavior in the binary case; in fact ER improves in performance when f is moved towards 0.5. SVMTh is an important baseline method that is completely missed in [12]. There is a building literature on dealing with differences in train and test distributions, referred to as covariate shift (see [2] for instance), but such papers introduce and analyze new formulations and do not help to modify TSVM or the expectation methods while keeping their spirit intact.
7. CONCLUSION

The main contributions of this paper are: (i) an empirical analysis of TSVM and the expectation methods and their sensitivity with respect to label proportions; and (ii) the proposal and evaluation of new methods for dealing with mismatches in label proportions between the labeled and test sets. We have also done preliminary experiments to verify the ideas on least squares and MaxEnt models of binary classification and observed behavior similar to the SVM loss. With some care, all the ideas can also be extended to the multi-class setting. The ideas and results of this paper are mainly for web page and text classification. More work is needed to see if they hold on other types of problems.

8. REFERENCES

[1] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.
[2] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In ICML, pages 81-88, 2007.
[3] L. Bruzzone, M. Chi, and M. Marconcini. A novel transductive SVM for semisupervised classification of remote-sensing images. IEEE Transactions on Geoscience and Remote Sensing, 2006.
[4] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2006.
[5] O. Chapelle, V. Sindhwani, and S. S. Keerthi. Optimization techniques for semi-supervised support vector machines. JMLR, 9:203-233, 2008.
[6] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In AISTATS, pages 57-64, 2005.
[7] Y. Chen, G. Wang, and S. Dong. Learning with progressive transductive support vector machine. Pattern Recognition Letters, 24:1845-1855, 2003.
[8] T. Gärtner, Q. V. Le, S. Burton, A. J. Smola, and S. V. N. Vishwanathan. Large-scale multiclass transduction. In NIPS, 2005.
[9] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2004.
[10] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, 1999.
[11] T. Joachims. Training linear SVMs in linear time. In KDD, 2006.
[12] G. S. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning with weakly labeled data. JMLR, 11:955-984, 2010.
[13] V. Sindhwani and S. S. Keerthi. Large scale semi-supervised linear SVMs. In SIGIR, pages 477-484, 2006.
[14] Y. Wang and S.-T. Huang. Training TSVM with the proper number of positive samples. Pattern Recognition Letters, 26, 2005.
[15] T. Zhang and F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. In ICML, pages 1191-1198, 2000.

Figure 11: Performance of Estimation Method 3 as applied to EC (top row) and SVMTh (bottom row), for n_lab = 9 on gcat, ccat, aut-avn and real-sim. Estimation Method 3: continuous. Dashed: use f = f_actual (upper baseline).

Figure 12: Comparison of the f_actual estimation methods as applied to TSVM, for n_lab = 9 (top row) and n_lab = 44 (bottom row) on gcat, ccat, aut-avn and real-sim. Estimation Method 1: dashdot. Estimation Method 2: dotted with x. Estimation Method 3: continuous. Dashed: use f = f_actual (upper baseline).


9.3 Transform Graphs of Linear Functions Use this blank page to compile the most important things you want to remember for cycle 9.

9.3 Transform Graphs of Linear Functions Use this blank page to compile the most important things you want to remember for cycle 9. 9. Transorm Graphs o Linear Functions Use this blank page to compile the most important things you want to remember or cycle 9.: Sec Math In-Sync by Jordan School District, Utah is licensed under a 6 Function

More information

Research on Image Splicing Based on Weighted POISSON Fusion

Research on Image Splicing Based on Weighted POISSON Fusion Research on Image Splicing Based on Weighted POISSO Fusion Dan Li, Ling Yuan*, Song Hu, Zeqi Wang School o Computer Science & Technology HuaZhong University o Science & Technology Wuhan, 430074, China

More information

MATRIX ALGORITHM OF SOLVING GRAPH CUTTING PROBLEM

MATRIX ALGORITHM OF SOLVING GRAPH CUTTING PROBLEM UDC 681.3.06 MATRIX ALGORITHM OF SOLVING GRAPH CUTTING PROBLEM V.K. Pogrebnoy TPU Institute «Cybernetic centre» E-mail: vk@ad.cctpu.edu.ru Matrix algorithm o solving graph cutting problem has been suggested.

More information

Semi-supervised learning and active learning

Semi-supervised learning and active learning Semi-supervised learning and active learning Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Combining classifiers Ensemble learning: a machine learning paradigm where multiple learners

More information

Combining PGMs and Discriminative Models for Upper Body Pose Detection

Combining PGMs and Discriminative Models for Upper Body Pose Detection Combining PGMs and Discriminative Models for Upper Body Pose Detection Gedas Bertasius May 30, 2014 1 Introduction In this project, I utilized probabilistic graphical models together with discriminative

More information

KANGAL REPORT

KANGAL REPORT Individual Penalty Based Constraint handling Using a Hybrid Bi-Objective and Penalty Function Approach Rituparna Datta Kalyanmoy Deb Mechanical Engineering IIT Kanpur, India KANGAL REPORT 2013005 Abstract

More information

Understanding Signal to Noise Ratio and Noise Spectral Density in high speed data converters

Understanding Signal to Noise Ratio and Noise Spectral Density in high speed data converters Understanding Signal to Noise Ratio and Noise Spectral Density in high speed data converters TIPL 4703 Presented by Ken Chan Prepared by Ken Chan 1 Table o Contents What is SNR Deinition o SNR Components

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

HOUGH TRANSFORM CS 6350 C V

HOUGH TRANSFORM CS 6350 C V HOUGH TRANSFORM CS 6350 C V HOUGH TRANSFORM The problem: Given a set of points in 2-D, find if a sub-set of these points, fall on a LINE. Hough Transform One powerful global method for detecting edges

More information

3 Nonlinear Regression

3 Nonlinear Regression CSC 4 / CSC D / CSC C 3 Sometimes linear models are not sufficient to capture the real-world phenomena, and thus nonlinear models are necessary. In regression, all such models will have the same basic

More information

ES 240: Scientific and Engineering Computation. a function f(x) that can be written as a finite series of power functions like

ES 240: Scientific and Engineering Computation. a function f(x) that can be written as a finite series of power functions like Polynomial Deinition a unction () that can be written as a inite series o power unctions like n is a polynomial o order n n ( ) = A polynomial is represented by coeicient vector rom highest power. p=[3-5

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Gene Expression Based Classification using Iterative Transductive Support Vector Machine

Gene Expression Based Classification using Iterative Transductive Support Vector Machine Gene Expression Based Classification using Iterative Transductive Support Vector Machine Hossein Tajari and Hamid Beigy Abstract Support Vector Machine (SVM) is a powerful and flexible learning machine.

More information

Logistic Regression: Probabilistic Interpretation

Logistic Regression: Probabilistic Interpretation Logistic Regression: Probabilistic Interpretation Approximate 0/1 Loss Logistic Regression Adaboost (z) SVM Solution: Approximate 0/1 loss with convex loss ( surrogate loss) 0-1 z = y w x SVM (hinge),

More information

Digital Image Processing. Image Enhancement in the Spatial Domain (Chapter 4)

Digital Image Processing. Image Enhancement in the Spatial Domain (Chapter 4) Digital Image Processing Image Enhancement in the Spatial Domain (Chapter 4) Objective The principal objective o enhancement is to process an images so that the result is more suitable than the original

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Classification Method for Colored Natural Textures Using Gabor Filtering

Classification Method for Colored Natural Textures Using Gabor Filtering Classiication Method or Colored Natural Textures Using Gabor Filtering Leena Lepistö 1, Iivari Kunttu 1, Jorma Autio 2, and Ari Visa 1, 1 Tampere University o Technology Institute o Signal Processing P.

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Semantic-Driven Selection of Printer Color Rendering Intents

Semantic-Driven Selection of Printer Color Rendering Intents Semantic-Driven Selection o Printer Color Rendering Intents Kristyn Falkenstern 1,2, Albrecht Lindner 3, Sabine Süsstrunk 3, and Nicolas Bonnier 1. 1 Océ Print Logic Technologies S.A. (France). 2 Institut

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Support Vector Machines Aurélie Herbelot 2018 Centre for Mind/Brain Sciences University of Trento 1 Support Vector Machines: introduction 2 Support Vector Machines (SVMs) SVMs

More information

A Small-world Network Where All Nodes Have the Same Connectivity, with Application to the Dynamics of Boolean Interacting Automata

A Small-world Network Where All Nodes Have the Same Connectivity, with Application to the Dynamics of Boolean Interacting Automata A Small-world Network Where All Nodes Have the Same Connectivity, with Application to the Dynamics o Boolean Interacting Automata Roberto Serra Department o Statistics, Ca Foscari University, Fondamenta

More information

Semi-supervised Learning

Semi-supervised Learning Semi-supervised Learning Piyush Rai CS5350/6350: Machine Learning November 8, 2011 Semi-supervised Learning Supervised Learning models require labeled data Learning a reliable model usually requires plenty

More information

A SAR IMAGE REGISTRATION METHOD BASED ON SIFT ALGORITHM

A SAR IMAGE REGISTRATION METHOD BASED ON SIFT ALGORITHM A SAR IMAGE REGISTRATION METHOD BASED ON SIFT ALGORITHM W. Lu a,b, X. Yue b,c, Y. Zhao b,c, C. Han b,c, * a College o Resources and Environment, University o Chinese Academy o Sciences, Beijing, 100149,

More information

Revision of the SolidWorks Variable Pressure Simulation Tutorial J.E. Akin, Rice University, Mechanical Engineering. Introduction

Revision of the SolidWorks Variable Pressure Simulation Tutorial J.E. Akin, Rice University, Mechanical Engineering. Introduction Revision of the SolidWorks Variable Pressure Simulation Tutorial J.E. Akin, Rice University, Mechanical Engineering Introduction A SolidWorks simulation tutorial is just intended to illustrate where to

More information

A Cylindrical Surface Model to Rectify the Bound Document Image

A Cylindrical Surface Model to Rectify the Bound Document Image A Cylindrical Surace Model to Rectiy the Bound Document Image Huaigu Cao, Xiaoqing Ding, Changsong Liu Department o Electronic Engineering, Tsinghua University State Key Laboratory o Intelligent Technology

More information

PATH PLANNING OF UNMANNED AERIAL VEHICLE USING DUBINS GEOMETRY WITH AN OBSTACLE

PATH PLANNING OF UNMANNED AERIAL VEHICLE USING DUBINS GEOMETRY WITH AN OBSTACLE Proceeding o International Conerence On Research, Implementation And Education O Mathematics And Sciences 2015, Yogyakarta State University, 17-19 May 2015 PATH PLANNING OF UNMANNED AERIAL VEHICLE USING

More information

Application of Six Sigma Robust Optimization in Sheet Metal Forming

Application of Six Sigma Robust Optimization in Sheet Metal Forming Application o Six Sigma Robust Optimization in Sheet Metal Forming Y. Q. Li, Z. S. Cui, X. Y. Ruan, D. J. Zhang National Die and Mold CAD Engineering Research Center, Shanghai Jiao Tong University, Shanghai,

More information

9. Reviewing Printed Circuit Board Schematics with the Quartus II Software

9. Reviewing Printed Circuit Board Schematics with the Quartus II Software November 2012 QII52019-12.1.0 9. Reviewing Printed Circuit Board Schematics with the Quartus II Sotware QII52019-12.1.0 This chapter provides guidelines or reviewing printed circuit board (PCB) schematics

More information

THE preceding chapters were all devoted to the analysis of images and signals which

THE preceding chapters were all devoted to the analysis of images and signals which Chapter 5 Segmentation of Color, Texture, and Orientation Images THE preceding chapters were all devoted to the analysis of images and signals which take values in IR. It is often necessary, however, to

More information

Logistic Regression. Abstract

Logistic Regression. Abstract Logistic Regression Tsung-Yi Lin, Chen-Yu Lee Department of Electrical and Computer Engineering University of California, San Diego {tsl008, chl60}@ucsd.edu January 4, 013 Abstract Logistic regression

More information

Automatic Video Segmentation for Czech TV Broadcast Transcription

Automatic Video Segmentation for Czech TV Broadcast Transcription Automatic Video Segmentation or Czech TV Broadcast Transcription Jose Chaloupka Laboratory o Computer Speech Processing, Institute o Inormation Technology and Electronics Technical University o Liberec

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Chapter 3: Supervised Learning

Chapter 3: Supervised Learning Chapter 3: Supervised Learning Road Map Basic concepts Evaluation of classifiers Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Summary 2 An example

More information

A Cartesian Grid Method with Transient Anisotropic Adaptation

A Cartesian Grid Method with Transient Anisotropic Adaptation Journal o Computational Physics 179, 469 494 (2002) doi:10.1006/jcph.2002.7067 A Cartesian Grid Method with Transient Anisotropic Adaptation F. E. Ham, F. S. Lien, and A. B. Strong Department o Mechanical

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

Lab 9 - GEOMETRICAL OPTICS

Lab 9 - GEOMETRICAL OPTICS 161 Name Date Partners Lab 9 - GEOMETRICAL OPTICS OBJECTIVES Optics, developed in us through study, teaches us to see - Paul Cezanne Image rom www.weidemyr.com To examine Snell s Law To observe total internal

More information

Gesture Recognition using a Probabilistic Framework for Pose Matching

Gesture Recognition using a Probabilistic Framework for Pose Matching Gesture Recognition using a Probabilistic Framework or Pose Matching Ahmed Elgammal Vinay Shet Yaser Yacoob Larry S. Davis Computer Vision Laboratory University o Maryland College Park MD 20742 USA elgammalvinayyaserlsd

More information

Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging

Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging 1 CS 9 Final Project Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging Feiyu Chen Department of Electrical Engineering ABSTRACT Subject motion is a significant

More information

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010 INFORMATICS SEMINAR SEPT. 27 & OCT. 4, 2010 Introduction to Semi-Supervised Learning Review 2 Overview Citation X. Zhu and A.B. Goldberg, Introduction to Semi- Supervised Learning, Morgan & Claypool Publishers,

More information

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Ensemble methods in machine learning. Example. Neural networks. Neural networks Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you

More information

Fast or furious? - User analysis of SF Express Inc

Fast or furious? - User analysis of SF Express Inc CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

More information

L ENSES. Lenses Spherical refracting surfaces. n 1 n 2

L ENSES. Lenses Spherical refracting surfaces. n 1 n 2 Lenses 2 L ENSES 2. Sherical reracting suraces In order to start discussing lenses uantitatively, it is useul to consider a simle sherical surace, as shown in Fig. 2.. Our lens is a semi-ininte rod with

More information

Proportional-Share Scheduling for Distributed Storage Systems

Proportional-Share Scheduling for Distributed Storage Systems Proportional-Share Scheduling or Distributed Storage Systems Abstract Yin Wang University o Michigan yinw@eecs.umich.edu Fully distributed storage systems have gained popularity in the past ew years because

More information

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection Hyunghoon Cho and David Wu December 10, 2010 1 Introduction Given its performance in recent years' PASCAL Visual

More information