BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

SHI-LIANG SUN, HONG-LEI SHI

Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, P. R. China
E-MAIL: slsun@cs.ecnu.edu.cn, lhshi12@gmail.com

Abstract: This paper presents a new multi-source domain adaptation framework based on the Bayesian learning principle (BayesMSDA), in which one target domain and more than one source domain are used. In this framework, the label of a target data point is determined according to its posterior, which is calculated using the Bayes formula. To realize this framework, a novel prior of the target domain based on the Laplacian matrix and a new likelihood dynamically obtained from the k-nearest neighbors of a data point are defined. We focus on the situation in which there are no labeled data from the target domain while there are large numbers of labeled data from the source domains. Experiments on synthetic and real-world data illustrate that our framework performs well.

Keywords: Bayesian framework; multi-source domain adaptation; Laplacian matrix

1. Introduction

Most theoretical models in machine learning, such as probably approximately correct (PAC) models, assume that models are trained and tested on data drawn from certain fixed distributions. Uniform convergence theory guarantees that a model's empirical training error is close to its true error under such assumptions. In practice, however, the assumption that training and test data come from the same distribution often does not hold, because these two kinds of data usually come from different distributions, or domains. In such cases there is no guarantee of good generalization. We wish to learn a model on one or more source domains (i.e., domains from which the training data come), and then apply it to a different target domain (i.e., the domain from which the test data come). Models of this kind are called domain adaptation models [1], [2]. This problem arises in many fields, such as sentiment analysis [3], [4], [5], natural language processing [6], and computer vision.
Often in these cases, the source domains offer large numbers of labeled data for learning, while the target domains may have no labeled data available. In this paper, we concentrate on the situation in which there are no labeled data in the target domain, and there is only one target domain. The task is to combine the labeled source data and unlabeled target data to classify the target data as accurately as possible.

The multi-source domain adaptation problem considered in this paper has been studied by many researchers [7], [8], [9], [10]. Crammer et al. [7] considered the problem of learning an accurate model from nearby data points drawn from more than one source domain. They gave a general algorithm as follows: using samples from the different source domains to estimate the divergence among the sources, the algorithm determines which samples from each source should be selected to train the model. Thus a subset that best suits the target task is chosen from each source. The algorithm was demonstrated to be effective on a binary classification task. Tu and Sun [8] gave an ensemble learning framework for domain adaptation. They presented a novel ensemble-based method that dynamically assigns weights to different test examples by using so-called friendly classifiers, giving the most favorable weights to different examples. Mansour et al. [9] presented a theoretical analysis of domain adaptation learning with multiple sources. They gave a combination of the source hypotheses weighted according to the source distributions, and showed that for any fixed target function there exists a distribution-weighted combining rule with loss at most ε.

An interesting issue in multi-source domain adaptation is what to do if we do not know in advance which domain performs best. On one hand, we want to use the most suitable source to solve the target task; on the other hand, we do not know which one to choose. This paper gives a tradeoff for this problem. In this paper, we present a multi-source domain adaptation framework based on the Bayesian learning principle (BayesMSDA). Under the Bayesian framework, classification decisions are based on posterior probabilities, which are proportional to the product of the priors and the likelihoods. In the BayesMSDA framework we define a novel prior using the Laplacian matrix [11], [12], and a novel likelihood based on the mean Euclidean distance of the k nearest points.

The remainder of this paper is organized as follows. The new framework and its implementation for multi-source domain adaptation are introduced in detail in Section 2. In Section 3, experiments on synthetic and real-world data illustrate the effectiveness of our framework. Section 4 gives conclusions and future work.

2. The proposed framework

The proposed framework for multi-source domain adaptation (BayesMSDA) is based on the Bayesian learning principle: the probability that a target example belongs to a given class is proportional to the product of the prior and the likelihood assigned to this example. For multi-source problems, the core question is how to combine the sources effectively to solve the target task. The framework is described as follows. Given M source domains S_i, i = 1, 2, ..., M, and one target domain T, the task is to label the data in T, using the unlabeled data in T and the large numbers of labeled data in the S_i. Assume that we can obtain M classifiers c_i based on the M source domains S_i. We then define a prior, which measures the fitness between a source domain and the target domain, and a likelihood for each target data point, which represents the probability of that point occurring in the source. Applying the Bayesian learning principle to obtain a posterior for classification, we use these posteriors to weight the M classifiers c_i and produce a final label for each target data point. The framework, with its self-defined prior and likelihood, is applicable to the situation in which the target-domain data are unlabeled, and is compared with the majority voting algorithm.

2.1.
Prior

Consider a weighted undirected graph G = (V, E), with data set V = (x_1, x_2, ..., x_n) and edges E = (e_1, e_2, ..., e_l). Assume that G is connected (if not, the following process can be applied to each connected component). Let Y = (y_1, y_2, ..., y_n) be the image of the data set under a certain mapping rule. The problem is how to make the images y_i and y_j as close as possible when the data points x_i and x_j are close. A reasonable criterion is to minimize the objective function

    G = \sum_{i,j} (y_i - y_j)^2 W_{ij}    (1)

under appropriate constraints, where W_{ij} = e^{-||x_i - x_j||^2 / T} when x_i and x_j are neighbors, and zero otherwise. Equation (1) imposes a heavy penalty when the images of neighboring points x_i and x_j are far apart. Minimizing it thus ensures that y_i and y_j are close whenever x_i and x_j are close. This property can be exploited effectively in binary classification problems.

The prior gives a measurement of fitness when a source classifier is applied to the target task; for a sample x_i in the target domain, it should be independent of x_i. In this paper, we construct the prior with the Laplacian matrix [11], [12] of the target domain, using the target-domain data, which are all unlabeled, to quantify it. For any Y, the objective function becomes

    G = \sum_{i,j} (y_i - y_j)^2 W_{ij}
      = \sum_{i,j} (y_i^2 + y_j^2 - 2 y_i y_j) W_{ij}
      = \sum_i y_i^2 D_{ii} + \sum_j y_j^2 D_{jj} - 2 \sum_{i,j} y_i y_j W_{ij}
      = 2 Y^T L Y,

where L = D - W is the Laplacian matrix. Notice that W_{ij} is symmetric and that D, with D_{ii} = \sum_j W_{ij}, is a diagonal matrix. D provides a natural measure on the vertices of the graph G: the bigger the value D_{ii} (corresponding to the i-th vertex), the more important the vertex.

Given n points x_i ∈ R^d, i = 1, 2, ..., n, we construct an undirected graph G = (V, E) in which the neighbors of x_i are its k nearest neighbors. The steps for generating the Laplacian matrix of the target data set are as follows.

Step 1 (calculating the adjacency matrix A): if x_i and x_j are k-nearest neighbors, let A_{ij} = A_{ji} = 1; otherwise A_{ij} = A_{ji} = 0.
Step 2 (calculating the weight matrix W): one choice is the heat kernel

    W_{ij} = e^{-||x_i - x_j||^2 / T},    (2)
where T ∈ R.

Step 3 (calculating the Laplacian matrix L): let D_{ii} = \sum_j W_{ij}; then L = D - W is the Laplacian matrix, which is symmetric and positive semidefinite.

Once we have the Laplacian matrix, the prior is defined as

    prior_m = 1 / (\sum_{i,j} (y_i^m - y_j^m)^2 W_{ij}) = 1 / (2 (Y^m)^T L Y^m),    (3)

where Y^m is the output of the m-th source classifier. In the multi-source case, the different priors of the sources reflect the fitness between each source domain and the target domain: the bigger the prior, the better the corresponding source classifier.

2.2. Likelihood

The likelihood we define here represents the probability of an instance from the target domain occurring in a source domain, which can also be read as the similarity between the target domain and the source domain. The higher this probability, the better the source classifier. In this paper we use the mean Euclidean distance from an instance x_i (from the target domain) to its K nearest neighbors in the source domain to measure this likelihood. For each instance x_i in the target domain, the likelihoods in the different source domains are different, which gives a dynamic classification rule. The likelihood that the target data point x_i occurs in the source domain S_m is defined as the inverse of this mean distance:

    Like_i^m = K / \sum_{j=1}^{K} ||x_i - x_j^m||,    (4)

where x_j^m, which comes from the m-th source, is among the K nearest neighbors of x_i.

According to the Bayesian learning principle, the posterior is proportional to the product of the prior and the likelihood:

    post_i^m ∝ prior_m · Like_i^m,    (5)

where post_i^m is the posterior of x_i based on the m-th source domain, prior_m is the prior based on the m-th source domain, and Like_i^m is the likelihood of x_i based on the m-th source domain. The posteriors obtained here are used to weight the source classifiers.

3. Experiments

In this section, we evaluate the proposed framework through experiments on both synthetic and real-world data sets. Each data set has four domains. In the experiments, every domain is treated as the target domain in turn while the other three serve as source domains.
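The whole pipeline of Section 2 (k-NN graph, heat-kernel weights, Laplacian prior, inverse-distance likelihood, posterior-weighted combination) can be sketched in a few NumPy functions. This is a minimal illustration rather than the authors' implementation: the function names, the {-1, +1} label convention, and the small epsilon guarding against a zero quadratic form in Eq. (3) are our own assumptions.

```python
import numpy as np

def knn_laplacian(X, k=5, T=1.0):
    """Graph Laplacian L = D - W of a k-NN graph with heat-kernel
    weights (Steps 1-3 of Section 2.1)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    A = np.zeros((n, n))
    for i in range(n):
        A[i, np.argsort(d2[i])[1:k + 1]] = 1  # Step 1: k nearest, self excluded
    A = np.maximum(A, A.T)                    # symmetrize: A_ij = A_ji
    W = A * np.exp(-d2 / T)                   # Step 2: heat kernel (Eq. 2)
    return np.diag(W.sum(axis=1)) - W         # Step 3: L = D - W

def prior(Y_m, L):
    """Eq. (3): prior_m = 1 / (2 (Y^m)^T L Y^m); the epsilon avoids
    division by zero when Y^m is constant (our own safeguard)."""
    return 1.0 / (2.0 * Y_m @ L @ Y_m + 1e-12)

def likelihood(x, S_m, K=3):
    """Eq. (4): K over the summed distances to the K nearest source
    points, i.e. the inverse mean Euclidean distance."""
    d = np.sort(np.linalg.norm(S_m - x, axis=1))[:K]
    return K / d.sum()

def classify(x, X_target, sources, classifiers, k=5, K=3, T=1.0):
    """Label x in {-1, +1} by weighting each source classifier's vote
    with its posterior (Eq. 5)."""
    L = knn_laplacian(X_target, k, T)
    score = 0.0
    for S_m, c_m in zip(sources, classifiers):
        Y_m = np.array([c_m(z) for z in X_target], dtype=float)
        score += prior(Y_m, L) * likelihood(x, S_m, K) * c_m(x)
    return 1 if score >= 0 else -1
```

Note that because the posterior weights depend on x through the likelihood, each target point effectively gets its own combination of the source classifiers, which is the dynamic behavior described above.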
We use support vector machines (SVMs) [13], [14] for training and testing. For text classification, SVMs have been found to outperform other classification methods [13], especially for sentiment analysis [3]. We use the RBF kernel, and the parameters of the SVMs are selected by cross-validation for each domain.

3.1. Synthetic data

[Figure 1 appears here.] Figure 1. Examples of synthetic data: the four symbols stand for domains a, b, c, and d, respectively. The smaller points (i.e., the bottom-left portion of each domain in the figure) are positive, and the larger ones (i.e., the top-right portion of each domain in the figure) are negative. Each domain is treated as the target in turn, while the other three serve as the sources.

The synthetic data set consists of four different domains, each sampled from a Gaussian distribution with its own covariance and mean. Figure 1 shows 30 randomly selected data points from each domain; the different symbols stand for the different domains a, b, c, and d. The smaller points (i.e., the bottom-left portion of each domain in Figure 1) are labeled as positive, while the bigger
TABLE 1. Accuracies of the one-to-one classifiers (T: target domain, S: source domain).

    T \ S   B(%)    D(%)    E(%)    K(%)
    B       -       79.8    68.7    65.5
    D       78.8    -       69.6    69.3
    E       65.0    69.2    -       78.9
    K       63.6    70.6    81.4    -

Figure 2. Classification accuracies (%) on synthetic data.

ones (i.e., the top-right portion of each domain in Figure 1) are labeled as negative. The base classifiers are trained using SVMs with the RBF kernel. Figure 2 illustrates the classification accuracies on these domains using BayesMSDA and the majority voting algorithm; each domain is treated as the target in turn, while the other three serve as the sources. Figure 2 shows that BayesMSDA outperforms the majority voting method on three domains (a, b, d), while on the remaining domain (c) the two are equal. Notably, the accuracy of BayesMSDA is far higher than that of the voting method on the fourth data set (98.77% vs. 69.69%). As the four data sets are randomly generated, these results give us confidence in the effectiveness of the proposed framework. The tie on domain c is acceptable, since no single algorithm can fit every situation. In the next subsection, we apply BayesMSDA to real-world data.

3.2. Real data

Given a piece of text, sentiment classification is the task of determining whether the sentiment expressed by the text is positive or negative. This problem has spread to many new domains, such as stock message boards, congressional floor debates, and blog reviews. Research results have been used to gauge market reaction and to summarize opinion from web pages, discussion boards, and blogs.

Figure 3. Classification accuracies (%) on real-world data.

We use the publicly available data sets from the Amazon website (http://www.cs.jhu.edu/~mdredze/) in our experiments [1], which contain reviews of several different types of products. We select four domains: books, DVDs, kitchen & housewares, and electronics (B, D, K, and E for short, respectively). Each review consists of a rating (1-5 stars), a title, the review text, and some other information, which we ignore.
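The one-to-one baselines used throughout the experiments, RBF-kernel SVMs with parameters chosen by cross-validation, can be sketched with scikit-learn. The use of scikit-learn, the grid-search helper, and the particular C and gamma grids are our illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_base_classifier(X_src, y_src):
    """RBF-kernel SVM for one source domain, with C and gamma selected
    by 5-fold cross-validation (grid values are illustrative)."""
    grid = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01, 0.1]},
        cv=5,
    )
    grid.fit(X_src, y_src)       # cross-validation picks the best (C, gamma)
    return grid.best_estimator_  # refit on all source data
```

In the multi-source setting, one such classifier would be trained per source domain and the resulting models combined with the posterior weights of Section 2.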
We make this a binary classification task by binning reviews with 4-5 stars as positive and reviews with 1-2 stars as negative, while reviews with 3 stars are discarded. As the vocabularies of reviews for different products vary greatly, a classifier trained on one domain may not fit a different domain, because important lexical information may be missing. This phenomenon motivates us to combine more than one source domain to avoid these shortcomings. Every domain contains 1000 positive reviews (P) and 1000 negative reviews (N). In our experiments we randomly choose 1000 of these 2000 instances in each domain for computational convenience, so we have 1000 × 4 instances in all. Each instance is represented as a sparse feature vector. The feature set consists of the unigrams that occur 5 to 1000 times in all the
reviews.

First, one-to-one classifiers are trained without adaptation and regarded as baselines. The baselines are trained using SVMs with the RBF kernel, and cross-validation is again employed to select the parameters. Classification accuracies are reported in TABLE 1, where the first row lists the source domains. In the adaptation stage, each of the four domains (B, D, K, E) is treated as the target domain in turn, while the others serve as source domains. The one-to-one baselines are used here for adaptation. Figure 3 shows that the BayesMSDA framework proposed in this paper gives encouraging results for binary classification, beating the majority voting method on three domains (B, E, K). Comparing TABLE 1 and Figure 3, we can conclude that BayesMSDA is a tradeoff between the best and the worst one-to-one classifiers. Because no labeled data are available in the target domain, we do not know in advance which one-to-one classifier is the best and which is the worst, and choosing a classifier at random is not acceptable. Our method is a better way to obtain a reasonable result: even on domain D, BayesMSDA outperforms two of the three baselines (74.1% vs. 69.6% and 69.3%), losing only to the best one.

4. Conclusions and future work

In this paper, a new Bayesian framework for multi-source domain adaptation (BayesMSDA) is proposed. We focus on the case in which there are plenty of labeled data in the source domains but no labeled data are available in the target domain. Our experimental results show that BayesMSDA is a good choice in this setting. It is also interesting to consider the situation in which some labeled instances are available in the target domain; we will study this problem in future work.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Project 61075005, and the Shanghai Knowledge Service Platform Project (No. ZF1213).

References

[1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J.W.
Vaughan, A Theory of Learning from Different Domains, Machine Learning, Vol. 79, pp. 151-175, 2010.
[2] W. Tu and S. Sun, Transferable Discriminative Dimensionality Reduction, Proceedings of the 23rd IEEE International Conference on Tools with Artificial Intelligence, pp. 865-868, Nov. 2011.
[3] B. Pang, L. Lee, and S. Vaithyanathan, Thumbs Up? Sentiment Classification Using Machine Learning Techniques, Proceedings of Empirical Methods in Natural Language Processing, Vol. 10, pp. 79-86, 2002.
[4] J. Blitzer, M. Dredze, and F. Pereira, Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification, Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 440-447, Jun. 2007.
[5] A. Aue and M. Gamon, Customizing Sentiment Classifiers to New Domains: A Case Study, Proceedings of Recent Advances in Natural Language Processing, 2005.
[6] J. Jiang and C. Zhai, Instance Weighting for Domain Adaptation in NLP, Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 264-271, Jun. 2007.
[7] K. Crammer, M. Kearns, and J. Wortman, Learning from Multiple Sources, Journal of Machine Learning Research, Vol. 9, pp. 1757-1774, Jun. 2008.
[8] W. Tu and S. Sun, Dynamical Ensemble Learning with Model-Friendly Classifiers for Domain Adaptation, Proceedings of the 21st International Conference on Pattern Recognition, pp. 1181-1184, Nov. 2012.
[9] Y. Mansour, M. Mohri, and A. Rostamizadeh, Domain Adaptation with Multiple Sources, Advances in Neural Information Processing Systems, Vol. 21, pp. 1041-1048, 2008.
[10] Y. Mansour, M. Mohri, and A. Rostamizadeh, Domain Adaptation: Learning Bounds and Algorithms, Proceedings of the Conference on Learning Theory, Jun. 2009.
[11] M. Belkin and P. Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Neural Computation, Vol. 15, pp. 1373-1396, Jun. 2003.
[12] S. Sun, Multi-view Laplacian Support Vector Machines, Lecture Notes in Artificial Intelligence, Vol. 7121, pp. 209-222, Dec. 2011.
[13] T.
Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proceedings of the European Conference on Machine Learning, pp. 137-142, Apr. 1998.
[14] J. Shawe-Taylor and S. Sun, A Review of Optimization Methodologies in Support Vector Machines, Neurocomputing, Vol. 74, pp. 3609-3618, Oct. 2011.