Adaptive Transfer Learning


Bin Cao, Sinno Jialin Pan, Yu Zhang, Dit-Yan Yeung, Qiang Yang
Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
{caobin,sinnopan,zhangyu,dyyeung,qyang}@cse.ust.hk

Abstract

Transfer learning aims at reusing the knowledge in some source tasks to improve the learning of a target task. Many transfer learning methods assume that the source tasks and the target task are related, even though many tasks are not related in reality. However, when two tasks are unrelated, the knowledge extracted from a source task may not help, and may even hurt, the performance on a target task. Thus, how to avoid negative transfer and ensure a safe transfer of knowledge is crucial in transfer learning. In this paper, we propose an Adaptive Transfer learning algorithm based on Gaussian Processes (AT-GP), which can be used to adapt the transfer learning scheme by automatically estimating the similarity between a source and a target task. The main contribution of our work is a new semi-parametric transfer kernel for transfer learning, derived from a Bayesian perspective, together with the proposal to learn the model with respect to the target task only, rather than all tasks as in multi-task learning. We formulate the transfer learning problem as a unified Gaussian Process (GP) model. The adaptive transfer ability of our approach is verified on both synthetic and real-world datasets.

Introduction

Transfer learning (or inductive transfer) aims at transferring the shared knowledge from one task to other related tasks. In many real-world applications, we expect to reduce the labeling effort for a new task (referred to as the target task) by transferring knowledge from one or more related tasks (source tasks) which have plenty of labeled data. Usually, the accomplishment of transfer learning is based on certain assumptions and the corresponding transfer schemes. For example, (Lawrence and Platt 2004; Schwaighofer, Tresp, and Yu 2005; Raina, Ng, and Koller 2006; Lee et al. 2007) assume that related tasks should share some (hyper-)parameters. By discovering the shared (hyper-)parameters, the knowledge can be transferred across tasks. Other algorithms, such as (Dai et al. 2007; Raina et al. 2007), assume that some instances or features can be used as a bridge for knowledge transfer. If these assumptions fail to hold, however, the transfer may be insufficient or unsuccessful. In the worst case, it may even hurt the performance, which is referred to as negative transfer (Rosenstein and Dietterich 2005). Since it is not trivial to verify which assumptions hold for real-world tasks, we are interested in pursuing an adaptive transfer learning algorithm which can automatically adapt its transfer scheme to different scenarios and thus avoid negative transfer. We expect an adaptive transfer learning algorithm to demonstrate at least the following properties:

- The shared knowledge between tasks should be transferred as much as possible when the tasks are related. In the extreme case where they are exactly the same task, the performance of the adaptive transfer learning algorithm should be as good as when the problem is treated as a single-task problem.

- Negative transfer should be avoided as much as possible when the tasks are unrelated. In the extreme case where the tasks are totally unrelated, the performance of the adaptive transfer learning algorithm should be no worse than that of the non-transfer-learning baselines.

Two basic transfer-learning schemes can be constructed based on the above requirements.
One is the no transfer scheme, which discards the data in the source task when training a model for the target task. This would be the best scheme when the source and target tasks are not related at all. The other is the transfer all scheme, which treats the data in the source task as if they were drawn from the target task. This would be the best scheme when the source and target tasks are exactly the same. What we wish to obtain is an adaptive scheme that is always no worse than these two schemes. However, among the many transfer learning algorithms that have been proposed, a mechanism that automatically adjusts the transfer scheme to achieve this has been lacking. In this paper, we address the problem of constructing an adaptive transfer learning algorithm that satisfies both properties mentioned above. We propose an Adaptive Transfer learning algorithm based on Gaussian Processes (AT-GP) to achieve the goal of adaptive transfer.

Advantages of Gaussian process methods include that the priors and hyper-parameters of the trained models are easy to interpret, and that variances of the predictions can be provided. Different from previous work on transfer learning and multi-task learning with GPs, which is based either on transfer through shared parameters (Lawrence and Platt 2004; Yu, Tresp, and Schwaighofer 2005; Schwaighofer, Tresp, and Yu 2005) or on a shared representation of instances (Raina et al. 2007), the model proposed in this paper can automatically learn the transfer scheme from the data. Our key idea is to learn a transfer kernel that models the correlation of the outputs when the inputs come from different tasks, which can be regarded as a measure of similarity between tasks. What to transfer is then based on how similar the source task is to the target task. On one hand, if the tasks are very similar, knowledge is transferred from the source data and the learning performance tends to that of the transfer all scheme in the extreme case. On the other hand, if the tasks are not similar, the model transfers only the prior information on the parameters, approximating the no transfer scheme. Since we have very few labeled data for the target task, we use a Bayesian estimate of the task similarity rather than a point estimate (Gelman et al. 2003). A significant difference between our problem and multi-task learning is that we only care about the target task rather than all tasks, which is a very natural scenario in real-world applications. For example, we may want to use previously learned tasks to help learn a new task; our goal is then to improve the new task rather than the old ones. For this purpose, the learning process should focus on the target task rather than all tasks. We therefore propose to learn the model based on the conditional distribution of the target task given the source task, which is a novel variation of the classical Gaussian process model.

The Adaptive Transfer Learning Model via Gaussian Process

We consider regression problems in this paper. Suppose that we have a regression problem as a source task $S$ with a large amount of training data and another regression problem as a target task $T$ with a small amount of training data. Let $y^{(S)}_i$ denote the observed output of the $i$-th instance in the source task, corresponding to the input $x^{(S)}_i$, and let $y^{(T)}_j$ denote the observed output of the $j$-th instance $x^{(T)}_j$ in the target task. We assume that the underlying latent function between input and output for the source task is $f^{(S)}$. Let $\mathbf{f}^{(S)}$ be the vector with $i$-th element $f^{(S)}(x^{(S)}_i)$, and define $\mathbf{f}^{(T)}$ analogously for the target task. Suppose we have $N$ data instances for the source task and $M$ data instances for the target task; then $\mathbf{f}^{(S)}$ is of length $N$ and $\mathbf{f}^{(T)}$ is of length $M$. We model the noise on the observations by an additive noise term,

$y^{(S)}_i = f^{(S)}_i + \epsilon^{(S)}_i, \qquad y^{(T)}_j = f^{(T)}_j + \epsilon^{(T)}_j,$

where $f^{(\cdot)}_i = f^{(\cdot)}(x^{(\cdot)}_i)$ (we use $(\cdot)$ to denote both $(S)$ and $(T)$ to avoid redundancy). The prior distribution (GP prior) over the latent variables $\mathbf{f}^{(\cdot)}$ is given by a GP, $p(\mathbf{f}^{(\cdot)}) = \mathcal{N}(\mathbf{f}^{(\cdot)} \mid \mathbf{0}, K^{(\cdot)})$, with kernel matrix $K^{(\cdot)}$; the notation $\mathbf{0}$ denotes a vector with all entries equal to zero. We assume that the noise $\epsilon^{(\cdot)}_i$ is a random variable whose value is independent for each observation $y^{(\cdot)}_i$ and follows a zero-mean Gaussian,

$p(y^{(\cdot)}_i \mid f^{(\cdot)}_i) = \mathcal{N}(y^{(\cdot)}_i \mid f^{(\cdot)}_i, \beta^{-1}_{(\cdot)}), \qquad (1)$

where $\beta_s$ and $\beta_t$ are hyper-parameters representing the precision (inverse variance) of the noise in the source and target tasks, respectively.
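As a concrete illustration of the observation model in Equation (1), the minimal sketch below (not from the paper) samples noisy source and target outputs from independent per-task GP priors; the RBF base kernel, the sample sizes, and the noise precisions are illustrative assumptions.

```python
# Minimal sketch of Eq. (1): latent functions drawn from per-task GP priors,
# plus task-specific Gaussian noise. The RBF base kernel k is an assumed choice.
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    """k(x, x') = variance * exp(-||x - x'||^2 / (2 * length_scale^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / length_scale ** 2)

rng = np.random.default_rng(0)
X_s = rng.uniform(-3, 3, size=(50, 1))    # N = 50 source inputs (assumed)
X_t = rng.uniform(-3, 3, size=(10, 1))    # M = 10 target inputs (assumed)
beta_s, beta_t = 100.0, 100.0             # noise precisions (inverse variances)

# Draw latent values f^(S), f^(T) from their GP priors N(0, K) and add noise.
jitter = 1e-8
f_s = rng.multivariate_normal(np.zeros(len(X_s)),
                              rbf_kernel(X_s, X_s) + jitter * np.eye(len(X_s)))
f_t = rng.multivariate_normal(np.zeros(len(X_t)),
                              rbf_kernel(X_t, X_t) + jitter * np.eye(len(X_t)))
y_s = f_s + rng.normal(0.0, beta_s ** -0.5, size=f_s.shape)   # y^(S) = f^(S) + eps^(S)
y_t = f_t + rng.normal(0.0, beta_t ** -0.5, size=f_t.shape)   # y^(T) = f^(T) + eps^(T)
```

Note that in this sketch the two tasks are sampled independently; the coupling across tasks comes from the transfer kernel introduced below.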
Since the noise variables are i.i.d., the distribution of the observed outputs $\mathbf{y}^{(S)} = (y^{(S)}_1, \ldots, y^{(S)}_N)^T$ and $\mathbf{y}^{(T)} = (y^{(T)}_1, \ldots, y^{(T)}_M)^T$ conditioned on the corresponding $\mathbf{f}^{(S)}$ and $\mathbf{f}^{(T)}$ can be written in Gaussian form as

$p(\mathbf{y}^{(\cdot)} \mid \mathbf{f}^{(\cdot)}) = \mathcal{N}(\mathbf{y}^{(\cdot)} \mid \mathbf{f}^{(\cdot)}, \beta^{-1}_{(\cdot)} I), \qquad (2)$

where $I$ is the identity matrix of appropriate dimension. In order to transfer knowledge from the source task $S$ to the target task $T$, we need to construct connections between them. In general, there are two kinds of connections between the source and the target tasks. The first is that the two GP regression models for the source and target tasks share the same parameters $\theta$ in their kernel functions. This indicates that the smoothness of the regression functions of the source and target tasks is similar. This type of transfer scheme is introduced in (Lawrence and Platt 2004) for GP models; many other multi-task learning models also use similar schemes, sharing priors or regularization terms across tasks (Lee et al. 2007; Raina, Ng, and Koller 2006; Ando and Zhang 2005). The second kind of connection is the correlation between the outputs of data instances across tasks (Bonilla, Agakov, and Williams 2007; Bonilla, Chai, and Williams 2008). Unlike the first kind (Lawrence and Platt 2004), we do not assume the data in different tasks to be independent of each other given the shared GP prior, but instead consider the joint distribution of the outputs of both tasks. The connection through shared parameters gives the model its parametric flavor, while the connection through the correlation of data instances gives it its nonparametric flavor; our model may therefore be regarded as a semi-parametric model. Suppose the distribution of the observed outputs conditioned on the inputs $X$ is $p(\mathbf{y} \mid X)$, where $\mathbf{y} = (\mathbf{y}^{(S)}, \mathbf{y}^{(T)})$ and $X = (X^{(S)}, X^{(T)})$. For multi-task learning problems, where the tasks are equally important, the objective would be the likelihood $p(\mathbf{y} \mid X)$. However, for transfer learning, where we have a clear target task, it is not necessary to optimize the parameters with respect to the source task.

Therefore, we directly consider the conditional distribution $p(\mathbf{y}^{(T)} \mid \mathbf{y}^{(S)}, X^{(T)}, X^{(S)})$. Letting $\mathbf{f} = (\mathbf{f}^{(S)}, \mathbf{f}^{(T)})$, we first define a Gaussian process over $\mathbf{f}$, $p(\mathbf{f} \mid X, \theta) = \mathcal{N}(\mathbf{f} \mid \mathbf{0}, \tilde{K})$, with the kernel matrix $\tilde{K}$ for transfer learning defined by

$\tilde{K}_{nm} = k(x_n, x_m)\, e^{-\zeta(x_n, x_m)\rho}, \qquad (3)$

where $\zeta(x_n, x_m) = 0$ if $x_n$ and $x_m$ come from the same task and $\zeta(x_n, x_m) = 1$ otherwise. The intuition behind Equation (3) is that the additional factor makes the correlation between instances from different tasks less than or equal to the correlation between instances from the same task. The parameter $\rho$ represents the dissimilarity between $S$ and $T$. One difficulty in transfer learning is to estimate this (dis)similarity from a limited amount of data; we propose a Bayesian approach to tackle it. Instead of using a point estimate, we treat $\rho$ as drawn from a Gamma distribution, $\rho \sim \Gamma(b, \mu)$. The transfer kernel then becomes

$K_{nm} = \mathbb{E}[\tilde{K}_{nm}] = k(x_n, x_m) \int e^{-\zeta(x_n, x_m)\rho}\, \frac{\rho^{b-1} e^{-\rho/\mu}}{\mu^{b}\, \Gamma(b)}\, d\rho.$

By integrating out $\rho$, we obtain

$K_{nm} = \begin{cases} \left(\frac{1}{1+\mu}\right)^{b} k(x_n, x_m), & \zeta(x_n, x_m) = 1, \\ k(x_n, x_m), & \text{otherwise.} \end{cases} \qquad (4)$

The factor multiplying the kernel function has range $[0, 1]$, so this form of the kernel cannot model negative correlation between tasks. We therefore further extend it to

$\tilde{K}_{nm} = k(x_n, x_m)\left(2 e^{-\zeta(x_n, x_m)\rho} - 1\right), \qquad (5)$

and its Bayesian form

$K_{nm} = \begin{cases} k(x_n, x_m)\left(2\left(\frac{1}{1+\mu}\right)^{b} - 1\right), & \zeta(x_n, x_m) = 1, \\ k(x_n, x_m), & \text{otherwise.} \end{cases} \qquad (6)$

Theorem 1 shows that the kernel matrices defined in Equations (4) and (6) are positive semi-definite (PSD) as long as $k$ is a valid kernel function. Both transfer kernels model the correlation of the outputs based not only on the similarity between inputs but also on the similarity between tasks. Since the kernel in Equation (6) can model negative correlation between tasks and therefore has stronger expressive ability, we use it as the transfer kernel; we discuss its properties further in a later section. The conditional distribution of $\mathbf{f}^{(T)}$ given $\mathbf{f}^{(S)}$ can then be written as

$p(\mathbf{f}^{(T)} \mid \mathbf{f}^{(S)}, X^{(T)}, \theta) = \mathcal{N}\!\left(K_{21} K_{11}^{-1} \mathbf{f}^{(S)},\; K_{22} - K_{21} K_{11}^{-1} K_{12}\right),$

where $K = \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix}$ is a block matrix; $K_{11}$ and $K_{22}$ are the kernel matrices of the data in the source and target tasks, respectively, and $K_{12}$ $(= K_{21}^{T})$ is the kernel matrix across tasks.

Theorem 1. Let $K = \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix}$ be a PSD matrix with $K_{12} = K_{21}^{T}$. Then for $|\lambda| \le 1$, $K' = \begin{pmatrix} K_{11} & \lambda K_{12} \\ \lambda K_{21} & K_{22} \end{pmatrix}$ is also a PSD matrix.

We omit the proof here to save space (it can be found at http://home.ust.hk/~caobin/papers/atgp_ext.pdf). So far, we have described how to construct a unified GP regression model for adaptive transfer learning. In the following subsections, we discuss how to do inference and parameter learning in the proposed GP regression model.

Inductive Inference

For a test point $x$ in the target task, we want to predict its output value $y$ by determining the predictive distribution $p(y \mid \mathbf{y}^{(S)}, \mathbf{y}^{(T)})$, where, for simplicity, the input variables are omitted. The inference process is the same as in standard GP models. The mean and variance of the predictive distribution for the target task are given by

$m(x) = \mathbf{k}_x^{T} C^{-1} \mathbf{y}, \qquad \sigma_T^{2}(x) = c - \mathbf{k}_x^{T} C^{-1} \mathbf{k}_x, \qquad (7)$

where $C = K + \Lambda$ with $\Lambda = \begin{pmatrix} \beta_s^{-1} I_N & 0 \\ 0 & \beta_t^{-1} I_M \end{pmatrix}$, $c = k(x, x) + \beta_t^{-1}$, and $\mathbf{k}_x$ is calculated with the transfer kernel defined in Equation (6). Therefore, $m(x)$ can be further decomposed as

$m(x) = \sum_{x_j \in X^{(T)}} \alpha_j\, k(x, x_j) + \lambda \sum_{x_i \in X^{(S)}} \alpha_i\, k(x, x_i), \qquad (8)$

where $\lambda = 2\left(\frac{1}{1+\mu}\right)^{b} - 1$ and $\alpha_i$ is the $i$-th element of $C^{-1}\mathbf{y}$. The first term in the above formula represents the correlation between the test data point and the data in the target task. The second term represents the correlation between the test data point and the source-task data, where a shrinkage is introduced based on the similarity between the tasks.
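To make Equations (6)-(8) concrete, here is a minimal Python sketch, not taken from the paper, that builds the joint covariance with the transfer kernel and computes the predictive mean and variance at target-task test points; the RBF base kernel, the helper names, and all parameter values are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    """Assumed RBF base kernel k(x, x')."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / length_scale ** 2)

def transfer_lambda(b, mu):
    """lambda = 2 * (1 / (1 + mu))**b - 1, the cross-task factor from Eq. (6)."""
    return 2.0 * (1.0 / (1.0 + mu)) ** b - 1.0

def joint_covariance(X_s, X_t, b, mu, beta_s, beta_t, **kern):
    """C = K + Lambda, with the cross-task blocks shrunk by lambda (Eqs. 6-7)."""
    lam = transfer_lambda(b, mu)
    K11 = rbf_kernel(X_s, X_s, **kern)            # source-source block
    K22 = rbf_kernel(X_t, X_t, **kern)            # target-target block
    K12 = lam * rbf_kernel(X_s, X_t, **kern)      # cross-task block
    K = np.block([[K11, K12], [K12.T, K22]])
    noise = np.concatenate([np.full(len(X_s), 1.0 / beta_s),
                            np.full(len(X_t), 1.0 / beta_t)])
    return K + np.diag(noise), lam

def predict(x_star, X_s, y_s, X_t, y_t, b, mu, beta_s, beta_t, **kern):
    """Predictive mean and variance at target-task test points x_star (Eq. 7)."""
    C, lam = joint_covariance(X_s, X_t, b, mu, beta_s, beta_t, **kern)
    y = np.concatenate([y_s, y_t])
    # Covariance between the test points (target task) and all training points.
    k_x = np.vstack([lam * rbf_kernel(X_s, x_star, **kern),
                     rbf_kernel(X_t, x_star, **kern)])
    alpha = np.linalg.solve(C, y)                 # C^{-1} y
    mean = k_x.T @ alpha                          # m(x) = k_x^T C^{-1} y
    c = np.diag(rbf_kernel(x_star, x_star, **kern)) + 1.0 / beta_t
    var = c - np.einsum('ij,ji->i', k_x.T, np.linalg.solve(C, k_x))
    return mean, var
```

In this sketch, choosing $b$ and $\mu$ so that $\lambda \to 1$ reproduces the transfer all scheme, while $\lambda \approx 0$ effectively ignores the cross-task block and mirrors the no transfer scheme.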
Parameter Learning

Given the observations $\mathbf{y}^{(S)}$ in the source task and $\mathbf{y}^{(T)}$ in the target task, we wish to learn the parameters $\{\theta_i\}_{i=1}^{P}$ ($P$ is the number of parameters in the kernel function) as well as the parameters $b$ and $\mu$ (denoted by $\theta_{P+1}$ and $\theta_{P+2}$ for simplicity) by maximizing the marginal likelihood of the data of the target task. Multi-task GP models (Bonilla, Chai, and Williams 2008) consider the joint distribution of the source and target tasks.

However, for transfer learning problems we may only have relatively few labeled data in the target task, and optimizing with respect to the joint distribution may bias the model towards the source rather than the target. Therefore, we propose to optimize the conditional distribution instead,

$p(\mathbf{y}^{(T)} \mid \mathbf{y}^{(S)}, X^{(T)}, X^{(S)}). \qquad (9)$

As analyzed before, this distribution is also Gaussian and the model is still a GP. A slight difference from the classical GP is that its mean is no longer a zero vector and is itself a function of the parameters:

$p(\mathbf{y}^{(T)} \mid \mathbf{y}^{(S)}, X^{(T)}, X^{(S)}) \sim \mathcal{N}(\mu_t, C_t), \qquad (10)$

where

$\mu_t = K_{21}(K_{11} + \sigma_s^{2} I)^{-1} \mathbf{y}_s, \qquad C_t = (K_{22} + \sigma_t^{2} I) - K_{21}(K_{11} + \sigma_s^{2} I)^{-1} K_{12}, \qquad (11)$

and $K_{11}(x_n, x_m) = K_{22}(x_n, x_m) = k(x_n, x_m)$ and $K_{21}(x_n, x_m) = K_{12}(x_n, x_m) = k(x_n, x_m)\left(2\left(\frac{1}{1+\mu}\right)^{b} - 1\right)$. The log-likelihood is given by

$\ln p(\mathbf{y}_t \mid \theta) = -\frac{1}{2}\ln |C_t| - \frac{1}{2}(\mathbf{y}_t - \mu_t)^{T} C_t^{-1} (\mathbf{y}_t - \mu_t) - \frac{M}{2}\ln(2\pi). \qquad (12)$

We can compute the derivative of the log-likelihood with respect to the parameters,

$\frac{\partial}{\partial \theta_i} \ln p(\mathbf{y}_t \mid \theta) = -\frac{1}{2}\mathrm{Tr}\!\left(C_t^{-1}\frac{\partial C_t}{\partial \theta_i}\right) + \frac{1}{2}(\mathbf{y}_t - \mu_t)^{T} C_t^{-1} \frac{\partial C_t}{\partial \theta_i} C_t^{-1} (\mathbf{y}_t - \mu_t) + \left(\frac{\partial \mu_t}{\partial \theta_i}\right)^{T} C_t^{-1} (\mathbf{y}_t - \mu_t).$

The difference between the proposed learning model and classical GP learning models is the presence of the last term in the above equation and the non-zero mean of the Gaussian process. However, the standard inference and learning algorithms can still be used, and many approximation techniques for GP models (Bottou et al. 2007) can be applied directly to speed up the inference and learning processes of AT-GP.

Transfer Kernel: Modeling Correlation Between Tasks

As mentioned above, our main contribution is the proposed semi-parametric transfer kernel for transfer learning. In this section, we further discuss its properties for modeling correlations between tasks. In general, the kernel function in a GP expresses that for points $x_n$ and $x_m$ that are similar, the corresponding values $y(x_n)$ and $y(x_m)$ are more strongly correlated than for dissimilar points. In the transfer learning scenario, the correlation between $y(x_n)$ and $y(x_m)$ also depends on which tasks the inputs $x_n$ and $x_m$ come from and on how similar the tasks are. The transfer kernel therefore expresses how the values $y(x_n)$ and $y(x_m)$ are correlated when $x_n$ and $x_m$ come from different tasks. The transfer kernel can realize different schemes in three cases:

- Transfer over priors: $\lambda \approx 0$, meaning we know the source and target tasks are not similar or we have no confidence in their relation. When the correlations between data in the source and target tasks are slim, what we transfer is only the shared parameters in the kernel function $k$, so we only require the degree of smoothness of the source and target tasks to be shared.

- Transfer over data: $0 < \lambda < 1$. In this case, besides the smoothness information, the model directly transfers data from the source task to the target task. How much the data in the source task influence the target task depends on the value of $\lambda$.

- Single-task problem: $\lambda = 1$, meaning we have high confidence that the tasks are extremely correlated, so we can treat the two tasks as one. In this case, the model is equivalent to the transfer all scheme.

The learning algorithm can automatically determine which setting a problem falls into. This is achieved by estimating $\lambda$ from the labeled data of both the source and target tasks. The experiments in the next section show that only a few labeled data are required to estimate $\lambda$ well.
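As a rough illustration of the parameter-learning step above (Equations (10)-(12)), the following sketch, which is not the authors' implementation, fits the kernel length-scale, the noise levels, and $(b, \mu)$ by minimizing the negative conditional log-likelihood with a generic optimizer; the RBF base kernel, the log-parameterization, and the initial values are assumptions, and the analytic gradient is replaced by finite differences for brevity.

```python
# Minimal sketch of learning the AT-GP hyper-parameters by maximising the
# conditional likelihood of Eqs. (10)-(12). Not the authors' implementation.
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B, length_scale):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale ** 2)

def neg_cond_log_lik(log_params, X_s, y_s, X_t, y_t):
    ell, sig_s, sig_t, b, mu = np.exp(log_params)      # keep all parameters positive
    lam = 2.0 * (1.0 / (1.0 + mu)) ** b - 1.0          # task-similarity factor (Eq. 6)
    K11 = rbf_kernel(X_s, X_s, ell) + sig_s ** 2 * np.eye(len(X_s))
    K22 = rbf_kernel(X_t, X_t, ell) + sig_t ** 2 * np.eye(len(X_t))
    K21 = lam * rbf_kernel(X_t, X_s, ell)
    mu_t = K21 @ np.linalg.solve(K11, y_s)             # Eq. (11): conditional mean
    C_t = K22 - K21 @ np.linalg.solve(K11, K21.T)      # Eq. (11): conditional covariance
    L = np.linalg.cholesky(C_t + 1e-8 * np.eye(len(X_t)))
    resid = np.linalg.solve(L, y_t - mu_t)
    # Eq. (12): negative conditional log-likelihood of the target observations.
    return (np.log(np.diag(L)).sum() + 0.5 * resid @ resid
            + 0.5 * len(y_t) * np.log(2 * np.pi))

def fit_atgp(X_s, y_s, X_t, y_t):
    x0 = np.log([1.0, 0.1, 0.1, 1.0, 1.0])             # initial guesses (assumed)
    res = minimize(neg_cond_log_lik, x0, args=(X_s, y_s, X_t, y_t),
                   method='L-BFGS-B')
    ell, sig_s, sig_t, b, mu = np.exp(res.x)
    lam = 2.0 * (1.0 / (1.0 + mu)) ** b - 1.0
    return {'length_scale': ell, 'sigma_s': sig_s, 'sigma_t': sig_t,
            'b': b, 'mu': mu, 'lambda': lam}
```

The returned value of lambda can then be read as the task-similarity estimate discussed above: values near 1 indicate that the source data are used almost as target data, while values near 0 mean that essentially only the kernel smoothness is shared.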
Experiments

Synthetic Dataset

In this experiment, we show how the proposed AT-GP model performs as the similarity between the source task and the target task changes. We first generate a synthetic data set to test the AT-GP algorithm, in order to better illustrate the properties of the algorithm under different parameter settings. We use a linear regression problem as a case study. We are given a linear regression function $f(x) = w_0^{T} x + \epsilon$, where $w_0 \in \mathbb{R}^{100}$ and $\epsilon$ is a zero-mean Gaussian noise term. The target task is to learn this regression model from a few data points generated by it. In our experiment, we use this function to generate 500 data points for the target task; among them, 50 are randomly selected for training and the rest are used for testing. For the source task, we use $g(x) = w^{T} x + \epsilon = (w_0 + \delta \Delta w)^{T} x + \epsilon$ to generate 500 data points for training, where $\Delta w$ is a randomly generated vector and $\delta$ is a variable controlling the difference between $g$ and $f$. In the experiment we increase $\delta$ and thereby vary the distance between the two tasks, $D_f = \|w - w_0\|_F$. Figure 2 shows how the mean absolute error (MAE) on the 450 target test points changes with the distance between the source and target tasks. The results are compared with the transfer all scheme (directly using all of the training data) and the no transfer scheme (using only the training data of the target task). As we can see, when the two tasks are very similar, the AT-GP model performs as well as transfer all, while when the tasks are very different, the AT-GP model is no worse than no transfer.
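A sketch of the data-generating protocol just described might look as follows; the random seed, the noise level, and the normalization of the perturbation direction are assumptions not stated in the paper.

```python
# Sketch of the synthetic protocol: target task f(x) = w0^T x + eps and source
# task g(x) = (w0 + delta * dw)^T x + eps. Seed, noise level, and normalization
# of the perturbation direction dw are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, noise_std = 100, 0.1

w0 = rng.normal(size=d)                                # target weight vector
X_t = rng.normal(size=(500, d))                        # 500 target points
y_t = X_t @ w0 + rng.normal(0.0, noise_std, 500)
train_idx = rng.choice(500, size=50, replace=False)    # 50 for training, 450 for test
X_train, y_train = X_t[train_idx], y_t[train_idx]

def make_source(delta):
    """Generate 500 source training points at a task distance controlled by delta."""
    dw = rng.normal(size=d)
    dw /= np.linalg.norm(dw)                           # unit perturbation direction (assumed)
    w = w0 + delta * dw
    X_s = rng.normal(size=(500, d))
    y_s = X_s @ w + rng.normal(0.0, noise_std, 500)
    D_f = np.linalg.norm(w - w0)                       # distance between the two tasks
    return X_s, y_s, D_f
```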

[Figure 2: The left panel shows the change in MAE as the distance from f increases, compared with the transfer all and no transfer schemes; the right panel shows the change in λ as the distance from f increases. λ is strongly correlated with D_f.]

[Figure 4: Learning with different numbers of labeled data in the target task. The left panel shows the convergence curve of λ with respect to the number of data points; the right panel shows the change in MAE on the test data. (λ* is the value of λ after convergence; here λ* = 0.3.)]

Figure 4 shows the experimental results on learning $\lambda$ with a varying number of labeled data in the target task. It is interesting to observe that the number of data points required to learn $\lambda$ well (left panel) is much smaller than the number required to learn the task well (right panel). This indicates why transfer learning works.

Real-World Datasets

In this section, we conduct experiments on three real-world datasets.

WiFi Localization (http://www.cs.ust.hk/~qyang/icdmdmc07/): The task is to predict the location corresponding to each collection of received signal strength (RSS) values in an indoor environment, received from WiFi Access Points (APs). A set of (RSS values, location) pairs is given as training data. The training data are collected in a different time period from the test data, so there is a distribution change between the training and test data. In WiFi location estimation, when we use the outdated data as training data, the error can be very large. However, because the location information is constant across time, a certain part of the data can be transferred. If this can be done successfully, we can save a lot of manual labeling effort for the new time period. Therefore, we want to use the outdated data as the source task to help predict the locations of current signals. Different from multi-task learning, which cares about the performance on all tasks, in this scenario we only care about the performance on the current data, corresponding to the target task.

Wine (http://archive.ics.uci.edu/ml/datasets/wine+quality): The dataset concerns wine quality and includes red and white wine samples. The features include objective tests (e.g., pH values) and the output is based on sensory data. The labels are given by experts as grades between 0 (very bad) and 10 (very excellent). There are 1599 records for the red wine and 4898 for the white wine. We use the quality prediction problem for the white wine as the source task and the quality prediction problem for the red wine as the target task.

SARCOS (http://www.gaussianprocess.org/gpml/data/): The dataset relates to an inverse dynamics problem for a seven degrees-of-freedom SARCOS anthropomorphic robot arm. The task is to map from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. The original problem is a multi-output regression problem; it can also be treated as a multi-task learning problem by treating the seven mappings as seven tasks. In this paper we use one of the tasks as the target task and another as the source task, giving 49 task pairs in total for our experiments.

In our experiments, all data in the source task and 5% of the data in the target task are used for training.
The remaining 95% of the data in the target task are used for evaluation. We use the NMSE (normalized mean squared error) to evaluate the results on the Wine and SARCOS datasets, and the error distance (in meters) for WiFi; a smaller value indicates better performance under both criteria. The average performance results are shown in Table 1, where "No" and "All" are GP models with the no transfer and transfer all schemes, "Multi-1" is (Lawrence and Platt 2004), and "Multi-2" is (Bonilla, Chai, and Williams 2008).

Discussion

We now discuss the experimental results further. For some task pairs in the datasets, the source task and target task are quite related, as in the case of the WiFi dataset. In these cases the λ parameter learned by the model is large, allowing the shared knowledge to be transferred successfully. In other cases, such as those on the SARCOS dataset, the source and target tasks may not be related and negative transfer may occur. A safer way is then to use the parameter-transfer scheme (Multi-1 in (Lawrence and Platt 2004)) or the no transfer scheme to avoid negative transfer.

Data     No           All          Multi-1      Multi-2       AT
Wine     1.33 ± 0.3   1.37 ± 0.7   1.69 ± 0.5   1.27 ± 0.3    1.16 ± 0.3
SARCOS   0.21 ± 0.1   1.58 ± 1.3   0.24 ± 0.1   0.26 ± 0.3    0.18 ± 0.1
WiFi     9.18 ± 1.5   5.28 ± 1.3   9.35 ± 1.4   11.92 ± 1.8   4.98 ± 0.6

Table 1: Results on three real-world datasets. The NMSE over all source/target task pairs is reported for the Wine and SARCOS datasets, while error distances (in meters) are reported for the WiFi dataset. Both means (before ±) and standard deviations (after ±) are reported. We have conducted t-tests, which show that the improvements are significant at the 0.05 significance level.

The drawback of the parameter-transfer scheme and the no transfer scheme is that they may lose a lot of shared knowledge when the tasks are similar. Moreover, since multi-task learning treats the source and target tasks as equally important and the source task may dominate the learning of the parameters, the performance on the target task may even be worse than in the no transfer case, as observed on the SARCOS dataset. What we should focus on, however, is the target task. In our method, we conduct the learning process on the target task, so the learned parameters fit the target task; as a result, the AT-GP model performs best on all three datasets. In many real-world applications, it is hard to know exactly whether the tasks are related or not. Since our method adjusts the transfer scheme automatically according to the similarity of the two tasks, we can adaptively transfer as much of the shared knowledge as possible while avoiding negative transfer.

Related Work

Multi-task learning is closely related to transfer learning, and many papers (Yu, Tresp, and Schwaighofer 2005; Schwaighofer, Tresp, and Yu 2005) consider multi-task learning and transfer learning to be the same problem. Recently, various GP models have been proposed to solve multi-task learning problems. Yu et al. (Yu, Tresp, and Schwaighofer 2005; Schwaighofer, Tresp, and Yu 2005) proposed hierarchical Gaussian process models for multi-task learning. Lawrence and Platt (2004) also proposed a multi-task learning model based on Gaussian processes; this model tries to discover common kernel parameters over different tasks, and the informative vector machine was introduced to solve large-scale problems. In (Bonilla, Chai, and Williams 2008), Bonilla et al. proposed a multi-task regression model using Gaussian processes. They considered the similarity between tasks and constructed a free-form kernel matrix to represent task relations. The major difference between their model and ours lies in the constructed kernel matrix: they consider a point estimate of the correlations between tasks, which may not be robust when the data in the target task are scarce, and they treat the tasks as equally important rather than adopting the transfer setting. One difference between transfer learning and multi-task learning is that in transfer learning we are particularly interested in transferring knowledge from one or more source tasks to a target task rather than learning these tasks simultaneously; what we are concerned with is the performance on the target task only. On the problem of adaptive transfer learning, to the best of our knowledge only (Rosenstein and Dietterich 2005) addressed the problem of negative transfer, but they did not achieve adaptive transfer.

Conclusion

In this paper, we proposed an adaptive transfer Gaussian process (AT-GP) model for adaptive transfer learning. The proposed model can automatically learn the similarity between tasks; how much to transfer is based on how similar the tasks are, and negative transfer can be avoided.
The experiments on both synthetic and real-world datasets verify the effectiveness of the proposed model.

Acknowledgments

Bin Cao, Sinno Jialin Pan and Qiang Yang thank the support of RGC/NSFC grant N_HKUST624/09.

References

Ando, R. K., and Zhang, T. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6.

Bonilla, E. V.; Agakov, F.; and Williams, C. 2007. Kernel multi-task learning using task-specific features. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS '07).

Bonilla, E.; Chai, K. M.; and Williams, C. 2008. Multi-task Gaussian process prediction. In Platt, J.; Koller, D.; Singer, Y.; and Roweis, S., eds., NIPS 20. MIT Press.

Bottou, L.; Chapelle, O.; DeCoste, D.; and Weston, J., eds. 2007. Large Scale Kernel Machines. Cambridge: MIT Press.

Dai, W.; Yang, Q.; Xue, G.-R.; and Yu, Y. 2007. Boosting for transfer learning. In Proceedings of the 24th ICML. ACM.

Gelman, A.; Carlin, J. B.; Stern, H. S.; and Rubin, D. B. 2003. Bayesian Data Analysis. Chapman & Hall/CRC, second edition.

Lawrence, N. D., and Platt, J. C. 2004. Learning to learn with the informative vector machine. In Proceedings of the 21st ICML. Banff, Alberta, Canada: ACM.

Lee, S.-I.; Chatalbashev, V.; Vickrey, D.; and Koller, D. 2007. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of the 24th ICML, 489-496. Corvallis, Oregon: ACM.

Raina, R.; Battle, A.; Lee, H.; Packer, B.; and Ng, A. Y. 2007. Self-taught learning: transfer learning from unlabeled data. In ICML '07. New York, NY, USA: ACM.

Raina, R.; Ng, A. Y.; and Koller, D. 2006. Constructing informative priors using transfer learning. In Proceedings of the 23rd ICML. Pittsburgh, Pennsylvania: ACM.

Rosenstein, M. T.; Marx, Z.; Kaelbling, L. P.; and Dietterich, T. G. 2005. To transfer or not to transfer. In NIPS 2005 Workshop on Transfer Learning.

Schwaighofer, A.; Tresp, V.; and Yu, K. 2005. Learning Gaussian process kernels via hierarchical Bayes. In NIPS 17.

Yu, K.; Tresp, V.; and Schwaighofer, A. 2005. Learning Gaussian processes from multiple tasks. In Proceedings of the 22nd ICML. Bonn, Germany: ACM.