Learning Transferable Features with Deep Adaptation Networks
Mingsheng Long, Yue Cao, Jianmin Wang, Michael I. Jordan
Presented by Changyou Chen
October 30, 2015
Outline
1 Introduction
2 Deep Adaptation Networks
3 Experiments
Contribution
- Proposes a deep architecture for transfer learning.
- Based on a deep convolutional neural network (AlexNet [Krizhevsky et al., 2012]).
- It is essentially a CNN with a particular regularizer, such that the distance between the distributions generating the source-domain and target-domain data is minimized.
Transfer learning (domain adaptation)
This paper considers the unsupervised/semi-supervised domain adaptation setting.
- Given a source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ generated from a distribution $p$, and an unlabelled target domain $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$ (unsupervised), or a partially labelled one $\mathcal{D}_t = \{(x_i^t, y_i^t)\}_{i=1}^{m_t} \cup \{x_j^t\}$ (semi-supervised), generated from a distribution $q$. $p$ and $q$ are usually unknown.
- Transfer learning aims to build a classifier $y = \theta(x)$ that minimizes the target risk $\epsilon_t(\theta) = \Pr_{(x,y) \sim q}[\theta(x) \neq y]$ using source supervision:
  - in this paper, the classifier is built on a CNN, the AlexNet [Krizhevsky et al., 2012]
The AlexNet [Krizhevsky et al., 2012]
- An 8-layer deep model: the first five layers are convolutional, layers 6-7 are fully connected, and the last layer is a softmax layer.
- Achieved state-of-the-art classification performance on ImageNet in 2012.
- Building block for many state-of-the-art models.
Multi-kernel maximum mean discrepancy (MK-MMD)
MK-MMD measures the distance between distributions in a reproducing kernel Hilbert space (RKHS).
- Let $\mathcal{H}_k$ be an RKHS with kernel $k$. The mean embedding of a distribution $p$ in $\mathcal{H}_k$ is the unique element $\mu_k(p)$ such that $\mathbb{E}_{x \sim p} f(x) = \langle f, \mu_k(p) \rangle_{\mathcal{H}_k}$ for all $f \in \mathcal{H}_k$.
- Given the feature map $\phi$, MK-MMD defines the RKHS distance between the mean embeddings of $p$ and $q$ as:
$$d_k^2(p, q) \triangleq \big\| \mathbb{E}_p[\phi(x^s)] - \mathbb{E}_q[\phi(x^t)] \big\|_{\mathcal{H}_k}^2 \quad (1)$$
- The kernel associated with $\phi$ is a convex combination of $m$ PSD kernels $\{k_u\}$:
$$\mathcal{K} \triangleq \Big\{ k = \sum_{u=1}^m \beta_u k_u : \sum_{u=1}^m \beta_u = 1,\ \beta_u \geq 0,\ \forall u \Big\} \quad (2)$$
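To make Eq. (1) concrete, here is a minimal PyTorch sketch of an empirical MK-MMD estimate; the equal kernel weights ($\beta_u = 1/m$) and the Gaussian bandwidths are illustrative assumptions, not the learned values from the paper:

```python
import torch

def multi_kernel(x, y, bandwidths=(1.0, 2.0, 4.0)):
    """Equal-weight convex combination of Gaussian kernels (beta_u = 1/m)."""
    sq_dist = torch.cdist(x, y) ** 2                     # pairwise ||x - y||^2
    return sum(torch.exp(-sq_dist / (2.0 * s ** 2)) for s in bandwidths) / len(bandwidths)

def mmd2(xs, xt, bandwidths=(1.0, 2.0, 4.0)):
    """Empirical estimate of Eq. (1): E k(s,s') + E k(t,t') - 2 E k(s,t)."""
    return (multi_kernel(xs, xs, bandwidths).mean()
            + multi_kernel(xt, xt, bandwidths).mean()
            - 2.0 * multi_kernel(xs, xt, bandwidths).mean())
```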
Outline
1 Introduction
2 Deep Adaptation Networks
3 Experiments
Deep adaptation networks (DAN)
Two parallel AlexNets with sharing (see the sketch after this list):
- one for the source domain, the other for the target domain
- the first 5 convolutional layers are shared: the first three are fixed after pretraining on the source domain, the last two are fine-tuned during training using the target-domain data
- the last 3 layers are individual feedforward nets
- the last layer is a softmax layer
- MK-MMD regularizes the distributions p and q generating the data to be close to each other in the RKHS
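The paper provides no code; the following is a minimal PyTorch sketch of this layout, assuming torchvision's AlexNet (whose layer indexing differs slightly from the 2012 original) as the backbone. Both domains pass through the same module, which realizes the parameter sharing, and the fc6-fc8 activations are returned so the MK-MMD penalties can be placed on them:

```python
import torch.nn as nn
from torchvision import models

class DAN(nn.Module):
    """Minimal DAN sketch: one AlexNet trunk applied to both domains
    (weight sharing), exposing fc6-fc8 activations for the MK-MMD terms."""

    def __init__(self, num_classes=31):
        super().__init__()
        backbone = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
        self.features = backbone.features            # conv1-conv5 (shared)
        self.avgpool = backbone.avgpool
        fc = list(backbone.classifier)
        self.fc6 = nn.Sequential(*fc[:3])            # dropout, linear, ReLU
        self.fc7 = nn.Sequential(*fc[3:6])           # dropout, linear, ReLU
        self.fc8 = nn.Linear(4096, num_classes)      # task-specific softmax layer

    def forward(self, x):
        h = self.avgpool(self.features(x)).flatten(1)
        h6 = self.fc6(h)
        h7 = self.fc7(h6)
        h8 = self.fc8(h7)
        return h6, h7, h8   # per-layer representations for the MMD terms
```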
Deep adaptation networks (DAN)
Let $\Theta = \{W^l, b^l\}_{l=1}^L$ be the set of all model parameters; the objective function of DAN is:
$$\min_\Theta \underbrace{\frac{1}{n_a} \sum_{i=1}^{n_a} J(\theta(x_i^a), y_i^a)}_{\text{CNN}} + \lambda \underbrace{\sum_{l=6}^{8} d_k^2(\mathcal{D}_s^l, \mathcal{D}_t^l)}_{\text{MK-MMD}} \quad (3)$$
- $(x_i^a, y_i^a)$: labelled input data; $\theta(x_i^a)$: softmax output; $J$: cross-entropy loss function
- $\mathcal{D}^l = \{h_i^l\}$: the $l$-th layer hidden representation
- $d_k^2(\mathcal{D}_s^l, \mathcal{D}_t^l) = \mathbb{E}_{x^s, x'^s}\, k(x^s, x'^s) + \mathbb{E}_{x^t, x'^t}\, k(x^t, x'^t) - 2\, \mathbb{E}_{x^s, x^t}\, k(x^s, x^t)$: MK-MMD between source and target
- Use an unbiased estimate of $d_k^2(\mathcal{D}_s^l, \mathcal{D}_t^l)$ in SGD (see the sketch below):
$$d_k^2(\mathcal{D}_s^l, \mathcal{D}_t^l) = \frac{2}{n_s} \sum_{i=1}^{n_s/2} g_k(z_i), \quad (4)$$
with $z_i \triangleq (x_{2i-1}^s, x_{2i}^s, x_{2i-1}^t, x_{2i}^t)$ and $g_k(z_i) \triangleq k(x_{2i-1}^s, x_{2i}^s) + k(x_{2i-1}^t, x_{2i}^t) - k(x_{2i-1}^s, x_{2i}^t) - k(x_{2i}^s, x_{2i-1}^t)$.
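A sketch of the linear-time estimator of Eq. (4), reusing the illustrative equal-weight Gaussian kernels from the earlier snippet (the bandwidths are again assumptions, not the learned $\beta$):

```python
import torch

def mmd2_linear(xs, xt, bandwidths=(1.0, 2.0, 4.0)):
    """Linear-time unbiased MK-MMD estimate, Eq. (4)."""
    n = (min(len(xs), len(xt)) // 2) * 2        # need an even number of samples
    s1, s2 = xs[0:n:2], xs[1:n:2]               # x^s_{2i-1}, x^s_{2i}
    t1, t2 = xt[0:n:2], xt[1:n:2]               # x^t_{2i-1}, x^t_{2i}

    def k(a, b):                                # multi-kernel on row-aligned pairs
        sq = ((a - b) ** 2).sum(dim=1)
        return sum(torch.exp(-sq / (2.0 * s ** 2)) for s in bandwidths) / len(bandwidths)

    # g_k(z_i), one value per quadruple z_i
    g = k(s1, s2) + k(t1, t2) - k(s1, t2) - k(s2, t1)
    return g.mean()                             # equals (2 / n_s) * sum_i g_k(z_i)
```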
Deep adaptation networks (DAN)
The gradient is calculated as:
$$\nabla_{\Theta^l} = \frac{\partial J(z_i)}{\partial \Theta^l} + \lambda\, \frac{\partial g_k(z_i^l)}{\partial \Theta^l} \quad (5)$$
- $\partial J(z_i) / \partial \Theta^l$ is the same as in a standard CNN
- $\partial g_k(z_i^l) / \partial \Theta^l$ can also be calculated easily
Learning the kernel weights $\beta$:
- according to [Gretton et al., 2012], this is equivalent to (see the sketch below):
$$\min_{\mathbf{d}^T \beta = 1,\ \beta \geq 0} \beta^T (Q + \varepsilon I)\, \beta, \quad (6)$$
with $Q$ a matrix constructed from $g_k(z_i)$.
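Eq. (6) is a small quadratic program. As an illustration only (the names `Q` and `d` and the use of SciPy's general-purpose SLSQP solver are my assumptions, not the paper's actual procedure), it can be solved as:

```python
import numpy as np
from scipy.optimize import minimize

def solve_beta(Q, d, eps=1e-3):
    """Solve Eq. (6): min beta^T (Q + eps*I) beta  s.t.  d^T beta = 1, beta >= 0."""
    m = len(d)
    P = Q + eps * np.eye(m)
    beta0 = np.full(m, 1.0 / d.sum())           # feasible start: d^T beta0 = 1
    res = minimize(lambda b: b @ P @ b, beta0,
                   method="SLSQP",
                   bounds=[(0.0, None)] * m,
                   constraints=[{"type": "eq", "fun": lambda b: d @ b - 1.0}])
    return res.x
```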
Theoretical property
Theorem. Let $\epsilon_s(\theta) = \Pr_{(x,y) \sim p}[\theta(x) \neq y]$ and $\epsilon_t(\theta) = \Pr_{(x,y) \sim q}[\theta(x) \neq y]$ be the expected risks on the source and target domains; then
$$\epsilon_t(\theta) \leq \epsilon_s(\theta) + 2\, d_k(p, q) + C, \quad (7)$$
where $C$ is a constant accounting for the complexity of the hypothesis space and the risk of an ideal hypothesis for both domains.
Outline
1 Introduction
2 Deep Adaptation Networks
3 Experiments
Datasets
Office-31:
- 4,652 images in 31 categories, collected from three domains: Amazon (A), Webcam (W) and DSLR (D)
- evaluated transfers: A → W, D → W, W → D, A → D, D → A, W → A
Office-10 + Caltech-10:
- the 10 categories shared by the Office-31 and Caltech-256 (C) datasets
- evaluated transfers: A → C, W → C, D → C, C → A, C → W, C → D
Setup
- Compared with TCA [Pan et al., 2011] and GFK [Gong et al., 2012] (shallow models), and CNN [Krizhevsky et al., 2012], LapCNN [Weston et al., 2008] and DDC [Tzeng et al., 2014] (deep models).
- Several variants of DAN:
  - DAN$_7$: DAN imposing the MK-MMD on layer 7 only
  - DAN$_8$: DAN imposing the MK-MMD on layer 8 only
  - DAN$_{SK}$: DAN with a single-kernel MMD
- Gaussian kernels with varying bandwidths (variances) are used.
- Convolutional layers conv1-conv3 are fixed after pretraining AlexNet on the source data; conv4-conv5 and the fully connected layers fc6-fc8 are fine-tuned (see the sketch below).
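An illustrative PyTorch sketch of this freezing scheme, again assuming torchvision's AlexNet (in which conv1-conv3 are `features[0]`, `features[3]` and `features[6]`); the learning rate and momentum values are placeholders:

```python
import torch
from torchvision import models

net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

for idx in (0, 3, 6):                           # freeze conv1-conv3
    for p in net.features[idx].parameters():
        p.requires_grad = False

# conv4-conv5 and the fully connected layers stay trainable and are fine-tuned.
trainable = [p for p in net.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
```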
Office-31: unsupervised
[Results table shown as a figure in the original slides.]
Office-10 + Caltech-10: unsupervised
[Results table shown as a figure in the original slides.]
Office-31: semi-supervised
- Deep models outperform shallow models.
- Existing deep models cannot deal well with the challenge of domain discrepancy.
- Multi-layer distribution adaptation (DAN) is better than single-layer adaptation (DAN$_7$ or DAN$_8$), and also better than the single-kernel variant (DAN$_{SK}$).
Feature embedding with t-SNE
[t-SNE visualization shown as a figure in the original slides.]
Thanks for your attention!!!
References
- Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K.: Optimal kernel choice for large-scale two-sample tests. NIPS (2012)
- Pan, S. J., Tsang, I. W., Kwok, J. T., Yang, Q.: Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks (2011)
- Gong, B., Shi, Y., Sha, F., Grauman, K.: Geodesic flow kernel for unsupervised domain adaptation. CVPR (2012)
- Krizhevsky, A., Sutskever, I., Hinton, G. E.: ImageNet classification with deep convolutional neural networks. NIPS (2012)
- Weston, J., Ratle, F., Collobert, R.: Deep learning via semi-supervised embedding. ICML (2008)
- Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., Darrell, T.: Deep domain confusion: Maximizing for domain invariance. Technical report, arXiv:1412.3474 (2014)