Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval
Xiaodong Liu (1,2), Jianfeng Gao (1), Xiaodong He (1), Li Deng (1), Kevin Duh (2), Ye-Yi Wang (1)
(1) Microsoft Research, USA   (2) Nara Institute of Science and Technology, Japan
Learning Vector-Space Representations: Why?
- Significant accuracy gains in NLP tasks [Collobert+ 11]
- More compact models that are easier to train and generalize better
- Existing learning methods are not optimal:
  - Unsupervised objectives [Mikolov+ 11] are sub-optimal for the tasks of interest
  - Supervised objectives on a single task [Socher+ 13] are constrained by limited amounts of training data
- Our solution is inspired by multi-task learning [Caruana 97]
Multi-Task Deep Neural Nets for Representation Learning
- Leverage supervised data from many (related) tasks
  - Reduces overfitting to any specific task
  - Makes the learned representations universal across tasks
- Combine tasks as disparate as
  - Semantic query classification, and
  - Semantic web search
- Large-scale experiments show
  - Higher accuracies on multiple tasks
  - More compact models
  - Easy adaptation to new tasks/domains
The Query Classification Task
- Given a search query Q, e.g., "denver sushi downtown"
- Identify its domain C, e.g., Restaurant, Hotel, Nightlife, Flight
- Thus, a search engine can tailor the interface and results to provide a richer, personalized user experience
Problem Formulation
- For each domain C, build a binary classifier
  - Input: represent a query Q as a vector of features x = [x_1, ..., x_n]^T
  - Output: y = P(C|Q); Q is labeled C if P(C|Q) > 0.5
- Input feature vector, e.g., a bag-of-words vector
  - Regards words as atomic symbols: denver, sushi, downtown
  - Each word is represented as a one-hot vector: [0, ..., 0, 1, 0, ..., 0]^T
  - Bag-of-words vector = sum of one-hot vectors (see the sketch below)
  - Other (better) features: n-grams, phrases, (learned) topics, etc.
- How to construct optimal feature vectors for queries?
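A minimal sketch (in Python) of how such a bag-of-words feature vector can be built as a sum of one-hot word vectors; the toy vocabulary and query here are purely illustrative:

```python
import numpy as np

def bag_of_words(query, vocab):
    """Sum of one-hot word vectors = word-count vector for the query."""
    x = np.zeros(len(vocab))
    for word in query.lower().split():
        if word in vocab:            # out-of-vocabulary words are simply skipped
            x[vocab[word]] += 1.0
    return x

# Toy vocabulary; a real system would use hundreds of thousands of entries.
vocab = {w: i for i, w in enumerate(["denver", "sushi", "downtown", "hotel"])}
print(bag_of_words("denver sushi downtown", vocab))   # [1. 1. 1. 0.]
```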
The Web Search Ranking Task
[Figure: candidate documents (D) paired with example queries (Q): "cold home remedy", "cold remeedy", "flu treatment", "how to deal with stuffy nose"]
Semantic Matching between Q and D (in order of R&D progress)
- Fuzzy keyword matching
  - Q: cold home remedy
  - D: best home remedies for cold and flu
- Spelling correction
  - Q: cold remeedies
  - D: best home remedies for cold and flu
- Query alteration/expansion
  - Q: flu treatment
  - D: best home remedies for cold and flu
- Query/document semantic matching
  - Q: how to deal with stuffy nose
  - D: best home remedies for cold and flu
Problem Formulation
- Given a query Q and a list of candidate docs D_i, i = 1, ..., N
- Rank D_i according to their relevance to Q
- Represent Q and D as feature vectors, where features are bag of words, phrases, (learned) topics, etc.
- Relevance = cosine similarity of the feature vectors of Q and D (see the sketch below)
- How to construct optimal feature vectors for queries and docs?
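As a rough illustration, relevance-by-cosine-similarity can be computed as below; the feature vectors are assumed to have been constructed already (e.g., bag of words), and the function names are my own:

```python
import numpy as np

def cosine(q, d):
    """Cosine similarity between a query feature vector and a document feature vector."""
    return float(q @ d) / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-12)

def rank_docs(q_vec, doc_vecs):
    """Return document indices ordered from most to least relevant to the query."""
    scores = [cosine(q_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```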
A DNN for Classification and a DSSM for Ranking
- Classifier/ranker that uses the hidden features as input
- Feature generation: project raw input features (bag of words) into hidden features (topics)
- Deep Structured Semantic Model (DSSM) [Huang+ 13]
The Proposed Multi-Task DNN Model
Shared Layers (l_1 and l_2)
- Word-Hash Layer (l_1)
  - Controls the dimensionality of the input using letter-3-grams, e.g., cat -> #cat# -> #-c-a, c-a-t, a-t-#
  - Only ~50K letter-trigrams in English; no OOV issue, since OOV words can still be represented by letter-3-grams
  - Spelling variations of the same word have similar representations
- Shared Semantic-Representation Layer (l_2)
  - Captures cross-task semantic characteristics for arbitrary text (Q or D)
  - l_2 = tanh(W_1 · l_1)
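A small sketch of the letter-3-gram word hashing described above; the trigram index is assumed to be built offline from a corpus:

```python
import numpy as np

def letter_trigrams(word):
    """Decompose a word into letter-3-grams, e.g. cat -> #cat# -> '#ca', 'cat', 'at#'."""
    marked = "#" + word.lower() + "#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def word_hash(text, trigram_index):
    """l_1: bag-of-letter-trigrams vector for an arbitrary query or document."""
    l1 = np.zeros(len(trigram_index))
    for word in text.split():
        for tg in letter_trigrams(word):
            if tg in trigram_index:      # ~50K trigrams cover English; misses are rare
                l1[trigram_index[tg]] += 1.0
    return l1
```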
Task-Specific Representation (l_3)
- For each task, a nonlinear transformation maps l_2 into the task-specific representation: l_3 = tanh(W_2^t · l_2), where t denotes the task
- Model compactness results
  - Compression from a 500K-dim input to the shared 300-dim semantic vector l_2
  - Multi-task DNN takes < 150KB in memory; an SVM using word n-grams takes > 200MB
  - Easy to add new domains, small memory footprint, fast runtime
Task-Specific Output Layers (P)
- Query classification:
  - Q_{C_1} = l_3 = tanh(W_2^{t=C_1} · l_2)
  - P(C_1|Q) = sigmoid(W_3^{t=C_1} · Q_{C_1})
- Web search ranking:
  - Q and D are mapped into task-specific representations Q_{S_q} and D_{S_d}
  - Relevance score is computed by cosine similarity:
    R(Q, D) = cos(Q_{S_q}, D_{S_d}) = (Q_{S_q}^T D_{S_d}) / (||Q_{S_q}|| ||D_{S_d}||)
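Putting the layers together, a hedged sketch of one forward pass; the weight matrices and the task registry are hypothetical placeholders for the learned parameters, and bias terms are omitted to match the slide notation:

```python
import numpy as np

def forward(l1, W1, task_params, task):
    """Shared layers followed by one task-specific head (sketch)."""
    l2 = np.tanh(W1 @ l1)                        # shared semantic representation (300-dim)
    l3 = np.tanh(task_params[task]["W2"] @ l2)   # task-specific representation
    if task_params[task]["type"] == "classification":
        W3_t = task_params[task]["W3"]
        return 1.0 / (1.0 + np.exp(-(W3_t @ l3)))   # P(C|Q)
    return l3   # for ranking, the Q and D vectors are compared by cosine similarity
```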
The Training Procedure: Mini-Batch SGD
- Mini-batches from the different tasks are interleaved; each SGD update uses one mini-batch from one task and updates the shared plus that task's own parameters
- Query classification tasks use a cross-entropy loss
- The web search ranking task uses a pair-wise rank loss
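A minimal sketch of this alternating schedule, assuming a hypothetical `model` object that exposes `loss_and_grads` and `sgd_step` helpers:

```python
import random

def train_epoch(task_batches, model, lr=0.1):
    """One epoch of multi-task mini-batch SGD.

    `task_batches` maps each task name to its list of mini-batches; batches from
    all tasks are shuffled together so the tasks are interleaved during training.
    """
    batches = [(t, b) for t, bs in task_batches.items() for b in bs]
    random.shuffle(batches)
    for task, batch in batches:
        loss, grads = model.loss_and_grads(task, batch)   # cross-entropy or rank loss
        model.sgd_step(grads, lr)                         # shared + task-specific update
```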
Pair-Wise Rank Loss for Web Search
- Consider a query Q and two documents D+ and D-
- Assume D+ is more relevant than D- to Q
- sim_θ(Q, D) is the cosine similarity of Q and D in the semantic space, mapped by a neural network parameterized by θ
- Δ = sim_θ(Q, D+) - sim_θ(Q, D-); we want to maximize Δ
- L(Δ; θ) = log(1 + exp(-γΔ))
[Figure: plot of the loss L(Δ; θ) as a function of Δ]
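The loss itself is straightforward to compute; γ below is a scaling hyper-parameter whose value here is only illustrative:

```python
import numpy as np

def pairwise_rank_loss(sim_pos, sim_neg, gamma=10.0):
    """L(Delta; theta) = log(1 + exp(-gamma * Delta)), Delta = sim(Q,D+) - sim(Q,D-)."""
    delta = sim_pos - sim_neg
    return np.log1p(np.exp(-gamma * delta))   # small when D+ scores well above D-

print(pairwise_rank_loss(0.8, 0.3))   # well-ordered pair -> small loss
print(pairwise_rank_loss(0.3, 0.8))   # mis-ordered pair  -> large loss
```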
Experimental Evaluation Metrics
- AUC scores for query classification
- NDCG scores for web search ranking
Query Classification AUC Results
- MT-DNN > DNN: usefulness of the multi-task objective over a single-task objective
- DNN/MT-DNN > SVM-Letter with the same input l_1: importance of learning a semantic representation l_2
- DNN/MT-DNN > SVM-Word: power of deep learning
Web Search NDCG Results
Domain Adaptation on Query Classification
- To add a new task, how much training data must be labeled?
- Experiment design:
  - Select one query classification task t; train the MT-DNN on the remaining tasks to obtain a semantic representation (l_2)
  - With l_2 fixed, train an SVM on the training data of t using varying amounts of labels
  - Evaluate the AUC on the test data of t
- Compare 3 SVM classifiers trained on different feature vectors (see the sketch below):
  - Semantic representation (l_2)
  - Word n-grams, n = 1, 2, 3
  - Letter-3-grams
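A sketch of this adaptation recipe, assuming the l_2 features have already been extracted by the frozen MT-DNN; scikit-learn's LinearSVC stands in for the SVM used in the experiments:

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

def adapt_and_evaluate(l2_train, y_train, l2_test, y_test, n_labels):
    """Train an SVM on n_labels examples of the new task's l_2 features,
    then report AUC on the held-out test set."""
    clf = LinearSVC()
    clf.fit(l2_train[:n_labels], y_train[:n_labels])
    return roc_auc_score(y_test, clf.decision_function(l2_test))
```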
Domain Adaptation in Query Classification
- Using l_2 features, only small amounts of training labels are needed
- l_2 features are universally useful across domains/tasks
Conclusion
- Learning semantic representations using a multi-task DNN
  - Combines tasks as disparate as classification and ranking
  - Consistently outperforms strong baselines
  - Leads to a compact model
  - Facilitates domain adaptation using the learned representations
- Are the learned representations really semantic?
  - What a single-task DNN learns are hidden features useful for that particular task
  - Semantic representations are universal in that they are useful for multiple tasks
  - Multi-task DNN is a way to learn universal, semantic representations
Thanks! Q&A