Representation Learning using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval


Representation Learning using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval. Xiaodong Liu 1,2, Jianfeng Gao 1, Xiaodong He 1, Li Deng 1, Kevin Duh 2, Ye-Yi Wang 1. 1 Microsoft Research, USA; 2 Nara Institute of Science and Technology, Japan

Learning Vector-Space Representations: Why? Significant accuracy gains in NLP tasks [Collobert+ 11]; more compact models that are easier to train and generalize better. Existing learning methods are not optimal: unsupervised objectives [Mikolov+ 11] are sub-optimal for the tasks of interest, and supervised objectives on a single task [Socher+ 13] are constrained by limited amounts of training data. Our solution is inspired by multi-task learning [Caruana 97].

Multi-Task Deep Neural Nets for Representation Learning. Leverage supervised data from many (related) tasks to reduce overfitting to a specific task and make the learned representations universal across tasks. Combine tasks as disparate as semantic query classification and semantic web search. Large-scale experiments show higher accuracies on multiple tasks, more compact models, and easy adaptation to new tasks/domains.

The Query Classification Task. Given a search query Q, e.g., "denver sushi downtown", identify its domain C, e.g., Restaurant, Hotel, Nightlife, or Flight. A search engine can then tailor the interface and results to provide a richer, personalized user experience.

Problem Formulation. For each domain C, build a binary classifier. Input: represent a query Q as a vector of features x = [x_1, ..., x_n]^T. Output: y = P(1|Q, C); Q is labeled C if P(1|Q, C) > 0.5. Input feature vector, e.g., a bag-of-words vector, regards words as atomic symbols (denver, sushi, downtown). Each word is represented as a one-hot vector [0, ..., 0, 1, 0, ..., 0]^T, and the bag-of-words vector is the sum of the one-hot vectors. Other (better) features: n-grams, phrases, (learned) topics, etc. How do we construct optimal feature vectors for queries?
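A minimal sketch of the bag-of-words input described above, using a hypothetical toy vocabulary (not from the paper): each word is a one-hot vector and the query vector is their sum.

```python
import numpy as np

# Hypothetical toy vocabulary; in practice this is the full query vocabulary.
vocab = {"denver": 0, "sushi": 1, "downtown": 2, "hotel": 3}

def one_hot(word, vocab):
    """Return the one-hot vector [0, ..., 0, 1, 0, ..., 0]^T for a word."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def bag_of_words(query, vocab):
    """Bag-of-words vector = sum of the one-hot vectors of the query's words."""
    return sum(one_hot(w, vocab) for w in query.split() if w in vocab)

x = bag_of_words("denver sushi downtown", vocab)
print(x)  # [1. 1. 1. 0.]
```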

The Web Search Ranking Task. Example queries (Q) to be matched against documents (D): "cold home remedy", "cold remeedy" (misspelled), "flu treatment", "how to deal with stuffy nose".

Semantic Matching between Q and D (ordered by R&D progress). Fuzzy keyword matching: Q: "cold home remedy", D: "best home remedies for cold and flu". Spelling correction: Q: "cold remeedies", D: "best home remedies for cold and flu". Query alteration/expansion: Q: "flu treatment", D: "best home remedies for cold and flu". Query/document semantic matching: Q: "how to deal with stuffy nose", D: "best home remedies for cold and flu".

Problem Formulation. Given a query Q and a list of candidate documents D_i, i = 1..N, rank the D_i according to their relevance to Q. Represent Q and D as feature vectors, where the features are bags of words, phrases, (learned) topics, etc. Relevance is measured by the cosine similarity of the feature vectors of Q and D. How do we construct optimal feature vectors for queries and documents?

A DNN for Classification and a DSSM for Ranking. A classifier/ranker uses the hidden features as input. Feature generation: project raw input features (bag of words) to hidden features (topics). Deep Structured Semantic Model (DSSM) [Huang+ 13].

The Proposed Multi-Task DNN Model (architecture figure).

Shared Layers (l_1 and l_2). Word hash layer (l_1): control the dimensionality of the input using letter 3-grams, e.g., cat -> #cat# -> #-c-a, c-a-t, a-t-#. There are only ~50K letter-trigrams in English, so there is no out-of-vocabulary (OOV) issue: OOV words can still be represented by their letter 3-grams, and spelling variations of the same word have similar representations. Shared semantic-representation layer (l_2): captures cross-task semantic characteristics for arbitrary text (Q or D), with l_2 = tanh(W_1 · l_1).
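A small sketch of the letter-trigram word hashing step, following the #cat# example above; the trigram inventory (`trigram_index`) is a hypothetical dictionary that would be fixed in advance (roughly 50K trigrams for English).

```python
from collections import Counter

def letter_trigrams(word):
    """Wrap a word in boundary markers and return its letter 3-grams,
    e.g. 'cat' -> '#cat#' -> ['#ca', 'cat', 'at#']."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_hash_vector(text, trigram_index):
    """Count-of-letter-trigrams vector for a query or document (layer l_1).
    `trigram_index` maps each known trigram to a dimension."""
    counts = Counter(tg for w in text.lower().split() for tg in letter_trigrams(w))
    v = [0.0] * len(trigram_index)
    for tg, c in counts.items():
        if tg in trigram_index:  # unknown trigrams are simply dropped
            v[trigram_index[tg]] = float(c)
    return v

# Spelling variants share most trigrams, so their l_1 vectors are similar.
print(letter_trigrams("remedy"))   # ['#re', 'rem', 'eme', 'med', 'edy', 'dy#']
print(letter_trigrams("remeedy"))  # ['#re', 'rem', 'eme', 'mee', 'eed', 'edy', 'dy#']
```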

Task-Specific Representation (l_3). For each task t, a nonlinear transformation maps l_2 into the task-specific representation via l_3 = tanh(W_2^t · l_2). Model compactness: thanks to the compression from the 500K-dim input to the shared 300-dim semantic vector l_2, the multi-task DNN takes < 150 KB in memory, whereas an SVM using word n-grams takes > 200 MB. It is easy to add new domains, with a small memory footprint and fast runtime.

Task-Specific Output Layers (P). Query classification: Q is mapped to Q^{C_1} = tanh(W_2^{t=C_1} · l_2), and P(C_1|Q) = sigmoid(W_3^{t=C_1} · Q^{C_1}). Web search ranking: Q and D are mapped into the task representations Q^{S_q} and D^{S_d}, and the relevance score is computed as their cosine similarity, R(Q, D) = cos(Q^{S_q}, D^{S_d}).
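A minimal NumPy sketch of the forward pass just described, with hypothetical toy layer sizes and randomly initialized weights (the real model uses a ~500K-dim letter-trigram input and a 300-dim shared layer): shared l_1 -> l_2, a task-specific layer, a sigmoid output for query classification, and cosine similarity for ranking.

```python
import numpy as np

rng = np.random.default_rng(0)
D_L1, D_L2, D_L3 = 1000, 300, 128  # toy sizes for l_1, l_2, task-specific layer

W1 = rng.normal(scale=0.01, size=(D_L2, D_L1))       # shared layer
W2_cls = rng.normal(scale=0.01, size=(D_L3, D_L2))   # task-specific: classification
W3_cls = rng.normal(scale=0.01, size=(D_L3,))        # classification output weights
W2_rank = rng.normal(scale=0.01, size=(D_L3, D_L2))  # task-specific: ranking

def shared_repr(l1):
    """Shared semantic representation l_2 = tanh(W_1 · l_1)."""
    return np.tanh(W1 @ l1)

def classify(l1_q):
    """P(C|Q) = sigmoid(W_3^t · tanh(W_2^t · l_2)) for one domain classifier."""
    l3 = np.tanh(W2_cls @ shared_repr(l1_q))
    return 1.0 / (1.0 + np.exp(-(W3_cls @ l3)))

def relevance(l1_q, l1_d):
    """Ranking score: cosine similarity of the task-specific representations."""
    q = np.tanh(W2_rank @ shared_repr(l1_q))
    d = np.tanh(W2_rank @ shared_repr(l1_d))
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

l1_q = rng.random(D_L1)  # stand-in for a word-hashed query vector
l1_d = rng.random(D_L1)  # stand-in for a word-hashed document vector
print(classify(l1_q), relevance(l1_q, l1_d))
```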

The Training Procedure: Mini-Batch SGD. In each iteration, a task and a mini-batch of its training data are selected, and the shared and task-specific parameters are updated using that task's objective: the cross-entropy loss for query classification and the pair-wise rank loss for web search ranking.
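A high-level sketch of such an alternating training loop, not the authors' exact procedure; the task objects, their `sample_minibatch`, `loss`, and `num_minibatches` members, and `model.sgd_step` are all hypothetical helpers standing in for the alternation pattern described above.

```python
import random

def train_multitask(model, tasks, num_epochs=10):
    """Alternate mini-batch SGD updates across tasks.

    `tasks` is a hypothetical list of objects, each exposing:
      - sample_minibatch(): one mini-batch of its own training data
      - loss(model, batch): cross-entropy for classification tasks,
                            pair-wise rank loss for ranking tasks
      - num_minibatches:    how many updates it contributes per epoch
    `model.sgd_step(loss)` is assumed to update both the shared layer (W_1)
    and the selected task's private layers (W_2^t, W_3^t).
    """
    for epoch in range(num_epochs):
        # Interleave tasks so the shared layers see all of them each epoch.
        schedule = [t for t in tasks for _ in range(t.num_minibatches)]
        random.shuffle(schedule)
        for task in schedule:
            batch = task.sample_minibatch()
            loss = task.loss(model, batch)
            model.sgd_step(loss)  # backprop through shared + task-specific layers
```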

Pair-Wise Rank Loss for Web Search. Consider a query Q and two documents D+ and D-, and assume D+ is more relevant than D- to Q. sim_θ(Q, D) is the cosine similarity of Q and D in the semantic space, mapped by a neural network parameterized by θ. Let Δ = sim_θ(Q, D+) − sim_θ(Q, D-); we want to maximize Δ, so we minimize the loss L(Δ; θ) = log(1 + exp(−γΔ)), which decays toward zero as Δ grows (loss curve figure omitted).
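A one-function sketch of this pair-wise rank loss; `gamma` is a hypothetical scaling constant, and the inputs are assumed to be the semantic vectors of Q, D+ and D-. It simply evaluates L(Δ) = log(1 + exp(−γΔ)) for the similarity gap Δ defined above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_rank_loss(q_vec, d_pos_vec, d_neg_vec, gamma=10.0):
    """L(Delta) = log(1 + exp(-gamma * Delta)) with
    Delta = sim(Q, D+) - sim(Q, D-): small loss when D+ already ranks above D-."""
    delta = cosine(q_vec, d_pos_vec) - cosine(q_vec, d_neg_vec)
    return math.log1p(math.exp(-gamma * delta))

# Example: D+ is much closer to Q than D-, so the loss is near zero.
print(pairwise_rank_loss([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]))
```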

Experimental Evaluation Metrics. AUC scores for query classification; NDCG scores for web search ranking.

Query Classification AUC Results. MT-DNN > DNN: usefulness of the multi-task objective over a single-task objective. DNN/MT-DNN > SVM-Letter with the same input l_1: importance of learning the semantic representation l_2. DNN/MT-DNN > SVM-Word: power of deep learning.

Web Search NDCG Results.

Domain Adaptation on Query Classification. To add a new task, how much training data must be labeled? Experiment design: select one query classification task t and train the MT-DNN on the remaining tasks to obtain a semantic representation (l_2); given the fixed l_2, train an SVM on the training data of t using varying amounts of labels; evaluate the AUC on the test data of t. Compare three SVM classifiers trained using different feature vectors: the semantic representation (l_2), word n-grams (n = 1, 2, 3), and letter 3-grams. A sketch of this setup follows below.
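A scikit-learn sketch of the experiment design above, under stated assumptions: `semantic_features` is a hypothetical function mapping a query to the fixed 300-dim l_2 vector produced by the MT-DNN trained on the other tasks, and the label budgets are illustrative, not the paper's.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

def adaptation_curve(train_queries, train_labels, test_queries, test_labels,
                     semantic_features, label_budgets=(50, 200, 800, 3200)):
    """Train an SVM on fixed l_2 features with varying amounts of labels
    and report the test AUC for each label budget (new-task adaptation)."""
    X_test = np.array([semantic_features(q) for q in test_queries])
    results = {}
    for n in label_budgets:
        X_train = np.array([semantic_features(q) for q in train_queries[:n]])
        clf = LinearSVC().fit(X_train, train_labels[:n])
        scores = clf.decision_function(X_test)
        results[n] = roc_auc_score(test_labels, scores)
    return results
```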

Domain Adaptation in Query Classification. Using the l_2 features, only small amounts of training labels are needed; the l_2 features are universally useful across domains/tasks.

Conclusion. Learning semantic representations using a multi-task DNN combines tasks as disparate as classification and ranking, consistently outperforms strong baselines, leads to a compact model, and facilitates domain adaptation using the learned representations. Are the learned representations really semantic? What a DNN learns are hidden features that are useful for a particular task; semantic representations are universal in that they are useful for multiple tasks; the multi-task DNN is a way to learn such universal, semantic representations.

Thanks! Q&A