Ruslan Salakhutdinov and Geoffrey Hinton. University of Toronto, Machine Learning Group. IRGM Workshop, July 2007

Transcription:

SEMANTIC HASHING. Ruslan Salakhutdinov and Geoffrey Hinton. University of Toronto, Machine Learning Group. IRGM Workshop, July 2007

Existing Methods. One of the most popular algorithms used in practice for document retrieval tasks is TF-IDF. However: it computes document similarity directly in the word-count space, which can be slow for large vocabularies; it assumes that the counts of different words provide independent evidence of similarity; and it makes no use of semantic similarities between words. To overcome these drawbacks, models that capture low-dimensional, latent representations have been proposed and successfully applied in the domain of information retrieval. One such simple and widely used method is Latent Semantic Analysis (LSA), which extracts low-dimensional semantic structure using SVD to get a low-rank approximation of the word-document co-occurrence matrix.
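
To make the two baselines concrete, here is a minimal NumPy sketch of TF-IDF similarity computed directly in word-count space and of LSA codes obtained from a truncated SVD. The toy count matrix and the 2-dimensional rank are made-up illustrations, not the setup used in the talk.

import numpy as np

# Toy word-document count matrix: rows = documents, columns = vocabulary terms.
counts = np.array([[2, 0, 1, 0],
                   [1, 1, 0, 0],
                   [0, 0, 3, 2],
                   [0, 1, 2, 1]], dtype=float)

# TF-IDF: weight each term count by the log inverse document frequency.
df = (counts > 0).sum(axis=0)                     # document frequency per term
idf = np.log(counts.shape[0] / df)
tfidf = counts * idf

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Similarity computed in the full word-count space (slow for large vocabularies).
print("TF-IDF similarity of doc 0 and doc 1:", cosine(tfidf[0], tfidf[1]))

# LSA: a rank-k SVD of the weighted word-document matrix gives low-dimensional
# document codes that capture pairwise word co-occurrence structure.
k = 2
U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
codes = U[:, :k] * S[:k]                          # k-dimensional document codes
print("LSA similarity of doc 0 and doc 1:", cosine(codes[0], codes[1]))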

Drawbacks of Existing Methods. LSA is a linear method, so it can only capture pairwise correlations between words. We need something more powerful. Numerous methods, in particular probabilistic versions of LSA, were introduced in the machine learning community. These models can be viewed as graphical models in which a single layer of hidden topic variables has directed connections to variables that represent word counts. There are limitations on the types of structure that can be represented efficiently by a single layer of hidden variables. Recently, Hinton et al. discovered a way to perform fast, greedy learning of deep, directed belief nets one layer at a time. We will use this idea to build a network with multiple hidden layers and millions of parameters, and show that it can discover latent representations that work much better.

Learning multiple layers. A single layer of features generally cannot perfectly model the structure in the data. We will use a Restricted Boltzmann Machine (RBM), a two-layer undirected graphical model, as our building block. Perform greedy, layer-by-layer learning: 1. Learn and freeze W1. 2. Treat the learned features, driven by the training data, as if they were data. 3. Learn and freeze W2. 4. Greedily learn as many layers of features as desired. Each layer of features captures strong, high-order correlations between the activities of units in the layer below.
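
A minimal sketch of this greedy, layer-by-layer procedure using binary RBMs trained with one step of contrastive divergence (CD-1). The layer sizes, learning rate, number of epochs, and random binary data are placeholders, not the settings used in the talk.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.05):
    """Train a binary-binary RBM with CD-1; return (weights, visible bias, hidden bias)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(data @ W + b_h)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step (reconstruct the visibles, re-infer the hiddens).
        v_prob = sigmoid(h_sample @ W.T + b_v)
        h_prob_neg = sigmoid(v_prob @ W + b_h)
        # CD-1 parameter updates.
        W += lr * (data.T @ h_prob - v_prob.T @ h_prob_neg) / len(data)
        b_v += lr * (data - v_prob).mean(axis=0)
        b_h += lr * (h_prob - h_prob_neg).mean(axis=0)
    return W, b_v, b_h

# Placeholder binary training data (e.g. word-presence vectors).
data = (rng.random((100, 50)) < 0.2).astype(float)

# 1. Learn and freeze W1.
W1, _, b_h1 = train_rbm(data, n_hidden=25)
# 2. Treat the learned feature activations, driven by the data, as if they were data.
features1 = sigmoid(data @ W1 + b_h1)
# 3. Learn and freeze W2 on those features; repeat for as many layers as desired.
W2, _, b_h2 = train_rbm(features1, n_hidden=10)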

RBMs for count data. Hidden units remain binary and the visible word counts are modeled by the Constrained Poisson Model. The energy is defined as:

E(\mathbf{v}, \mathbf{h}) = -\sum_i \lambda_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i h_j w_{ij} + \sum_i v_i \log(Z/N) + \sum_i \log \Gamma(v_i + 1)

Conditional distributions over hidden and visible units are:

p(h_j = 1 \mid \mathbf{v}) = \frac{1}{1 + \exp(-b_j - \sum_i w_{ij} v_i)}

p(v_i = n \mid \mathbf{h}) = \mathrm{Poisson}\!\left(n, \; \frac{\exp(\lambda_i + \sum_j h_j w_{ij})}{\sum_k \exp(\lambda_k + \sum_j h_j w_{kj})}\, N \right)

where \lambda_i and b_j are the visible and hidden biases, Z = \sum_k \exp(\lambda_k + \sum_j h_j w_{kj}), and N = \sum_i v_i is the total length of the document.
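
A sketch of sampling from the two conditionals above, assuming a tiny placeholder vocabulary and random parameters. It shows the binary hidden units driven by the observed counts, and the "constrained" Poisson visibles whose rates are a softmax scaled by the document length N.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup: 6-word vocabulary, 4 binary hidden units; parameters are placeholders.
n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
lam = np.zeros(n_vis)       # visible biases (lambda_i)
b_h = np.zeros(n_hid)       # hidden biases (b_j)

v = np.array([3., 0., 1., 2., 0., 4.])   # word counts for one document
N = v.sum()                              # total document length

# p(h_j = 1 | v): standard logistic hidden units driven by the counts.
p_h = sigmoid(b_h + v @ W)
h = (rng.random(n_hid) < p_h).astype(float)

# p(v_i = n | h): a constrained Poisson whose rates are a softmax over words,
# scaled so the expected counts sum to the document length N.
logits = lam + W @ h
rates = N * np.exp(logits) / np.exp(logits).sum()
v_sample = rng.poisson(rates).astype(float)

print("hidden probabilities:", p_h)
print("Poisson rates sum to N =", rates.sum())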

The Big Picture. [Figure: the stacked layers are assembled into a deep autoencoder with a 32-unit code layer to which Gaussian noise is added; the network is trained by recursive pretraining followed by fine-tuning.]
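
One way to realize this picture, sketched under assumed layer sizes: the pretrained weights are unrolled into an encoder-decoder pair, and Gaussian noise is added to the inputs of the code layer during fine-tuning. The layer sizes, noise scale, and zero decoder biases below are placeholders, not the exact architecture from the slide.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder pretrained weights (taken from the greedy stage in practice).
sizes = [2000, 500, 250, 32]                 # illustrative sizes; 32-unit code layer
Ws = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

def encode(v, noise_std=1.0):
    """Encoder half of the unrolled autoencoder; Gaussian noise on the inputs of the
    code layer (illustrative scale) pushes the code units towards 0 or 1."""
    h = v
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = sigmoid(h @ W + b)
    code_input = h @ Ws[-1] + bs[-1]
    code_input = code_input + noise_std * rng.standard_normal(code_input.shape)
    return sigmoid(code_input)

def decode(code):
    """Decoder half: the transposed encoder weights, as after unrolling (zero biases here)."""
    h = code
    for W, b in zip(reversed(Ws), reversed([np.zeros(n) for n in sizes[:-1]])):
        h = sigmoid(h @ W.T + b)
    return h

doc = (rng.random(2000) < 0.05).astype(float)    # placeholder input vector
reconstruction = decode(encode(doc))
print(reconstruction.shape)                      # (2000,)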

Reuters Corpus: Learning 2-D code space. [Figure: Autoencoder 2-D topic space vs. LSA 2-D topic space, with clusters labelled Interbank Markets, European Community Monetary/Economic, Energy Markets, Disasters and Accidents, Leading Economic Indicators, Accounts/Earnings, Legal/Judicial, and Government Borrowings.] We use a 2000-500-250-125-2 autoencoder to convert test documents into a two-dimensional code. The Reuters Corpus Volume II contains 804,414 newswire stories (randomly split into 402,207 training and 402,207 test). We used a simple bag-of-words representation: each article is represented as a vector containing the counts of the 2000 most frequently used words in the training dataset.
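
A small sketch of the bag-of-words preprocessing described above: build the vocabulary from the most frequent training words (2000 in the experiment) and turn each article into a count vector. The toy corpus here is invented for illustration.

import numpy as np
from collections import Counter

# Placeholder corpus; in the experiment the vocabulary is the 2000 most frequent
# words of the Reuters training set and each article becomes a count vector.
train_docs = ["oil prices rose sharply", "the bank raised interest rates",
              "oil exports fell", "interest in oil markets rose"]

word_counts = Counter(w for doc in train_docs for w in doc.split())
vocab = [w for w, _ in word_counts.most_common(2000)]
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(doc):
    """Map a document to a vector of word counts over the fixed vocabulary."""
    v = np.zeros(len(vocab))
    for w in doc.split():
        if w in index:
            v[index[w]] += 1
    return v

print(bag_of_words("oil prices and interest rates rose"))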

Results for 10-D codes. We use the cosine of the angle between two codes as a measure of similarity. Precision-recall curves when the 10-D code of a query document from the test set is used to retrieve other test set documents, averaged over all 402,207 possible queries. [Figure: precision-recall curves for Autoencoder 10-D, LSA 10-D, LSA 50-D, and Autoencoder 10-D prior to fine-tuning.]
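
A sketch of the evaluation described above: rank test documents by the cosine of the angle between code vectors, then read off precision at fixed recall levels. The 10-D codes and relevance labels are random placeholders standing in for the autoencoder or LSA codes and the Reuters topic labels.

import numpy as np

rng = np.random.default_rng(0)

def cosine_similarity(query, codes):
    codes_n = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    q_n = query / np.linalg.norm(query)
    return codes_n @ q_n

# Placeholder 10-D codes for a query and a test collection, plus relevance labels
# (in the experiment a document counts as relevant if it shares the query's topic).
query_code = rng.standard_normal(10)
doc_codes = rng.standard_normal((5000, 10))
relevant = rng.random(5000) < 0.05

# Rank documents by cosine similarity to the query code.
order = np.argsort(-cosine_similarity(query_code, doc_codes))
rel_sorted = relevant[order]

# Precision when the ranked list is cut at each recall level.
cum_rel = np.cumsum(rel_sorted)
recall = cum_rel / relevant.sum()
precision = cum_rel / np.arange(1, len(rel_sorted) + 1)
for r in [0.001, 0.002, 0.004, 0.008, 0.016]:
    k = np.searchsorted(recall, r)
    print(f"precision at recall {r:.3%}: {precision[k]:.3f}")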

Semantic Hashing. [Figure: the deep autoencoder again, with Gaussian noise on the 32-unit code layer, recursive pretraining, and fine-tuning.] Learn to map documents into a semantic 32-D binary code and use these codes as memory addresses. We have the ultimate retrieval tool: given a query document, compute its 32-bit address and retrieve all of the documents stored at the similar addresses with no search at all.
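
A sketch of one way to implement this lookup: threshold the 32-D codes to bits, pack them into a 32-bit address, store document ids in a table keyed by address, and collect candidates from every address within a small Hamming ball of the query's address. The random codes, collection size, and radius are placeholder assumptions, not the talk's exact setup.

import numpy as np
from collections import defaultdict
from itertools import combinations

rng = np.random.default_rng(0)

def to_address(code):
    """Threshold a real-valued code at 0.5 and pack the bits into one integer address."""
    addr = 0
    for bit in (code > 0.5).astype(int):
        addr = (addr << 1) | int(bit)
    return addr

# Placeholder 32-D codes for the collection (autoencoder outputs in practice).
doc_codes = rng.random((20_000, 32))

# Build the hash table: address -> list of document ids stored there.
table = defaultdict(list)
for doc_id, code in enumerate(doc_codes):
    table[to_address(code)].append(doc_id)

def retrieve(query_code, radius=2):
    """Return all documents whose address is within `radius` bit flips of the query's."""
    base = to_address(query_code)
    candidates = list(table.get(base, []))
    for r in range(1, radius + 1):
        for bits_to_flip in combinations(range(32), r):
            addr = base
            for b in bits_to_flip:
                addr ^= 1 << b
            candidates.extend(table.get(addr, []))
    return candidates

shortlist = retrieve(doc_codes[0], radius=2)
print(len(shortlist), "candidates found without any search over the collection")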

The Main Idea of Semantic Hashing. [Figure: a document is mapped by a function f to a memory address, and semantically similar documents end up stored at nearby addresses in memory.]

Semantic Hashing. [Figure: left, a Reuters 2-D embedding of 20-bit codes with clusters labelled European Community Monetary/Economic, Disasters and Accidents, Energy Markets, Government Borrowing, and Accounts/Earnings; right, precision-recall curves comparing TF-IDF, TF-IDF using a 20-bit short-list, and Locality-Sensitive Hashing.] We used a simple C implementation on the Reuters dataset (402,207 training and 402,207 test documents). For a given query, it takes about 0.5 milliseconds to create a short-list of about 3,000 semantically similar documents. It then takes 10 milliseconds to retrieve the top few matches from that short-list using TF-IDF, and this is more accurate than full TF-IDF. Locality-Sensitive Hashing takes about 500 milliseconds and is less accurate. Our method is 50 times faster than the fastest existing method and is more accurate.
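
A sketch of the two-stage pipeline these timings refer to: take a short-list of a few thousand candidates (produced by a semantic-hash lookup such as the one sketched earlier) and rank only those candidates with TF-IDF cosine similarity, instead of scoring the whole collection. All vectors, sizes, and ids below are random placeholders rather than the Reuters data.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder TF-IDF vectors for the collection and a short-list of candidate ids
# (in the pipeline above the short-list comes from the 20-bit semantic-hash lookup).
tfidf = rng.random((10_000, 500))
shortlist = rng.choice(10_000, size=3000, replace=False)
query = rng.random(500)

# Rank only the ~3,000 short-listed documents with TF-IDF cosine similarity,
# instead of scoring every document in word-count space.
cand = tfidf[shortlist]
scores = (cand @ query) / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query) + 1e-12)
top = shortlist[np.argsort(-scores)[:10]]
print("top 10 reranked documents:", top)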