From Word Embeddings To Document Distances. Matt J. Kusner Yu Sun Nicholas I. Kolkin Kilian Q. Weinberger

1 From Word Embeddings To Document Distances Matt J. Kusner Yu Sun Nicholas I. Kolkin Kilian Q. Weinberger

2 Goal: a distance between two documents?

3 Applications: document classification, multi-lingual document matching, song identification

4 Word Embedding: word2vec [Mikolov et al., 2013], different from [Collobert & Weston, 2008] and [Mnih & Hinton, 2009]. word2vec is not deep! Trained on 100 billion words; 3 million different words embedded in $\mathbb{R}^d$.

5 Word Embedding: word2vec [Mikolov et al., 2013]. $X \in \mathbb{R}^{d \times n}$ holds the embeddings $x_i, x_j, \dots$ of $n$ words. The distance between words $i$ and $j$, $\|x_i - x_j\|_2$, is roughly their dissimilarity.
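The word-to-word distance on this slide can be illustrated with a minimal sketch. The 2-D vectors and the `embeddings` dictionary below are made up for illustration; real word2vec vectors have hundreds of dimensions and are learned from data.

```python
import math

# Hypothetical 2-D embeddings, chosen only so the arithmetic is easy to follow.
embeddings = {
    "cat": [1.0, 0.0],
    "dog": [1.0, 1.0],
    "car": [4.0, 4.0],
}

def word_distance(a, b):
    """Euclidean distance ||x_a - x_b||_2 between two word vectors."""
    return math.dist(embeddings[a], embeddings[b])

print(word_distance("cat", "dog"))  # 1.0 (close in this toy space)
print(word_distance("cat", "car"))  # 5.0 (far apart in this toy space)
```

A smaller distance is read as "more similar"; nothing else about the embedding is needed to define the document distances that follow.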

6 Word Embedding: word2vec [Mikolov et al., 2013]. [figure: Man, King, Woman, Queen plotted in $\mathbb{R}^d$]
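The figure presumably refers to the well-known observation that word2vec offsets capture relations such as king - man + woman ≈ queen. A small sketch of that vector arithmetic follows; the `vectors` table and the `analogy` helper are hypothetical and the coordinates are invented so that the relation holds exactly.

```python
import math

# Invented 2-D vectors; first axis ~ "royalty", second axis ~ "gender".
vectors = {
    "man": (1.0, 0.0),
    "woman": (1.0, 1.0),
    "king": (3.0, 0.0),
    "queen": (3.0, 1.0),
    "apple": (0.0, 5.0),
}

def analogy(a, b, c):
    """Return the word closest to x_b - x_a + x_c, excluding a, b and c."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = (w for w in vectors if w not in (a, b, c))
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

print(analogy("man", "king", "woman"))
```

With learned embeddings the match is only approximate, so the nearest-neighbor step matters more than it does in this toy table.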

7 How can we leverage these high quality word embeddings to compute document distances?

8 Word Mover's Distance [figure: words scattered in the embedding space: cat, run, tree, dog, steam, ocean, car, fly, race, frog, pattern, oat, pickle, rock, up, rocket, win, bunny, baby]

9 Goal?

10-11 Word Mover's Distance [figure: documents $d$ and $d'$ in the word embedding space $\mathbb{R}^d$; labeled word: media]

12 Word Mover's Distance = minimum word distance to transform the mass of $d$ into $d'$ [same figure]

13-15 Word Mover's Distance [same figure, continued]

16 Word Mover's Distance [figure: flow $T_{ij}$ from word $i$ of $d$ to word $j$ of $d'$]

17-19 Word Mover's Distance [figure: $d$ and $d'$ in $\mathbb{R}^d$; label: 1/2]
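The slides speak of the "mass" of a document without defining it in the transcription. In the original paper the mass of word $i$ is its normalized bag-of-words (nBOW) weight, i.e. its count divided by the document length, which is consistent with the 1/2 label for a two-word document. A minimal sketch, with `nbow` as a hypothetical helper name:

```python
from collections import Counter

def nbow(tokens):
    """Map each distinct word to count / total, so the weights sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(nbow(["media", "press"]))
print(nbow(["a", "a", "b", "c"]))
```

Stop-word removal and tokenization are left out here; they would happen before this step.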

20 Word Mover's Distance: $$\mathrm{WMD}(d, d') \triangleq \min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij} \, \|x_i - x_j\|_2$$ [figure labels: 1/2, $x_i$, $x_j$, $x_k$]

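21 Word Mover's Distance: $$\mathrm{WMD}(d, d') \triangleq \min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij} \, \|x_i - x_j\|_2 \quad \text{s.t.} \quad \sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i, \qquad \sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j$$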
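To make the optimization concrete, here is a sketch that solves it exactly in one special case only: both documents have the same number $n$ of distinct words, each with mass $1/n$. In that case the feasible flows are scaled doubly stochastic matrices, so an optimum is attained at a one-to-one matching and brute force over permutations suffices. This is not the general solver used by the authors; `wmd_uniform` and the 2-D `vectors` are hypothetical.

```python
import itertools
import math

# Invented 2-D embeddings for four words.
vectors = {
    "cat": (0.0, 0.0),
    "car": (4.0, 0.0),
    "dog": (0.0, 3.0),
    "race": (4.0, 3.0),
}

def wmd_uniform(doc_a, doc_b, vectors):
    """Exact WMD when both documents have n distinct words of mass 1/n each."""
    n = len(doc_a)
    if len(doc_b) != n:
        raise ValueError("this special case needs documents of equal size")
    # Try every one-to-one matching and keep the cheapest total distance.
    best = min(
        sum(math.dist(vectors[doc_a[i]], vectors[doc_b[p[i]]]) for i in range(n))
        for p in itertools.permutations(range(n))
    )
    return best / n  # every matched pair carries mass 1/n

doc_a = ["cat", "car"]
doc_b = ["dog", "race"]
print(wmd_uniform(doc_a, doc_b, vectors))
```

Here the cheapest matching is cat→dog and car→race (distance 3 each), so the result is (3 + 3) / 2 = 3. For unequal masses a general linear-programming or transport solver is required, and the factorial cost of this brute force makes it unusable beyond tiny documents.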

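22 Remarks: $$\mathrm{WMD}(d, d') \triangleq \min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij} \, \|x_i - x_j\|_2 \quad \text{s.t.} \quad \sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i, \qquad \sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j$$ In CV this is the Earth Mover's Distance (EMD) [Rubner et al., 1998], where the flow is written $F_{ij}$ instead of $T_{ij}$. It is an old optimal transport problem [Monge, 1781].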

23 How well does WMD perform on document classification via k-nearest neighbors (k-nn)?
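A k-nn classifier only needs a document distance, so any of the distances in this talk can be plugged in. The sketch below shows the voting logic with a stand-in distance (Jaccard distance on word sets) that is deliberately not WMD; `knn_predict`, `jaccard_distance` and the toy training set are hypothetical.

```python
from collections import Counter

def jaccard_distance(doc_a, doc_b):
    """Stand-in document distance; WMD would be substituted here."""
    a, b = set(doc_a), set(doc_b)
    return 1.0 - len(a & b) / len(a | b)

def knn_predict(query, train, distance, k=3):
    """Majority vote over the k training documents closest to the query."""
    neighbors = sorted(train, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    (("goal", "match", "team"), "sport"),
    (("team", "coach", "goal"), "sport"),
    (("vote", "senate", "law"), "politics"),
    (("law", "vote", "campaign"), "politics"),
]

print(knn_predict(("coach", "team", "match"), train, jaccard_distance))
print(knn_predict(("senate", "vote", "campaign"), train, jaccard_distance))
```

Each prediction computes the distance to every training document, which is why the cost of a single distance evaluation dominates the running time.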

24 Classic Approaches: bag-of-words, TF-IDF [Salton & Buckley, 1988], LSI [Deerwester et al., 1990], LDA [Blei et al., 2003] [figure: example word counts (campaign, speech, Washington) and topic distributions over politics, sports, music and Civil War, with topic words guitar, soccer, Vicksburg, Madonna, football, Washington, speech]

25 Results: k-nn test error % [table: train inputs, BOW dim and k-nearest neighbor error per dataset; numeric entries not recoverable]
Datasets: bbcsport, twitter, recipe, ohsumed, classic, reuters, amazon, 20news
Methods: Okapi BM25 [Robertson & Walker, 1994], TF-IDF [Jones, 1972], BOW [Frakes & Baeza-Yates, 1992], Componential Counting Grid [Perina et al., 2013], mSDA [Chen et al., 2012], LDA [Blei et al., 2003], LSI [Deerwester et al., 1990], Word Mover's Distance
All hyper-parameters set with bayesopt.m [Gardner et al., 2014]

26 Results: k-nn average error w.r.t. BOW [bar chart: Okapi BM25, TF-IDF, BOW, CCG, mSDA, LDA, LSI, WMD; only one value label, 0.72, is recoverable]

27 Computational Complexity: $$\mathrm{WMD}(d, d') \triangleq \min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij} \, \|x_i - x_j\|_2 \quad \text{s.t.} \quad \sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i, \qquad \sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j$$ LP with $2n$ constraints: $O(n^3 \log n)$ [Pele & Werman, 2009]

28 Computational Complexity: $$\mathrm{WMD}(d, d') \triangleq \min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij} \, \|x_i - x_j\|_2 \quad \text{s.t.} \quad \sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i, \qquad \sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j$$ Approximations: [Rubner et al., 1998]; [Levina & Bickel, 2001]; [Grauman & Darrell, 2004]; [Shirdhonkar & Jacobs, 2008]

29-30 Approximation 1 [Rubner et al., 1998] [figure: documents $d$ and $d'$ in the word embedding space $\mathbb{R}^d$; labeled word: media]

31 Approximation 1: Word Centroid Distance [Rubner et al., 1998]: $$\mathrm{WCD}(d, d') \triangleq \|Xd - Xd'\|_2$$ computed in $O(nd)$ [figure: centroids $Xd$ and $Xd'$ in $\mathbb{R}^d$]
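The Word Centroid Distance compares only the weighted mean word vector of each document, which is what makes it cheap. A minimal sketch follows; `centroid`, `wcd`, the nBOW weights and the 2-D `vectors` are all hypothetical illustrations.

```python
import math

# Invented 2-D embeddings.
vectors = {
    "cat": (0.0, 0.0),
    "car": (4.0, 0.0),
    "dog": (0.0, 3.0),
    "race": (4.0, 3.0),
}

def centroid(weights, vectors):
    """Weighted mean X d of a document's word vectors (weights sum to 1)."""
    dim = len(next(iter(vectors.values())))
    return [
        sum(w * vectors[word][k] for word, w in weights.items())
        for k in range(dim)
    ]

def wcd(weights_a, weights_b, vectors):
    """Word Centroid Distance ||X d - X d'||_2."""
    return math.dist(centroid(weights_a, vectors), centroid(weights_b, vectors))

doc_a = {"cat": 0.5, "car": 0.5}   # centroid (2, 0)
doc_b = {"dog": 0.5, "race": 0.5}  # centroid (2, 3)
print(wcd(doc_a, doc_b, vectors))
```

In the original paper WCD is shown to be a lower bound on WMD, so it can be used to discard distant candidates before the exact distance is computed. Centroids can also be precomputed once per document.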

32 Faster Approximations [plots for a random test input, on twitter and amazon: distance (WCD, RWMD, WMD) against training input index]

33 Approximation 2: start from the WMD problem $$\min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij} \, \|x_i - x_j\|_2 \quad \text{s.t.} \quad \sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i, \qquad \sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j$$

34-35 Approximation 2: drop the second constraint: $$D_1 = \min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij} \, \|x_i - x_j\|_2 \quad \text{s.t.} \quad \sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i$$ Just a nearest-neighbor search!

36 Approximation 2: drop the first constraint instead: $$D_2 = \min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij} \, \|x_i - x_j\|_2 \quad \text{s.t.} \quad \sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j$$ Just a nearest-neighbor search!

37 Approximation 2: Relaxed Word Mover's Distance: $$\mathrm{RWMD}(d, d') \triangleq \max(D_1, D_2)$$ computed in $O(n^2 d)$
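With one constraint dropped, each word is free to send all of its mass to its nearest word in the other document, which is why each relaxed problem reduces to a nearest-neighbor search. The sketch below follows that reading; `relaxed_cost`, `rwmd`, the weights and the 2-D `vectors` are hypothetical.

```python
import math

# Invented 2-D embeddings.
vectors = {
    "cat": (0.0, 0.0),
    "car": (4.0, 0.0),
    "dog": (0.0, 3.0),
}

def relaxed_cost(weights_from, weights_to, vectors):
    """One-sided relaxation: every source word moves its whole mass
    to the closest word of the other document."""
    return sum(
        mass * min(math.dist(vectors[w], vectors[v]) for v in weights_to)
        for w, mass in weights_from.items()
    )

def rwmd(weights_a, weights_b, vectors):
    """Relaxed WMD: the larger (tighter) of the two one-sided relaxations."""
    d1 = relaxed_cost(weights_a, weights_b, vectors)
    d2 = relaxed_cost(weights_b, weights_a, vectors)
    return max(d1, d2)

doc_a = {"cat": 0.5, "car": 0.5}
doc_b = {"dog": 1.0}
print(relaxed_cost(doc_a, doc_b, vectors))  # 0.5 * 3 + 0.5 * 5
print(relaxed_cost(doc_b, doc_a, vectors))  # 1.0 * 3
print(rwmd(doc_a, doc_b, vectors))
```

Each relaxation removes a constraint from a minimization, so both values are lower bounds on WMD, and taking the maximum keeps the tighter one. The double loop over word pairs matches the quadratic dependence on $n$ stated on the slide.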

38-39 Faster Approximations [plots for a random test input, on twitter and amazon: distance (WCD, RWMD, WMD) against training input index]

40 Faster Approximations [bar chart: average knn error w.r.t. BOW; bars labeled WCD, RWMD c1, RWMD c2, RWMD, WMD; $D_1$ and $D_2$ annotated]

41 Other Embeddings

42 Conclusion. Word Mover's Distance: document distances from word embeddings. Very accurate, as it leverages high-quality word2vec embeddings [bar chart: average error w.r.t. BOW for Okapi BM25, TF-IDF, BOW, CCG, mSDA, LDA, LSI, WMD; one recoverable value label: 0.72]. Fast through approximations: WMD $O(n^3 \log n)$, WCD $O(nd)$, RWMD $O(n^2 d)$.

43 Code: Thank you. Questions?


Classification Key Concepts

Classification Key Concepts http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Classification Key Concepts Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech 1 How will

More information

On Identifying Disaster-Related Tweets: Matching-based or Learning based?

On Identifying Disaster-Related Tweets: Matching-based or Learning based? IEEE Big MM 2017, April 19-21, 2017 On Identifying Disaster-Related Tweets: Matching-based or Learning based? Presented by Dr. Seon Ho Kim Hien To Sumeet Agrawal Integrated Media Systems Center University

More information

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Learn from Web Search Logs to Organize Search Results

Learn from Web Search Logs to Organize Search Results Learn from Web Search Logs to Organize Search Results Xuanhui Wang Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 xwang20@cs.uiuc.edu ChengXiang Zhai Department

More information

Automatic Ranking of Images on the Web

Automatic Ranking of Images on the Web Automatic Ranking of Images on the Web HangHang Zhang Electrical Engineering Department Stanford University hhzhang@stanford.edu Zixuan Wang Electrical Engineering Department Stanford University zxwang@stanford.edu

More information

Distribution-free Predictive Approaches

Distribution-free Predictive Approaches Distribution-free Predictive Approaches The methods discussed in the previous sections are essentially model-based. Model-free approaches such as tree-based classification also exist and are popular for

More information

Learning Query and Document Relevance from a Web-scale Click Graph

Learning Query and Document Relevance from a Web-scale Click Graph Learning Query and Document Relevance from a Web-scale Click Graph Shan Jiang, Yuening Hu, Changsung Kang, Tim Daly Jr., Dawei Yin, Yi Chang, Chengxiang Zhai Department of Computer Science University of

More information

IRISA Participation in JRS 2012 Data-Mining Challenge: Lazy-Learning with Vectorization

IRISA Participation in JRS 2012 Data-Mining Challenge: Lazy-Learning with Vectorization IRISA Participation in JRS 2012 Data-Mining Challenge: Lazy-Learning with Vectorization Vincent Claveau To cite this version: Vincent Claveau. IRISA Participation in JRS 2012 Data-Mining Challenge: Lazy-Learning

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

2 Haruechaiyasak, Shyu and Chen identification is proposed. Our topic identification process is based on a classification method which uses a supervis

2 Haruechaiyasak, Shyu and Chen identification is proposed. Our topic identification process is based on a classification method which uses a supervis International Journal of Computational Intelligence and Applications cfl World Scientific Publishing Company IDENTIFYING TOPICS FOR WEB DOCUMENTS THROUGH FUZZY ASSOCIATION LEARNING CHOOCHART HARUECHAIYASAK,

More information

Multimodal topic model for texts and images utilizing their embeddings

Multimodal topic model for texts and images utilizing their embeddings Multimodal topic model for texts and images utilizing their embeddings Nikolay Smelik, smelik@rain.ifmo.ru Andrey Filchenkov, afilchenkov@corp.ifmo.ru Computer Technologies Lab IDP-16. Barcelona, Spain,

More information

Active Browsing using Similarity Pyramids

Active Browsing using Similarity Pyramids Active Browsing using Similarity Pyramids Jau-Yuen Chen, Charles A. Bouman and John C. Dalton School of Electrical and Computer Engineering Purdue University West Lafayette, IN 47907-1285 {jauyuen,bouman}@ecn.purdue.edu

More information
