Vector Semantics Dense Vectors

Sparse versus dense vectors. PPMI vectors are long (length |V| = 20,000 to 50,000) and sparse (most elements are zero). Alternative: learn vectors which are short (length 200-1000) and dense (most elements are non-zero).

Sparse versus dense vectors: why dense vectors? Short vectors may be easier to use as features in machine learning (fewer weights to tune). Dense vectors may generalize better than storing explicit counts. They may do better at capturing synonymy: car and automobile are synonyms, but are represented as distinct dimensions; this fails to capture the similarity between a word with car as a neighbor and a word with automobile as a neighbor.
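For concreteness, a tiny made-up comparison of the two shapes (the numbers are illustrative only, not real PPMI values):

```python
import numpy as np

# One PPMI-style sparse row over a 50,000-word vocabulary versus a
# 300-dimensional dense vector.
sparse_row = np.zeros(50_000)
sparse_row[[12, 480, 3971]] = [2.1, 0.7, 1.5]       # only a handful of nonzero PPMI values
dense_vec = np.random.default_rng(0).normal(size=300)

print(int((sparse_row != 0).sum()), sparse_row.size)   # 3 nonzeros out of 50,000
print(dense_vec.size)                                   # 300 dimensions, essentially all nonzero
```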

Three methods for getting short dense vectors:
Singular Value Decomposition (SVD); a special case of this is called LSA (Latent Semantic Analysis).
Neural-language-model-inspired predictive models: skip-grams and CBOW.
Brown clustering.

Vector Semantics Dense Vectors via SVD

Intuition: approximate an N-dimensional dataset using fewer dimensions, by first rotating the axes into a new space in which the highest-order dimension captures the most variance in the original dataset, and the next dimension captures the next most variance, and so on. Many such (related) methods: PCA (principal components analysis), Factor Analysis, SVD.
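A minimal numpy sketch of that rotation idea on synthetic 2-D data (not from the slides): project the points onto their principal axes and watch the variance concentrate in the first new dimension.

```python
import numpy as np

# Rotate 2-D correlated points onto their principal axes and check that most
# of the variance lands in the first new dimension.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 0.9 * x + 0.1 * rng.normal(size=500)])   # strongly correlated columns

centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)   # rows of Vt are the principal axes
rotated = centered @ Vt.T                                 # coordinates in the rotated space

print(rotated.var(axis=0))   # first dimension's variance >> second dimension's
```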

Dimensionality reduction. [Figure: 2-D data points plotted with the two principal axes overlaid, labeled "PCA dimension 1" and "PCA dimension 2".]

Singular Value Decomposition. Any rectangular matrix X equals the product of 3 matrices:
W: rows corresponding to the original rows, but its m columns each represent a dimension in a new latent space, such that the m column vectors are orthogonal to each other, and the columns are ordered by the amount of variance in the dataset each new dimension accounts for.
S: diagonal m x m matrix of singular values expressing the importance of each dimension.
C: columns corresponding to the original columns, but with m rows corresponding to the singular values.
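As a quick sanity check (a minimal numpy sketch, not from the slides), the decomposition and the orthogonality of W's columns can be verified directly; numpy's U, diag(s), Vt play the roles of the slide's W, S, C:

```python
import numpy as np

# Decompose a toy w x c matrix and confirm X = W S C.
X = np.random.default_rng(1).integers(0, 5, size=(6, 4)).astype(float)
W, s, C = np.linalg.svd(X, full_matrices=False)

print(W.shape, s.shape, C.shape)              # (6, 4) (4,) (4, 4)
print(np.allclose(X, W @ np.diag(s) @ C))     # True: X = W S C
print(np.allclose(W.T @ W, np.eye(4)))        # True: the columns of W are orthonormal
```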

Singular Value Decomposition of a word-context matrix (Landauer and Dumais 1997):

$$ X_{w \times c} \;=\; W_{w \times m} \; S_{m \times m} \; C_{m \times c} $$

SVD applied to the term-document matrix: Latent Semantic Analysis (Deerwester et al. 1988). Instead of keeping all m dimensions, we keep just the top k singular values, say 300. The result is a least-squares approximation to the original X. But instead of multiplying the matrices back out, we just make use of W: each row of W is a k-dimensional vector representing a word.

$$ X_{w \times c} \;\approx\; W_{w \times k} \; S_{k \times k} \; C_{k \times c} $$
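A rough sketch of that truncation in numpy (the matrix here is a random stand-in for a real term-document matrix, and the sizes are only illustrative):

```python
import numpy as np

# LSA-style truncation: keep the top k singular values and use the rows of
# the truncated W as word vectors.
X = np.random.default_rng(2).random((2000, 500))     # stand-in for a real w x c count matrix
W, s, C = np.linalg.svd(X, full_matrices=False)

k = 300
W_k = W[:, :k]                                       # w x k: one k-dimensional row per word
X_k = W_k @ np.diag(s[:k]) @ C[:k, :]                # best rank-k (least-squares) approximation of X
print(W_k.shape, np.linalg.norm(X - X_k))
```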

Let's return to PPMI word-word matrices: can we apply SVD to them?

SVD applied to the term-term matrix:

$$ X_{|V| \times |V|} \;=\; W_{|V| \times |V|} \; \begin{bmatrix} s_1 & 0 & 0 & \cdots & 0 \\ 0 & s_2 & 0 & \cdots & 0 \\ 0 & 0 & s_3 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & s_{|V|} \end{bmatrix}_{|V| \times |V|} \; C_{|V| \times |V|} $$

(I'm simplifying here by assuming the matrix has rank |V|.)

Truncated SVD on the term-term matrix: keep only the top k singular values:

$$ X_{|V| \times |V|} \;\approx\; W_{|V| \times k} \; \begin{bmatrix} s_1 & 0 & 0 & \cdots & 0 \\ 0 & s_2 & 0 & \cdots & 0 \\ 0 & 0 & s_3 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & s_k \end{bmatrix}_{k \times k} \; C_{k \times |V|} $$

Truncated SVD produces embeddings. Each row of the |V| x k matrix W is a k-dimensional representation of one word w; row i is the embedding for word i. k might range from 50 to 1000. Generally we keep the top k dimensions, but some experiments suggest that getting rid of the top 1 dimension, or even the top 50 dimensions, is helpful (Lapesa and Evert 2014).
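A small illustration of using those rows as word vectors (with a random stand-in for the truncated W and a made-up vocabulary, since no real matrix is at hand):

```python
import numpy as np

# Cosine similarity between the embedding rows of two words.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

vocab = {"car": 0, "automobile": 1, "banana": 2}
W_k = np.random.default_rng(3).normal(size=(len(vocab), 50))   # pretend k = 50

print(cosine(W_k[vocab["car"]], W_k[vocab["automobile"]]))
print(cosine(W_k[vocab["car"]], W_k[vocab["banana"]]))
```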

Embeddings versus sparse vectors. Dense SVD embeddings sometimes work better than sparse PPMI matrices at tasks like word similarity:
Denoising: low-order dimensions may represent unimportant information.
Truncation may help the models generalize better to unseen data.
Having a smaller number of dimensions may make it easier for classifiers to properly weigh the dimensions for the task.
Dense models may do better at capturing higher-order co-occurrence.

Vector Semantics Embeddings inspired by neural language models: skip-grams and CBOW

Prediction-based models: an alternative way to get dense vectors. Skip-gram (Mikolov et al. 2013a) and CBOW (Mikolov et al. 2013b) learn embeddings as part of the process of word prediction: train a neural network to predict neighboring words. Inspired by neural net language models. In so doing, they learn dense embeddings for the words in the training corpus. Advantages: fast, easy to train (much faster than SVD); available online in the word2vec package, including sets of pretrained embeddings!
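A minimal usage sketch, assuming the gensim package (4.x API) as one common word2vec implementation; the toy sentences below are made up, and a real corpus would be a large iterable of tokenized sentences:

```python
from gensim.models import Word2Vec

# Train skip-gram embeddings on a toy corpus.
sentences = [["the", "dog", "barked"],
             ["the", "cat", "meowed"],
             ["a", "dog", "chased", "a", "cat"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)   # sg=1 selects skip-gram
print(model.wv["dog"].shape)                  # (100,)
print(model.wv.most_similar("dog", topn=2))   # nearest neighbors in the learned space
```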

Embeddings capture relational meaning! vector('king') - vector('man') + vector('woman') ≈ vector('queen'); vector('Paris') - vector('France') + vector('Italy') ≈ vector('Rome').
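A hedged sketch of the analogy trick using gensim's downloader and one of its pretrained GloVe sets (used here simply as convenient pretrained vectors in place of word2vec; the words in that set are lowercased):

```python
import gensim.downloader as api

# king - man + woman ~ queen, Paris - France + Italy ~ Rome
wv = api.load("glove-wiki-gigaword-100")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))      # expect something like "queen"
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))  # expect something like "rome"
```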

Vector Semantics Brown clustering

Brown clustering: an agglomerative clustering algorithm that clusters words based on which words precede or follow them. These word clusters can be turned into a kind of vector. We'll give a very brief sketch here.

Brown clustering algorithm. Each word is initially assigned to its own cluster. We then consider merging each pair of clusters, and the highest-quality merge is chosen. Quality = the merge joins two words that have similar probabilities of preceding and following words. (More technically, quality = smallest decrease in the likelihood of the corpus according to a class-based language model.) Clustering proceeds until all words are in one big cluster, as in the sketch below.
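A deliberately naive, brute-force sketch of this greedy procedure on a toy corpus (real implementations use incremental updates and a bounded window of active clusters; this version recomputes the average mutual information of the class bigram model from scratch for every candidate merge, so it only scales to tiny data):

```python
from collections import Counter
from math import log

corpus = "the dog ran the cat ran the dog walked the cat walked".split()

bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(bigrams.values())
clusters = {w: frozenset([w]) for w in set(corpus)}   # word -> its current cluster

def ami(assign):
    """Average mutual information of the class bigram distribution."""
    joint, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = assign[w1], assign[w2]
        joint[c1, c2] += n
        left[c1] += n
        right[c2] += n
    return sum((n / total) * log((n / total) / ((left[c1] / total) * (right[c2] / total)))
               for (c1, c2), n in joint.items())

merges = []
while len(set(clusters.values())) > 1:
    active = list(set(clusters.values()))
    best = None
    for i in range(len(active)):
        for j in range(i + 1, len(active)):
            merged = active[i] | active[j]
            trial = {w: (merged if c in (active[i], active[j]) else c)
                     for w, c in clusters.items()}
            score = ami(trial)                 # higher AMI = smaller likelihood decrease
            if best is None or score > best[0]:
                best = (score, active[i], active[j], trial)
    merges.append((best[1], best[2]))          # merge order implicitly builds the binary tree
    clusters = best[3]

for a, b in merges:
    print(sorted(a), "+", sorted(b))
```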

Brown clusters as vectors. By tracing the order in which clusters are merged, the model builds a binary tree from bottom to top. Each word is represented by a binary string = the path from the root to its leaf; each intermediate node is a cluster. For example, chairman is 0010, months = 01, and verbs = 1. [Figure: binary merge tree with internal nodes labeled 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101; leaves include CEO (0010), chairman, president, November, October, run, sprint, walk.]
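A tiny sketch of how such bit strings are typically consumed as prefix features in downstream classifiers (the paths below are made-up placeholders, not the output of a real clustering run):

```python
# Map words to bit-string prefixes of a few fixed lengths.
paths = {"chairman": "0010", "president": "0011", "october": "011", "run": "10"}

def prefix_features(word, lengths=(2, 4)):
    path = paths.get(word, "")
    return {f"brown_prefix_{n}": path[:n] for n in lengths if path}

print(prefix_features("chairman"))   # {'brown_prefix_2': '00', 'brown_prefix_4': '0010'}
print(prefix_features("october"))    # {'brown_prefix_2': '01', 'brown_prefix_4': '011'}
```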

Brown cluster examples:
Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
June March July April January December October November September August
pressure temperature permeability density porosity stress velocity viscosity gravity tension
anyone someone anybody somebody
had hadn't hath would've could've should've must've might've
asking telling wondering instructing informing kidding reminding bothering thanking deposing
mother wife father son husband brother daughter sister boss uncle
great big vast sudden mere sheer gigantic lifelong scant colossal
down backwards ashore sideways southward northward overboard aloft downwards adrift
Figure 19.1: Some sample Brown clusters from a 260,741-word vocabulary trained on 366 million words of newswire text (Brown et al., 1992).

Vector Semantics Evaluating similarity

Evaluating similarity.
Extrinsic (task-based, end-to-end) evaluation: question answering, spell checking, essay grading.
Intrinsic evaluation: correlation between the algorithm's scores and human word similarity ratings. WordSim-353: 353 noun pairs rated 0-10, e.g. sim(plane, car) = 5.77. Taking TOEFL multiple-choice vocabulary tests: "Levied is closest in meaning to: imposed, believed, requested, correlated."
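A minimal sketch of the intrinsic evaluation, rank-correlating model scores against human ratings with scipy's Spearman correlation (the three pairs and numbers below are made up, not the real WordSim-353 data):

```python
from scipy.stats import spearmanr

human_ratings = [7.9, 6.3, 1.3]      # human similarity judgments for three word pairs
model_scores = [0.71, 0.55, 0.08]    # the model's cosine similarities for the same pairs

rho, _ = spearmanr(human_ratings, model_scores)
print(rho)                           # 1.0 here: the two rankings agree perfectly
```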

Summary: distributional (vector) models of meaning.
Sparse: PPMI-weighted word-word co-occurrence matrices.
Dense: word-word SVD (50-2000 dimensions); skip-grams and CBOW; Brown clusters (5-20 binary dimensions).