Vector Semantics. Dense Vectors


Vector Semantics Dense Vectors

Sparse versus dense vectors
PPMI vectors are long (length |V| = 20,000 to 50,000) and sparse (most elements are zero). Alternative: learn vectors which are short (length 200-1000) and dense (most elements are non-zero).

Sparse versus dense vectors
Why dense vectors? Short vectors may be easier to use as features in machine learning (fewer weights to tune). Dense vectors may generalize better than storing explicit counts. They may do better at capturing synonymy: car and automobile are synonyms, but are represented as distinct dimensions; this fails to capture the similarity between a word with car as a neighbor and a word with automobile as a neighbor.

Three methods for getting short dense vectors:
Singular Value Decomposition (SVD); a special case of this is called LSA (Latent Semantic Analysis)
Neural-language-model-inspired predictive models: skip-grams and CBOW
Brown clustering

Vector Semantics Dense Vectors via SVD

Intuition
Approximate an N-dimensional dataset using fewer dimensions, by first rotating the axes into a new space in which the highest-order dimension captures the most variance in the original dataset, the next dimension captures the next most variance, and so on. Many such (related) methods: PCA (principal components analysis), Factor Analysis, SVD.

Dimensionality reduction
[Figure: a toy 2-D dataset with its two rotated axes, labeled PCA dimension 1 and PCA dimension 2]
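To make the rotation intuition concrete, here is a minimal sketch of PCA computed via SVD on a small invented 2-D dataset: the rows of Vt are the new orthogonal axes, ordered by how much variance they capture, and projecting onto the first one reduces two dimensions to one. (The data and all names below are illustrative assumptions, not part of the slides.)

```python
# Minimal sketch: PCA via SVD on a toy 2-D dataset (data invented for illustration).
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D points, so most of the variance lies along one direction.
points = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

centered = points - points.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Rows of Vt are the rotated axes ("PCA dimension 1", "PCA dimension 2"),
# ordered by the amount of variance each accounts for.
projected_1d = centered @ Vt[0]   # coordinates along PCA dimension 1 only
print(Vt[0], projected_1d[:5])
```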

Singular Value Decomposition
Any rectangular w x c matrix X equals the product of 3 matrices:
W: rows corresponding to the original rows, but each of its m columns represents a dimension in a new latent space, such that the m column vectors are orthogonal to each other, and the columns are ordered by the amount of variance in the dataset each new dimension accounts for.
S: diagonal m x m matrix of singular values expressing the importance of each dimension.
C: columns corresponding to the original columns, but its m rows correspond to the singular values.

Singular Value Decomposition
X (w x c) = W (w x m) S (m x m) C (m x c), where the rows of X are words and the columns are contexts. (Landauer and Dumais 1997)
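A minimal sketch of this factorization with NumPy, using an invented word-context count matrix (the words and counts below are assumptions for illustration):

```python
import numpy as np

# Toy word-context count matrix X (rows = words, columns = contexts).
X = np.array([
    [2.0, 0.0, 1.0, 0.0],   # "car"
    [1.0, 0.0, 2.0, 0.0],   # "automobile"
    [0.0, 3.0, 0.0, 1.0],   # "banana"
])

# np.linalg.svd returns W (w x m), the singular values s (length m = min(w, c)),
# and C (m x c).
W, s, C = np.linalg.svd(X, full_matrices=False)

# Multiplying the three factors back together recovers the original matrix.
assert np.allclose(X, W @ np.diag(s) @ C)
print(s)   # singular values, ordered from most to least important dimension
```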

SVD applied to term-document matrix: Latent Semantic Analysis
If instead of keeping all m dimensions we just keep the top k singular values (say, 300), the result is a least-squares approximation to the original X. But instead of multiplying out the factors, we'll just make use of W. Each row of W is a k-dimensional vector representing word w:
X (w x c) ≈ W (w x k) S (k x k) C (k x c). (Deerwester et al. 1988)
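A hedged sketch of the truncation step on a toy term-document count matrix (counts invented; k = 2 here instead of the ~300 mentioned above):

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
X = np.array([
    [3.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 2.0, 0.0],
    [0.0, 4.0, 0.0, 1.0],
    [0.0, 3.0, 0.0, 2.0],
])

W, s, C = np.linalg.svd(X, full_matrices=False)

k = 2
X_k = W[:, :k] @ np.diag(s[:k]) @ C[:k, :]   # least-squares rank-k approximation of X
print(np.linalg.norm(X - X_k))               # error introduced by the truncation

# For LSA we typically keep just the truncated W: each row is a
# k-dimensional representation of one term.
term_vectors = W[:, :k]
```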

Let's return to PPMI word-word matrices. Can we apply SVD to them?

SVD applied to term-term matrix:

$$
\underbrace{\begin{bmatrix} X \end{bmatrix}}_{|V| \times |V|}
=
\underbrace{\begin{bmatrix} W \end{bmatrix}}_{|V| \times |V|}
\underbrace{\begin{bmatrix}
s_1 & 0 & 0 & \cdots & 0 \\
0 & s_2 & 0 & \cdots & 0 \\
0 & 0 & s_3 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & s_{|V|}
\end{bmatrix}}_{|V| \times |V|}
\underbrace{\begin{bmatrix} C \end{bmatrix}}_{|V| \times |V|}
$$

(I'm simplifying here by assuming the matrix has rank |V|.)

Truncated SVD on term-term matrix:

$$
\underbrace{\begin{bmatrix} X \end{bmatrix}}_{|V| \times |V|}
=
\underbrace{\begin{bmatrix} W \end{bmatrix}}_{|V| \times k}
\underbrace{\begin{bmatrix}
s_1 & 0 & 0 & \cdots & 0 \\
0 & s_2 & 0 & \cdots & 0 \\
0 & 0 & s_3 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & s_k
\end{bmatrix}}_{k \times k}
\underbrace{\begin{bmatrix} C \end{bmatrix}}_{k \times |V|}
$$

Truncated SVD produces embeddings
Each row of the W matrix is a k-dimensional representation of each word w. k might range from 50 to 1000. Generally we keep the top k dimensions, but some experiments suggest that getting rid of the top 1 dimension, or even the top 50 dimensions, is helpful (Lapesa and Evert 2014).
[Figure: the |V| x k matrix W, with one row highlighted as the embedding for word i]
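A minimal sketch of using those rows as embeddings, on an invented word-context co-occurrence matrix, with cosine similarity to compare words (the word list and counts are assumptions for illustration):

```python
import numpy as np

words = ["car", "automobile", "banana", "apple"]
X = np.array([                      # toy word-context co-occurrence matrix
    [4.0, 3.0, 0.0, 0.0, 1.0],      # rows = target words, columns = context words
    [3.0, 4.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 3.0, 1.0],
    [0.0, 0.0, 3.0, 4.0, 1.0],
])

W, s, C = np.linalg.svd(X, full_matrices=False)
k = 2
embeddings = W[:, :k]               # row i = k-dimensional embedding for words[i]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings[0], embeddings[1]))   # car vs. automobile: high
print(cosine(embeddings[0], embeddings[2]))   # car vs. banana: near zero
```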

Embeddings versus sparse vectors
Dense SVD embeddings sometimes work better than sparse PPMI matrices at tasks like word similarity:
Denoising: low-order dimensions may represent unimportant information.
Truncation may help the models generalize better to unseen data.
Having a smaller number of dimensions may make it easier for classifiers to properly weigh the dimensions for the task.
Dense models may do better at capturing higher-order co-occurrence.

Vector Semantics Embeddings inspired by neural language models: skip-grams and CBOW

Prediction-based models: an alternative way to get dense vectors
Skip-gram (Mikolov et al. 2013a) and CBOW (Mikolov et al. 2013b) learn embeddings as part of the process of word prediction: train a neural network to predict neighboring words. Inspired by neural net language models. In so doing, they learn dense embeddings for the words in the training corpus.
Advantages: fast and easy to train (much faster than SVD); available online in the word2vec package, including sets of pretrained embeddings!
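A hedged sketch of training both variants with gensim's implementation of word2vec (assuming gensim 4.x; the three-sentence toy corpus is invented, so the resulting vectors are only illustrative):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
]

# sg=1 selects skip-gram; sg=0 (the default) selects CBOW.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["king"][:5])                  # a dense 50-dimensional embedding
print(skipgram.wv.most_similar("king", topn=3)) # nearest neighbors in the learned space
```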

Embeddings capture relational meaning!
vector('king') - vector('man') + vector('woman') ≈ vector('queen')
vector('Paris') - vector('France') + vector('Italy') ≈ vector('Rome')
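A hedged sketch of this analogy arithmetic using gensim's pretrained-vector loader (the dataset name is an assumption; any set of pretrained word vectors would do, and this particular set is lowercased):

```python
import gensim.downloader as api

# Assumed pretrained embeddings from the gensim-data collection (lowercase vocabulary).
wv = api.load("glove-wiki-gigaword-100")

# vector('king') - vector('man') + vector('woman') should land near vector('queen').
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# vector('paris') - vector('france') + vector('italy') should land near vector('rome').
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
```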

Vector Semantics Brown clustering

Brown clustering
An agglomerative clustering algorithm that clusters words based on which words precede or follow them. These word clusters can be turned into a kind of vector. We'll give a very brief sketch here.

Brown clustering algorithm
Each word is initially assigned to its own cluster. We now consider merging each pair of clusters; the highest-quality merge is chosen. Quality = merging two clusters of words that have similar probabilities of preceding and following words. Clustering proceeds until all words are in one big cluster.
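A deliberately simplified sketch of this greedy merging loop. The real algorithm maximizes the mutual information of a class-based bigram model; here merge quality is approximated by cosine similarity of preceding/following context counts, and the tiny corpus is invented, so this is only an illustration of the control flow, not the Brown et al. criterion.

```python
# Toy greedy agglomerative clustering in the spirit of Brown clustering.
from collections import Counter
import math

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))

# Context counts: for each word, how often each word precedes ("L") or follows ("R") it.
context = {w: Counter() for w in vocab}
for i, w in enumerate(corpus):
    if i > 0:
        context[w][("L", corpus[i - 1])] += 1
    if i < len(corpus) - 1:
        context[w][("R", corpus[i + 1])] += 1

def cosine(c1, c2):
    num = sum(c1[k] * c2[k] for k in set(c1) & set(c2))
    den = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

clusters = [[w] for w in vocab]                    # each word starts in its own cluster
cluster_ctx = [context[w].copy() for w in vocab]   # context counts per cluster

while len(clusters) > 1:
    # Choose the highest-quality (most context-similar) pair of clusters to merge.
    i, j = max(
        ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda ab: cosine(cluster_ctx[ab[0]], cluster_ctx[ab[1]]),
    )
    print("merging", clusters[i], "+", clusters[j])
    clusters[i] = clusters[i] + clusters[j]
    cluster_ctx[i] = cluster_ctx[i] + cluster_ctx[j]   # Counter addition sums counts
    del clusters[j], cluster_ctx[j]
```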

Brown clusters as vectors
By tracing the order in which clusters are merged, the model builds a binary tree from bottom to top. Each word is represented by a binary string = the path from the root to its leaf. Each intermediate node is a cluster. For example, chairman is 0010, months = 01, and verbs = 1.
[Figure: binary merge tree with interior nodes labeled 0/1, 00/01/10/11, 000-101, 0010/0011, and leaves including CEO, chairman, president, November, October, run, sprint, and walk]

Brown cluster examples
Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
June March July April January December October November September August
pressure temperature permeability density porosity stress velocity viscosity gravity tension
anyone someone anybody somebody
had hadn't hath would've could've should've must've might've
asking telling wondering instructing informing kidding reminding bothering thanking deposing
mother wife father son husband brother daughter sister boss uncle
great big vast sudden mere sheer gigantic lifelong scant colossal
down backwards ashore sideways southward northward overboard aloft downwards adrift
Figure 19.1: Some sample Brown clusters from a 260,741-word vocabulary trained on 366 million words of newswire text (Brown et al., 1992).