Text Modeling with the Trace Norm

Jason D. M. Rennie
jrennie@gmail.com
April 14, 2006 (updated June 16, 2006)

1 Introduction

We have two goals: (1) to find a low-dimensional representation of text that allows generalization to unseen data, and (2) to group documents according to their similarities. Clearly these tasks are related: similar documents can be represented more compactly than dissimilar documents, and vice versa. First, we discuss a model for text that smoothly penalizes the rank of the parameter matrix. Then, we discuss how to apply this model to clustering.

2 Text Model

We use a generative model and assume that the documents are not identifiable beyond their term frequency statistics. Let Y be the matrix of term frequency statistics, one document per row. Let P(Y|X) be the likelihood model, where X is a matrix of parameters for the model. A simple example that we will use is a set of multinomial models, one per document, where each row of X parameterizes the multinomial for the corresponding document. We take the product of these per-document models to get P(Y|X). We assume a prior on the parameter matrix, P(X). This prior is one way in which we determine the similarity of documents, which, in turn, determines how compactly we can represent the documents. We make use of the trace norm (the sum of the singular values of a matrix) and discuss parameter and likelihood choices that are appropriate for its use.
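To make this setup concrete, here is a minimal sketch in Python/NumPy (hypothetical data and names, not code from the note): a small term frequency matrix Y with one document per row, and the log of the product of per-document multinomial likelihoods evaluated from a row-stochastic parameter matrix Θ.

```python
import numpy as np
from scipy.special import gammaln  # log-factorials for the multinomial coefficient

# Toy term-frequency matrix Y: 3 documents over a 4-word vocabulary, one row per document.
Y = np.array([[3., 1., 0., 1.],
              [0., 2., 2., 1.],
              [4., 0., 1., 0.]])

def multinomial_log_likelihood(Y, Theta):
    """log P(Y | Theta) = sum_i log P(Y_i | Theta_i), with one multinomial per row."""
    n = Y.sum(axis=1)                                    # document lengths
    log_coef = gammaln(n + 1) - gammaln(Y + 1).sum(axis=1)
    return float(np.sum(log_coef + np.sum(Y * np.log(Theta), axis=1)))

# A hypothetical mean-parameter matrix: each row is a distribution over the vocabulary.
Theta = np.array([[0.5, 0.2, 0.1, 0.2],
                  [0.1, 0.4, 0.4, 0.1],
                  [0.7, 0.1, 0.1, 0.1]])
print(multinomial_log_likelihood(Y, Theta))
```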

2.1 Parameter Prior

The parameter prior for our model, P(X), is the vehicle through which we obtain a reduced representation of the data. In particular, we want the prior to give high weight to a matrix that reflects natural similarities in the data. One natural bias is to weight highly parameter matrices of low rank. It is often the case that data is produced from relatively few underlying factors; this reflects the idea that documents might be composed of a combination of a few underlying themes or topics. This bias is common [2, 4, 1].

Another natural bias for text is to assume that the vectors corresponding to the data are close together. Term frequency vectors are often normalized to unit Euclidean length and treated as points on the unit sphere. In this representation, we can measure angles between vectors: documents are similar if the angle between their vector representations is small. The trace norm can be used as a (smooth) measure of the compactness of a set of vectors. It also has strong ties to matrix rank.

2.1.1 Trace Norm

The trace norm of a matrix is the sum of its singular values. Let X = UΣV^T be the singular value decomposition of X, where U^T U = I, Σ = diag(σ), and V^T V = I. Then, the trace norm of X is

$$\|X\|_\Sigma = \sum_i \sigma_i. \qquad (1)$$

The trace norm is related to the rank of a matrix. In particular, Fazel (Sections 5.1.4 and 5.1.5 of [3]) showed that, over matrices of spectral norm at most one, the convex hull of the matrices of rank at most r is exactly the set of matrices with trace norm at most r. Furthermore, the trace norm provides information about how close vectors are to each other.

To help the reader gain intuition for the trace norm, we provide simple examples in Table 1. Listed are pairs of 2-dimensional vectors, their trace norm when stacked as rows of a matrix, X = [X_1; X_2], and the sine of the angle formed by the vectors (which, due to the simplicity of the example, is just the second component of X_2).

    X_1       X_2                 ||X||_Σ   sin θ
    (1, 0)    (1, 0)              1.41      0
    (1, 0)    (1/√2, 1/√2)        1.85      .71
    (1, 0)    (0, 1)              2         1
    (1, 0)    (−1/√2, 1/√2)       1.85      .71
    (1, 0)    (−1, 0)             1.41      0

Table 1: The first two columns give examples of pairs of unit length vectors. The third column gives the trace norm of the matrix of stacked vectors. The fourth column gives the sine of the angle between the vectors. The trace norm is closely approximated by a linear function of sin θ.

Unfortunately, the trace norm provides neither an upper nor a lower bound on the rank. Nor is it a linear function of the sine of the angle. However, it is a smooth function of the matrix that is strongly correlated with both of these measures. In the example, the trace norm is largest when the vectors are perpendicular and smallest when the vectors are co-linear. In our example, the trace norm is closely approximated by a linear function of the sine of the angle between the vectors, f(θ) = a sin θ + b, where a = 2 − √2 and b = √2.
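As a quick sanity check of Table 1 (a sketch, not from the note), the following NumPy snippet stacks a pair of unit vectors as rows, computes the trace norm from the singular values, and compares it to the linear approximation f(θ) = (2 − √2) sin θ + √2.

```python
import numpy as np

def trace_norm(X):
    """Sum of the singular values of X (eq. 1)."""
    return np.linalg.svd(X, compute_uv=False).sum()

x1 = np.array([1.0, 0.0])
for theta_deg in [0, 45, 90, 135, 180]:
    theta = np.radians(theta_deg)
    x2 = np.array([np.cos(theta), np.sin(theta)])   # unit vector at angle theta from x1
    X = np.vstack([x1, x2])                         # stack the pair as rows, X = [X_1; X_2]
    approx = (2 - np.sqrt(2)) * np.sin(theta) + np.sqrt(2)
    print(theta_deg, round(trace_norm(X), 2), round(approx, 2))
# The trace norm column reproduces Table 1: 1.41, 1.85, 2.0, 1.85, 1.41.
```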

Previous work on text modeling has focused on limiting the rank of the representation of the term frequency matrix. Here, we use the trace norm to give us a smooth penalty that encourages both low rank and compactness of the parameter vectors.

2.1.2 Trace Norm Prior

We utilize the trace norm by incorporating it into the parameter prior. We establish a prior that is the exponentiated negative of the trace norm,

$$P(X) \propto \exp(-\|X\|_\Sigma). \qquad (2)$$

Thus, maximization of the joint likelihood encourages a low-rank parameter matrix with row/column vectors in relatively close proximity. The negative log-likelihood (NLL) of the prior is the trace norm, so if we were to interpret our model as an encoding framework, the value of the trace norm would serve as the parameter encoding length [5].

2.2 Likelihood

The multinomial is a simple, standard model of term frequency for text. It is the model of term frequency that results from the unigram model, where the word for each position in the document is drawn iid. The multinomial can be parameterized in multiple ways. Most common is the mean parameterization,

$$P(y \mid \theta) = \frac{n!}{\prod_i y_i!} \prod_i \theta_i^{y_i}, \qquad (3)$$

where n = Σ_i y_i is the fixed document length and the parameter for word i, 0 ≤ θ_i ≤ 1, is also its expected rate of occurrence. The above is for a single document, y, and parameter vector, θ. For multiple documents, we take the product of likelihoods and stack the parameters into a matrix, P(Y|Θ) = Π_i P(Y_i|Θ_i), where indices indicate row vectors.

The mean parameterization is intuitively pleasing because each of the parameters {θ_i} is the expected rate of occurrence of the corresponding word. However, the fact that the parameters are constrained makes the mean parameterization somewhat awkward to use with the trace norm prior. And the logarithmic transform applied to the parameters in the NLL means that intuition is often wrong about the effects of small parameter changes. An alternate choice is the so-called natural parameterization. The natural parameters are related to the mean parameters through a simple formula,

$$\theta_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}. \qquad (4)$$

Unlike the mean parameterization, the natural parameters are unconstrained and their interaction with the data in the NLL is linear, so the effects of parameter changes are easier to understand.
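These pieces combine into a single regularized objective. Below is a minimal sketch (hypothetical names, assuming the natural parameterization of (4)): the joint NLL is the trace norm of X, which is the NLL of the prior (2), plus the per-document multinomial NLLs of (3), with the log-factorial terms dropped since they do not depend on X.

```python
import numpy as np

def softmax_rows(X):
    """Natural-to-mean map (eq. 4), applied to each row of X."""
    Z = np.exp(X - X.max(axis=1, keepdims=True))   # shift for numerical stability
    return Z / Z.sum(axis=1, keepdims=True)

def joint_nll(X, Y):
    """-log [ P(X) * prod_i P(Y_i | X_i) ], up to terms constant in X."""
    prior_nll = np.linalg.svd(X, compute_uv=False).sum()   # trace norm of X (eq. 2)
    Theta = softmax_rows(X)
    data_nll = -np.sum(Y * np.log(Theta))                  # multinomial NLL (eq. 3)
    return prior_nll + data_nll

Y = np.array([[3., 1., 0., 1.],
              [0., 2., 2., 1.]])
X = np.zeros_like(Y)        # hypothetical starting point: every document uniform
print(joint_nll(X, Y))
```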

In the next two sections, we discuss the interactions of these parameterizations with the trace norm prior.

2.2.1 Mean Parameterization

For the mean parameterization, the mean parameters must be non-negative and sum to one. We can take care of this restriction via a transformation between the parameters used for the prior, X, and those used for the likelihood, Θ,

$$\Theta_{ij} = \frac{X_{ij}}{\sum_{j'} X_{ij'}}. \qquad (5)$$

Thus, our joint probability becomes P(Y|Θ)P(X). But in this context the prior is meaningless, since even as ||X||_Σ → 0 (equivalently, P(X) → 1), we can attain the maximum likelihood. Even a simple constraint on the trace norm, such as ||X||_Σ ≤ 1, is insufficient to overcome the extra degrees of freedom introduced by the transform (5).

Why, then, don't we apply the trace norm prior directly to the mean parameters, P(Θ)? The reason is that this would create an undesirable bias. The trace norm measures lengths using the L2 norm; i.e., the trace norm of a mean parameter vector (which has unit L1 length) can vary from 1 to 1/√d, where d is the dimensionality of the vector. Such a prior would act on the entropy of the individual parameter vectors more than it would encourage the set of parameter vectors to be of low rank.

We can correct this bias by replacing the L1 constraint with an L2 constraint on the parameter vectors used for the trace norm calculation. Our joint probability is P(Y|Θ)P(X), where (5) is used to convert between Θ and X. For the optimization, we apply the constraint that each row of X has unit L2 norm. Thus, the trace norm of any single row of X is 1, and the prior only imparts a bias on the relative locations of the parameter vectors, not their absolute locations. The values in Table 1 give intuition for how the trace norm is affected by different orientations of a pair of unit L2 length vectors.

2.2.2 Natural Parameterization

The natural parameterization leads to a simpler model. There is one undesirable bias; however, this is fixed naturally by introducing a parameter hierarchy. We begin with a short introduction to the exponential family of distributions. A likelihood P(y|x) is in the exponential family if it can be written as

$$P(y \mid x) = a(y)\, b(x)\, e^{c(y)^T d(x)}, \qquad (6)$$

where a and b are functions that return scalar values, and c and d are functions that return vector values. The density is in canonical form if c(y) = y; in canonical form, d(x) is the natural parameter. For the multinomial, we have log a(y) = −Σ_i log y_i!, log b(x) = log n! − n log Σ_i exp(x_i), c(y) = y and d(x) = x, or

$$\log P(y \mid x) = -\sum_i \log y_i! + \log n! - n \log \sum_j \exp(x_j) + \sum_i y_i x_i, \qquad (7)$$

where n = Σ_i y_i is the document length. That is, P(y|x) is in canonical form and x is the natural parameter. As with any natural parameterization, the rate of change in the NLL for a small change in the parameter is the difference between expected and empirical values.
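A small numerical check of that last statement (a sketch under the notation of (7), not code from the note): the gradient of the NLL with respect to the natural parameters is the expected counts, n·softmax(x), minus the empirical counts y.

```python
import numpy as np
from scipy.special import gammaln

def multinomial_nll_natural(x, y):
    """-log P(y | x) for natural parameters x (the negative of eq. 7)."""
    n = y.sum()
    return (gammaln(y + 1).sum() - gammaln(n + 1)
            + n * np.log(np.exp(x).sum()) - y @ x)

y = np.array([3., 1., 0., 1.])
x = np.array([0.2, -0.1, 0.0, 0.4])

# Analytic gradient: expected counts minus empirical counts.
expected = y.sum() * np.exp(x) / np.exp(x).sum()
grad = expected - y

# Finite-difference check of the first coordinate.
eps = 1e-6
e0 = np.array([eps, 0., 0., 0.])
numeric = (multinomial_nll_natural(x + e0, y) - multinomial_nll_natural(x - e0, y)) / (2 * eps)
print(grad[0], numeric)   # the two values agree closely
```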

It is valuable to understand how natural parameters behave on the simplex. Due to the transformation between natural and mean parameters, (4), convex combinations of natural parameters do not lead to convex regions on the simplex. Figure 1 provides an example.

[Figure 1: Shown are boundaries of regions enclosing the convex hull of the points (.9, .07, .03), (.07, .03, .9), and (.03, .9, .07). Black (outer, solid) lines bound the simplex. The blue (inner, solid) line bounds the mean parameter convex hull. The red (dotted) line bounds the natural parameter convex hull. To construct the natural parameter convex hull, we find natural parameter representations of the three points, determine the convex hull in natural parameter space, then transform the hull to mean parameter space. Viewed in mean parameter space, the natural parameter convex hull is not convex.]

Our natural parameter model is described by the joint distribution, P(X)P(Y|X), where the likelihood is a product of natural parameter multinomials, P(Y|X) = Π_i P(Y_i|X_i), where Y_i is the i-th row of Y and P(X) is the trace norm prior. This model prefers a low magnitude, low rank parameter matrix. The low rank bias is desirable, but the low magnitude bias is not: it too strongly encourages points close to the middle of the simplex (which corresponds to a natural parameter vector of all zeros).
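Two small numerical checks of these claims (a sketch using the softmax map of (4), with hypothetical names): the all-zeros natural parameter vector maps to the middle of the simplex, and a convex combination taken in natural parameter space does not correspond to a convex combination on the simplex, which is the non-convexity illustrated in Figure 1.

```python
import numpy as np

def softmax(z):
    """Natural-to-mean map (eq. 4)."""
    e = np.exp(z - z.max())
    return e / e.sum()

# The all-zeros natural parameter vector is the middle of the simplex.
print(softmax(np.zeros(3)))             # [1/3, 1/3, 1/3]

# Two of the mean-parameter points from Figure 1.
p = np.array([0.90, 0.07, 0.03])
q = np.array([0.07, 0.03, 0.90])
zp, zq = np.log(p), np.log(q)           # natural parameter representations (up to a constant)

mid_natural = softmax(0.5 * (zp + zq))  # midpoint taken in natural parameter space
mid_simplex = 0.5 * (p + q)             # midpoint taken on the simplex
print(mid_natural)                      # approx. [0.545, 0.099, 0.356]
print(mid_simplex)                      # [0.485, 0.05, 0.465]
# The midpoints differ: convex combinations of natural parameters are not convex
# combinations on the simplex.
```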

The trace norm prior penalizes each row of X separately for its distance from the origin of the natural parameter space. The origin is the de facto center for the trace norm calculation. However, the origin is usually a poor center for text: some words are simply more common than others. We can shift the center by introducing a new parameter vector that is added to each row of X to form the likelihood parameters. Let σ be this center parameter vector. Our updated joint distribution is

$$P(\sigma)\, P(X) \prod_i P(Y_i \mid X_i + \sigma).$$

X and σ are independent for generation (Y unobserved), but dependent for inference (Y observed). We call this the multi-prior model (see Figure 2). A natural choice for P(σ) is the trace norm prior (note that since σ is a vector, its trace norm is simply its L2 length, ||σ||_Σ = ||σ||_2). In effect, only a single parameter vector will be biased towards the origin. Hence, for a large data set (many documents), this bias will be quite small. Though unclear in the graphical model, the multi-prior model reflects hierarchical structure in the data. It can easily be extended to deal with additional structure, such as classes, sub-classes, etc.

[Figure 2: Graphical model for the multi-prior model, with nodes σ, X, and Y.]

3 Discussion

Use of the trace norm as a parameter prior for text modeling provides a smooth low-rank bias. The trace norm prior can be utilized in either a mean parameter or a natural parameter framework. Each has minor issues that can be addressed easily, yielding two frameworks for learning low-rank text models. We find the natural parameter model particularly pleasing due to its intrinsic advantages: lack of constraints, a simple derivative, and ease in modeling hierarchical structure.

References

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. In Advances in Neural Information Processing Systems 14, 2002.

[2] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

[3] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.

[4] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR '99), 1999.

[5] J. D. M. Rennie. Encoding model parameters. people.csail.mit.edu/jrennie/writing, May 2003.