CS5220: Assignment Topic Modeling via Matrix Computation in Julia

April 7, 2014

Thanks to Moontae Lee and David Mimno for providing the first draft of this assignment and the data set. Any errors you find were probably subsequently introduced by DSB!

1 Introduction

Topic modeling is widely used to summarize the major themes in large document collections. Topic modeling methods automatically extract topic clusters, represented by words and their relative intensities. In this assignment, we explore topic extraction using a type of non-negative matrix factorization of a matrix that captures correlations between words in a document collection. In this model, the matrix factors represent latent thematic structures.

Topic #   Top Words
1         gas station inside card case store location
2         thai pad curry spicy rice dish tea
3         pizza crust cheese sauce thin pepperoni toppings
4         trail parking lot park easy mountain phoenix view
5         sum wasn't minutes won't bad wanted phoenix
6         yoga classes class work favorite years hot
7         donuts coffee home day breakfast i'll shop

Table 1: First seven topic clusters from Yelp reviews

2 Model

Topic models are statistical models that represent documents in terms of shared topics. As these topics are not explicitly visible, the model assumes:

1. Each topic is a mixture of words and their relative intensities. [1]

2. Each document exhibits multiple topics with corresponding intensities. [2]

3. Each word in a document is generated with respect to its underlying topic.

[1] In other words, a topic is a multinomial distribution over words.
[2] Similarly, a document is represented by a multinomial distribution over topics.

Since only the words in the documents are observable, our goal is to infer which words comprise each topic (i.e., the topic-word distributions) and which topics comprise each document (i.e., the document-topic distributions). The most popular approach to this problem is Latent Dirichlet Allocation (LDA). [3] We will use a less expensive approach based on anchor words. [4]

[3] See this introduction: https://www.cs.princeton.edu/~blei/papers/blei2011.pdf
[4] See http://jmlr.org/proceedings/papers/v28/arora13.pdf for details.

Our model begins with the word-word co-occurrence matrix Q, which records how often each pair of words occurs together in the documents. Denote the number of distinct words in the documents by V and the (predefined) number of topics by K. Our goal is to factor this V × V matrix Q into a product of non-negative matrices so that we can parse hidden thematic structure from the resulting decomposition:

    Q ≈ A R A^T,    (1)

where A ∈ R_+^{V×K} is the word-topic matrix and R ∈ R_+^{K×K} is the topic-topic matrix. An entry a_{ik} of the word-topic matrix can be thought of as the probability of word i given topic k, or the relative intensity of word i in topic k. An entry r_{ij} of the topic-topic matrix represents how much topic i is correlated with topic j. Our goal in this assignment is to recover A given Q.

3 Inference

Unfortunately, factoring Q as in (1) is not easy, because we enforce the constraint that the columns of the word-topic matrix represent probability distributions (so they are non-negative and sum to one). To simplify the computation, we make an assumption:

Assumption 1. For each topic k, there is an anchor word i such that A_{ik} > 0 and A_{ik'} = 0 for all k' ≠ k.

Under this assumption, inference in our model consists of the following steps:

1. Find the set of K anchor words given the matrix Q (say S = {s_1, ..., s_K}).

2. Recover the word-topic matrix A with respect to S.

3.1 Finding anchor words

Under the anchor-word assumption, every row vector of the row-normalized word-word co-occurrence matrix Q̄ lies in the convex hull of the rows corresponding to anchor words. We find these representative rows with Algorithm 1. [5] The key step in this algorithm is computing the (size of the) projection of all points onto the orthogonal complement of the current subspace span(S), which we can accomplish using either the Gram-Schmidt process or Householder transforms.

[5] For numerical linear algebra aficionados, this is equivalent to QR with column pivoting.
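Algorithm 1 below states the greedy procedure formally. As a rough illustration of the projection step only, here is a minimal sketch in current Julia (assuming Qbar holds the row-normalized co-occurrence matrix with one word per row; the name find_anchors and all details are illustrative, not the reference implementation):

    using LinearAlgebra

    # Greedy anchor selection: repeatedly pick the row farthest from the span of the
    # rows chosen so far, maintaining residuals by Gram-Schmidt projection.
    function find_anchors(Qbar::AbstractMatrix, K::Int)
        P = Matrix{Float64}(Qbar)        # residual rows; chosen directions get projected out
        V = size(P, 1)
        anchors = Int[]
        push!(anchors, argmax([norm(P[j, :]) for j in 1:V]))   # farthest point from the origin
        for k in 2:K
            u = P[anchors[end], :]
            u ./= norm(u)                # direction of the most recently chosen anchor
            for j in 1:V
                P[j, :] .-= dot(P[j, :], u) .* u   # project onto the orthogonal complement
            end
            push!(anchors, argmax([norm(P[j, :]) for j in 1:V]))  # farthest point from span(S)
        end
        return anchors
    end

Because each pass projects out only the newest anchor direction from the already-residualized rows, the residuals stay orthogonal to all previously chosen anchors, which is exactly the Gram-Schmidt update mentioned above.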

Algorithm 1: Find Anchor Words
In: the row-normalized word-word co-occurrence matrix Q̄, the number of topics K
Out: the set of anchor words S

def Find-Anchors(Q̄, K)
    S ← {q̄_i}, where q̄_i is the farthest point from the origin
    for k = 2 to K do
        q̄_i ← the point farthest from span(S)
        S ← S ∪ {q̄_i}
    end for
    return S

Notation: q̄_i and q̄_j denote row vectors of the input matrix Q̄.

3.2 Recovering the word-topic matrix

Given the set of anchor words S = {s_1, ..., s_K}, we now need to figure out how best to approximate each non-anchor word vector as a convex combination of the anchor word vectors. Recall that S was a greedy approximation, and real data does not necessarily satisfy the anchor-word assumption. Thus the convex coefficients for each non-anchor word must be chosen to minimize the gap between the original non-anchor word vector and the reconstructed convex combination. That is, for each word i,

    find (c_{i1}, ..., c_{iK}) ∈ [0, 1]^K that minimizes ‖q̄_i − Σ_{k=1}^{K} c_{ik} q̄_{s_k}‖_2^2
    subject to Σ_{k=1}^{K} c_{ik} = 1 and c_{ik} ≥ 0.    (2)

Because of the constraints, this optimization is more complicated than an ordinary least squares fit, but it can be solved using Algorithm 2. Note that the c_{ik} computed above represents the conditional probability of topic k given word i. By Bayes' theorem,

    a_{ik} = p(word = i | topic = k)
           = p(topic = k | word = i) p(word = i) / Σ_{i'} p(topic = k | word = i') p(word = i')
           = c_{ik} w_i / Σ_j c_{jk} w_j,    (3)

where

    w_i = p(word = i) = Σ_j p(word_1 = i, word_2 = j) = Σ_j q_{ij}.    (4)
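Once the coefficients c_{ik} are available (Algorithm 2 below computes them), the Bayes-rule recovery in (3)-(4) takes only a few lines. A minimal sketch, assuming C is a V × K matrix with C[i, k] = p(topic = k | word = i) and Q holds the raw co-occurrence counts (names are illustrative):

    # Recover the word-topic matrix A from the convex coefficients C via (3)-(4).
    function recover_A(C::AbstractMatrix, Q::AbstractMatrix)
        w = vec(sum(Q, dims=2))      # w_i = p(word = i), up to a constant, as in (4)
        A = C .* w                   # numerator: c_{ik} * w_i
        A ./= sum(A, dims=1)         # column-normalize: a_{ik} = c_{ik} w_i / Σ_j c_{jk} w_j
        return A
    end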

Algorithm 2: Constrained Minimization
In: the V × K matrix T, the V × 1 column vector b
Out: the set of convex coefficients x ∈ Δ_K that minimizes ‖T x − b‖_2

def Minimize-Gap(T, b)
    x ← (1/K, ..., 1/K)                               so that Σ_{k=1}^{K} x_k = 1
    isconverge ← False
    while not isconverge do
        p ← ∇‖T x − b‖_2^2 = 2 T^T (T x − b) ∈ R^K    compute the gradient
        for k = 1 to K do
            x_k ← x_k e^{−η_t p_k}                    component-wise multiplicative update
        end for
        x ← x / ‖x‖_1                                 project onto the simplex
        p ← ∇‖T x − b‖_2^2                            recompute the gradient
        if (p − p_min 1)^T x < ε then                 test convergence
            isconverge ← True
        end if
    end while
    return x

Notation: Δ_K denotes the simplex and p_min denotes the minimum component of p.
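The assignment below asks you to implement this update yourself (the simplex_nnls_eg routine). Purely for illustration, a minimal sketch of the exponentiated-gradient iteration in current Julia might look like the following; the function name, step size eta, tolerance tol, and iteration cap maxit are arbitrary choices, not values prescribed by the handout or the provided code:

    using LinearAlgebra

    # Exponentiated gradient on the simplex for min_x ||T x - b||^2, x in the simplex.
    function minimize_gap(T::AbstractMatrix, b::AbstractVector;
                          eta=50.0, tol=1e-7, maxit=500)
        K = size(T, 2)
        x = fill(1.0 / K, K)              # start at the barycenter of the simplex
        for it in 1:maxit
            p = 2 .* (T' * (T * x - b))   # gradient of ||T x - b||^2
            x .*= exp.(-eta .* p)         # component-wise multiplicative update
            x ./= sum(x)                  # renormalize onto the simplex (x stays positive)
            p = 2 .* (T' * (T * x - b))   # recompute the gradient
            if dot(p .- minimum(p), x) < tol   # gap test from Algorithm 2: (p - p_min 1)' x < eps
                break
            end
        end
        return x
    end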

4 Dataset

We provide one data set in the repository, computed from the Yelp review corpus, consisting of 20K documents and 1573 distinct words. This data set consists of a word co-occurrence matrix and a list of words. You can run the inference on the Yelp data set by running the script driver.jl; the anchor words and the top twenty words for each topic will be written to the file topics.txt.

You may also want to try some other data sets from the UCI machine learning data set repository that involve larger vocabularies. To download the relevant data, run

    make fetch_kos
    make fetch_nips
    make fetch_enron

You can compute a corresponding topic word list by running (for example)

    include("anchor_words.jl")
    driver_uci("kos")

at the Julia prompt.

5 Your assignment

1. You are given a reference implementation that solves the constrained minimization problem using an accurate but expensive algorithm. Make sure you can run this implementation on the compute nodes. On C4, try this:

       cd cs5220/topics
       module load julia
       csub julia driver.jl

   Take note of the time required for the different phases of the computation. You may also want to try a somewhat larger data set.

2. Implement the exponentiated gradient algorithm (Algorithm 2) as an alternative to the current simplex strategy. You may want to bound the number of iterations if you see slow convergence. This involves filling in the simplex_nnls_eg routine in the code. How much does this improve the speed? If you are feeling ambitious and know something about this subject, you may also choose to implement an alternative algorithm.

3. Parallelize the algorithm. I recommend focusing on the most expensive computation, which is in compute_A. This is embarrassingly parallel, but you may find that you need to think about data locality. For the purposes of this assignment, I recommend submitting your Julia script with ompsub in order to reserve the number of processors you need; but if you decide you want more than the eight processors available on one node, you can start Julia on the front node and run the addprocs_htc command to launch jobs on the remote nodes. (A rough sketch of one possible structure appears at the end of this handout.)

4. Analyze the performance of the parallel algorithm. Where are the bottlenecks? What sort of parallel speedup do you see? How fast can you get the algorithm to go? You may also want to comment on the tradeoffs involved in a sloppy solve.

Your final submission should include:

- A modified version of the anchor_topics.jl code.
- The topics.txt file produced by your final implementation.
- A writeup. Your writeup must include a speedup plot (speedup vs. number of processors) with at least seven processors and a description of bottlenecks (observed via the profiler or tic/toc). Make sure you also report the time taken for the phases of the code that you have not parallelized (including I/O).
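As a starting point for the parallelization task, here is a hedged sketch of the embarrassingly parallel structure of the per-word solves, written against the current Distributed standard library and using plain addprocs with local workers (the C4-specific launchers ompsub and addprocs_htc mentioned above are what you would actually use on the cluster). The names Qbar, anchors, fit_row, and parallel_coefficients are illustrative and do not correspond to the provided code:

    using Distributed
    addprocs(7)                  # local workers; on C4 use the cluster launcher instead

    @everywhere using LinearAlgebra

    # Each non-anchor word's simplex-constrained fit is independent of the others,
    # so the rows can be distributed across workers with pmap.
    @everywhere function fit_row(Tanchors::Matrix{Float64}, q::Vector{Float64})
        # Placeholder: call your simplex solver (exact or exponentiated gradient) here.
        # As a stand-in, an unconstrained least-squares fit:
        return Tanchors \ q
    end

    function parallel_coefficients(Qbar::Matrix{Float64}, anchors::Vector{Int})
        Tanchors = permutedims(Qbar[anchors, :])      # V x K matrix of anchor rows as columns
        rows = [Qbar[i, :] for i in 1:size(Qbar, 1)]
        # Note the data-locality issue flagged in the assignment: this closure ships
        # Tanchors with the tasks; broadcasting it to workers up front, or chunking the
        # words, avoids repeated communication.
        C = pmap(q -> fit_row(Tanchors, q), rows)     # one task per word
        return reduce(hcat, C)'                       # V x K coefficient matrix
    end

    # Usage (illustrative): C = parallel_coefficients(Qbar, anchors)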