Convex Optimization 10-725/36-725: Homework 2, due Oct 3


Instructions: You must complete Problems 1-3 and either Problem 4 or Problem 5 (your choice between the two). When you submit the homework, upload a single PDF (e.g., produced by LaTeX, or scanned handwritten exercises) for the solution of each problem separately to Blackboard. You should write your name at the top of each file, except for the first problem: your solution to Problem 1 (the mastery set) should appear completely anonymous to a reader.

1 Mastery set [25 points] (Aaditya)

Be very concise. If written up and scanned, it should be less than two sides in total for the whole mastery set. If LaTeX-ed up, it should be less than one side.

A [2 + 2] Let S be the set of all minimizers of a convex function f. Prove that S is convex. Now suppose that f is strictly convex. Prove that S contains only one element, i.e., that f has a unique minimizer.

B [2 + 2] What are the singular values of an n × n square orthogonal matrix? (Justify in one line.) Show with a short example that the set of all orthogonal matrices is non-convex.

C [2 + 2 + 2] Show that every convex combination of orthogonal matrices has spectral norm at most 1. In the second recitation, we showed that the spectral norm ‖A‖_op is a convex function of A; here, prove that the unit spectral norm ball {A : ‖A‖_op ≤ 1} is a convex set. Is the spectral norm a Lipschitz function with respect to the spectral norm, i.e., is |‖A‖_op − ‖B‖_op| ≤ L‖A − B‖_op for some L? (Hint: use properties of norms from the first recitation.)

D [5] Use the above part to show that the convex hull of the orthogonal matrices is the unit spectral norm ball. (Hint: the reverse direction still needs to be shown.)

E [4+2] For a differentiable λ-strongly convex function f, starting from the Taylor-like first-order definition, prove that ‖∇f(y) − ∇f(x)‖_2 ≥ λ‖y − x‖_2. (Hint: the proof is four lines long; there are plenty of hints in the second recitation.) If, additionally, its gradient is L-Lipschitz, what can we say about λ and L?

2 Subgradients of matrix norms [30 points, 8+9+9+4] (Yifei)

In this problem, you'll consider two types of matrix norms: the trace norm (also called the nuclear norm) and the operator norm (also called the spectral norm). Recall that for a matrix A, its trace norm is ‖A‖_tr = ∑_{i=1}^r σ_i(A), the sum of the singular values of A, and its operator norm is ‖A‖_op = max_i σ_i(A), the largest singular value of A.

An important fact to know is that, in the matrix world, the inner product between matrices A, B is given by tr(B^T A), where tr(·) is the trace operator (sum of diagonal elements). [To convince yourself of this, think about unraveling A, B as vectors and performing a usual inner product; check that this matches tr(B^T A). A short numerical check also appears below.]

Now assume that A ∈ R^{m×n} has rank r and singular value decomposition A = UΣV^T, where U ∈ R^{m×r}, Σ ∈ R^{r×r}, V ∈ R^{n×r}. Parts (a) and (b) concern the subgradients of the trace norm evaluated at A (we used these subgradients in lecture to derive a proximal gradient algorithm for matrix completion). Parts (c) and (d) concern the subgradients of the operator norm at A.
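As a quick aside (a minimal sanity check, not part of the graded problem), the following MATLAB lines verify the inner product fact above on random matrices; the sizes are arbitrary.

    % The matrix inner product tr(B'*A) equals the usual inner product of the
    % unraveled (vectorized) matrices.
    A = randn(5, 3);
    B = randn(5, 3);
    lhs = trace(B' * A);
    rhs = A(:)' * B(:);                              % unravel A, B and take a dot product
    fprintf('difference = %.2e\n', abs(lhs - rhs));  % should be at machine precision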

(a) [8 points] Prove that for any W ∈ R^{m×n} with ‖W‖_op ≤ 1, U^T W = 0, W V = 0, we have ‖UV^T + W‖_op ≤ 1. (Hint: look at the singular value decomposition of UV^T + W.)

(b) [4 points] Prove that for any such W as in (a), we have tr((UV^T + W)^T A) = tr(Σ).

(c) [4 points] Prove that the subdifferential of the trace norm evaluated at A satisfies

∂‖A‖_tr ⊇ {UV^T + W : W ∈ R^{m×n}, ‖W‖_op ≤ 1, U^T W = 0, W V = 0}.

[Hint: it will be helpful to use the fact that the dual of the trace norm is the operator norm, i.e., ‖A‖_tr = max_{‖B‖_op ≤ 1} tr(B^T A). Now recall the rule for subgradients of functions defined via a max operation.]

Note: the above is actually an equality, not a containment, for ∂‖A‖_tr. Proving the reverse direction is only a little bit more tricky.

(d) [10 points] Using the same strategy that was used in parts (a)-(c), prove that

∂‖A‖_op ⊇ conv({u_j v_j^T : Σ_{jj} = Σ_{11}}),

where again we use A = UΣV^T to denote the SVD of A, and u_j, v_j are the jth columns of U, V, respectively. (Note that Σ_{11} is the largest entry in Σ.) [Hint: as before, use the dual relationship between the trace and operator norms, ‖A‖_op = max_{‖B‖_tr ≤ 1} tr(B^T A), and the rule for subgradients of functions defined by a maximum.]

(e) [4 points] The result in (d) is actually an equality for ∂‖A‖_op, though again, we will not prove the reverse containment since it is a little more tricky. Assuming this fact, when does the subdifferential ∂‖A‖_op contain a single element? Hence, what can you conclude about the differentiability of the operator norm at a matrix A?

3 Algorithms for matrix completion [30 points] (Sashank)

In this problem we will compare generalized gradient descent and subgradient descent on the task of matrix completion. Given a partially observed matrix Y, we can formulate matrix completion as optimizing the following objective:

B_λ = argmin_{B ∈ R^{m×n}} (1/2) ∑_{(i,j) ∈ Ω} (Y_{ij} − B_{ij})^2 + λ‖B‖_tr    (1)
    = argmin_{B ∈ R^{m×n}} (1/2) ‖P_Ω(Y − B)‖_F^2 + λ‖B‖_tr,    (2)

where Y ∈ R^{m×n} is the partially observed matrix, Ω ⊆ {1, …, m} × {1, …, n} is the observed entry set, ‖·‖_tr is the trace norm, and P_Ω(·) : R^{m×n} → R^{m×n} is the projection operator onto the observed set Ω:

[P_Ω(X)]_{ij} = X_{ij} if (i, j) ∈ Ω, and [P_Ω(X)]_{ij} = 0 if (i, j) ∉ Ω.
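For concreteness, here is a minimal MATLAB sketch of P_Ω and of the objective in (2), assuming the observed set is stored as a logical mask Omega of the same size as Y; the variable names are illustrative, not prescribed by the assignment.

    % Projection onto the observed entries: zero out everything outside Omega.
    P_Omega = @(X, Omega) X .* Omega;

    % Objective in (2): squared error on the observed entries plus the trace norm penalty.
    objective = @(B, Y, Omega, lambda) ...
        0.5 * norm(P_Omega(Y - B, Omega), 'fro')^2 + lambda * sum(svd(B));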

We are going to compare different algorithms for optimizing the above objective on a subset of the MovieLens dataset. The MovieLens dataset consists of a large, sparse matrix Y, where the rows and columns of Y represent users and movies, respectively. Specifically, the entry Y_{ij} represents the rating of user i for movie j. Since not all users rated all movies, many entries of Y are missing. Your task in this question is to complete the movie ratings matrix Y.

For detailed analysis, we actually provide two matrices Y: a training matrix Y^tr and a test matrix Y^te. You can think of the two matrices as two different subsamples from some ground-truth full matrix. We will use Y^tr to learn a completed matrix B_λ for each value of λ, and then use the error between Y^te and B_λ to choose the best λ. In other words, we will choose a λ which generalizes well to new incomplete information of the same kind. Let P_Ω^te be the projection onto the observed entry set for the test matrix, and k the number of observed entries in Y^te. The test error (RMSE) used to choose λ is defined as follows:

RMSE = ‖P_Ω^te(Y^te − B_λ)‖_F / √k.

Download the data from http://www.stat.cmu.edu/~ryantibs/convexopt/homeworks/movie_data.zip.

(a) [10] Recall the soft-impute algorithm discussed in lecture, with updates

B^+ = S_λ(P_Ω(Y^tr) + P_Ω^⊥(B)),

where S_λ(·) is the matrix soft-thresholding operator, and P_Ω^⊥ = I − P_Ω is the projector onto the unobserved set. Recall that this is just proximal gradient descent with a fixed step size t = 1. Implement the soft-impute function in MATLAB or R with Λ = logspace(0, 3, 30). You can just implement a naive function for matrix soft-thresholding, which simply computes the entire SVD and uses it for thresholding. For each of the 30 values of λ, run soft-impute on the training data until convergence, starting from B = 0. The stopping criterion is |f^{k+1} − f^k| / f^k < 10^{-4}, where f^k is the objective function value at the kth iteration, or a maximum of 500 iterations (whichever occurs first).
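The following is a minimal MATLAB sketch of the above update, not a reference implementation; it assumes the training matrix Ytr and the logical mask Omega of its observed entries from the earlier sketch, and a single value of lambda.

    % Soft-impute: proximal gradient descent with fixed step size t = 1.
    B = zeros(size(Ytr));
    fvals = zeros(500, 1);
    for k = 1:500
        Z = B;
        Z(Omega) = Ytr(Omega);                    % P_Omega(Ytr) + P_Omega_perp(B)
        B = soft_threshold(Z, lambda);            % prox step (singular value shrinkage)
        fvals(k) = 0.5 * norm(Ytr(Omega) - B(Omega))^2 + lambda * sum(svd(B));
        if k > 1 && abs(fvals(k) - fvals(k-1)) / fvals(k-1) < 1e-4
            break;                                % stopping criterion from part (a)
        end
    end

    % Naive matrix soft-thresholding S_lambda via a full (economy-size) SVD.
    % In MATLAB this local function goes at the end of the script, or in its
    % own file soft_threshold.m.
    function X = soft_threshold(Z, lambda)
        [U, S, V] = svd(Z, 'econ');
        X = U * diag(max(diag(S) - lambda, 0)) * V';   % shrink the singular values
    end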

(b) [5] Record and plot the number of iterations it took to converge at each value of log(λ). Plot the RMSE on the training and test data versus log(λ). What value of λ would you choose, and what is the rank of the corresponding solution?

(c) [5] Now, across the 30 values of λ, again run soft-impute until convergence, but this time using warm starts, i.e., at each successive value of λ (in sorted order, from largest to smallest), start the algorithm at the previously computed solution. Again record the number of iterations required to converge at each λ value. Did the number of iterations change in comparison to running without warm starts? If so, then why, roughly speaking, do warm starts work?

(d) [10] Derive the subgradient method steps for the optimization problem, and implement them. Run soft-impute and the subgradient method for 500 iterations and compare their objective function values over Λ = {1, 5, 10}. For the subgradient method, use the step size t_k = 1/k at the kth iteration.

Your comparison should involve three figures, one for each λ value. In each, plot f^k − f_min versus the iteration number k for both the subgradient and soft-impute methods. Here f_min is the minimum value of the objective function found overall (by either the subgradient or soft-impute method), and f^k is the current value of the criterion at the kth iteration (for the subgradient method, since it is not a descent method, this is the best value seen up until the kth iteration). It will be helpful to put the y-axis in these figures on a log scale. Which algorithm performed better?

4 Convergence rate of generalized gradient descent [15 points] (Adona)

Recall that the generalized gradient descent method for minimizing a composite function f(x) = g(x) + h(x), for convex, differentiable g : R^n → R and convex h : R^n → R, begins with an initial point x^{(0)} ∈ R^n, and repeats

x^{(k)} = prox_{t_k}(x^{(k−1)} − t_k ∇g(x^{(k−1)})).    (3)

Here we will assume that ∇g is Lipschitz with constant L > 0, i.e., ‖∇g(x) − ∇g(y)‖_2 ≤ L‖x − y‖_2 for all x, y ∈ R^n, and we will prove that by taking a fixed step size t_k = t ≤ 1/L for all k = 1, 2, 3, …, the generalized gradient algorithm has the exact same convergence guarantees as does gradient descent.

It will be helpful to define the generalized gradient G_t of f so that the updates in (3) look like gradient descent updates. This is

prox_t(x − t∇g(x)) = x − t G_t(x),    (4)

i.e., G_t(x) = (x − prox_t(x − t∇g(x))) / t.
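For reference, and to connect update (3) to Problem 3 (this is the standard fact behind the soft-impute step, recalled here for convenience rather than required by the problem): when h(B) = λ‖B‖_tr, the proximal operator in (3) is exactly matrix soft-thresholding,

prox_t(Z) = argmin_{B ∈ R^{m×n}} (1/(2t)) ‖Z − B‖_F^2 + λ‖B‖_tr = S_{tλ}(Z) = U diag((σ_i − tλ)_+) V^T,

where Z = U diag(σ_1, …, σ_r) V^T is an SVD of Z; taking t = 1 recovers the soft-impute update of Problem 3(a).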

(a) [2] Show that ∇g being Lipschitz with constant L implies that, for all x, y,

f(y) ≤ g(x) + ∇g(x)^T (y − x) + (L/2) ‖y − x‖_2^2 + h(y).

(b) [2] By plugging y = x^+ = x − t G_t(x) into the result from (a), and letting t ≤ 1/L, show that

f(x^+) ≤ g(x) − t ∇g(x)^T G_t(x) + (t/2) ‖G_t(x)‖_2^2 + h(x − t G_t(x)).

(c) [4] Arguing directly from (4), use the definition of the proximal operator,

prox_t(x) = argmin_{z ∈ R^n} (1/(2t)) ‖x − z‖_2^2 + h(z),

and the zero-subgradient characterization of optimality, to show that

G_t(x) − ∇g(x) ∈ ∂h(x − t G_t(x)).

(d) [3] Starting from the bound you had in (b), use the result of part (c), and the convexity of g, h around an arbitrary z, to show that, for all z,

f(x^+) ≤ f(z) + G_t(x)^T (x − z) − (t/2) ‖G_t(x)‖_2^2.

(e) [2] Take z = x in (d) to verify that the generalized gradient descent algorithm decreases the criterion f at each iteration. Take z = x*, a minimizer of f, to yield

f(x^+) ≤ f(x*) + (1/(2t)) (‖x − x*‖_2^2 − ‖x^+ − x*‖_2^2).

(f) [2] Complete the proof, following the same strategy as in gradient descent, to conclude that

f(x^{(k)}) − f(x*) ≤ ‖x^{(0)} − x*‖_2^2 / (2tk).

5 Accelerating matrix completion [15 points] (Adona)

In this problem we will continue exploring the matrix completion algorithms of Q3, in particular how they combine with acceleration and backtracking, and how they behave on a realistic dataset.

(a) [7] Implement the generalized gradient descent (soft-impute) algorithm of Q3(a) with acceleration; a minimal sketch of the accelerated update appears after part (b). Run both the non-accelerated and accelerated versions for different λs in Λ = logspace(0, 3, 30), and plot the number of iterations the two algorithms took to converge for each value of λ, starting from B = 0. As before, run the algorithms until either |f^{k+1} − f^k| / f^k < 10^{-4} (where f^k is the objective function value at the kth iteration), or until a maximum of 500 iterations (whichever occurs first). Further, for fixed λ = 10, plot and compare the value of the objective at each iteration with and without acceleration. What do you observe? Does acceleration help for this problem?

(b) [2] Do the same as in part (a), but now using warm starts. Is there any difference in the comparison between no acceleration and acceleration?
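The following is a minimal MATLAB sketch of the accelerated soft-impute iteration (proximal gradient with a standard momentum term), reusing the soft_threshold routine and the mask Omega from the Problem 3 sketch; it is illustrative, not a prescribed implementation, and the stopping check is omitted for brevity.

    % Accelerated proximal gradient (soft-impute with momentum), step size t = 1.
    B = zeros(size(Ytr));
    B_prev = B;
    for k = 1:500
        V = B + ((k - 2) / (k + 1)) * (B - B_prev);   % momentum extrapolation
        Z = V;
        Z(Omega) = Ytr(Omega);                        % P_Omega(Ytr) + P_Omega_perp(V)
        B_prev = B;
        B = soft_threshold(Z, lambda);                % prox step at the extrapolated point
        % (apply the same objective-based stopping criterion as in Q3(a) here)
    end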

(c) [6] Next we will explore how matrix completion performs as an algorithm for image reconstruction. Download the image of the Mona Lisa from http://www.stat.cmu.edu/~ryantibs/convexopt/homeworks/mona_bw.jpg. Construct a test image by randomly subsampling 50% of the pixels, and setting the remaining pixels to zero. Run matrix completion with accelerated generalized gradient descent on the subsampled image for a few (10-15) different λs in the range 10^{-2} to 10^{2}. What do you observe? Show the original image, the subsampled image, and 3-4 reconstructions across the range of λs. Which λ returns the best results? Repeat this experiment at a 20% subsampling level, and show the best reconstructed image. What λ did you choose in this case, and was it different from the best λ at the 50% subsampling level?
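A minimal MATLAB sketch of the setup for this experiment; accelerated_soft_impute stands in for whatever routine you wrote in part (a) (a hypothetical name, not a provided function).

    % Build the 50%-subsampled test image.
    img = double(imread('mona_bw.jpg'));      % grayscale Mona Lisa image
    Omega = rand(size(img)) < 0.5;            % keep a random 50% of the pixels
    Y = img .* Omega;                         % remaining pixels set to zero

    % Reconstruct for a few values of lambda, e.g. lambda = logspace(-2, 2, 10):
    % B_hat = accelerated_soft_impute(Y, Omega, lambda);   % hypothetical routine from part (a)
    % imshow(uint8(B_hat));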