ALGORITHMS FOR DICTIONARY LEARNING

ANKUR MOITRA, MASSACHUSETTS INSTITUTE OF TECHNOLOGY

SPARSE REPRESENTATIONS
Many data types are sparse in an appropriately chosen basis: a dictionary A (n × m) times representations x_i (the columns of an m × p matrix, each with at most k non-zeros) gives the data b_i (the columns of an n × p matrix), i.e. A x_i = b_i; e.g. images and signals.
Dictionary Learning: Can we learn A from examples?

APPLICATIONS OF DICTIONARY LEARNING (a.k.a. sparse coding)
Signal Processing/Statistics: de-noising, edge-detection, super-resolution; block compression for images/video.
Machine Learning: sparsity as a regularizer to prevent over-fitting; learned sparse representations play a key role in deep learning.
Computational Neuroscience (Olshausen-Field 1997): applied to natural images, dictionary learning yields filters with the same qualitative properties as receptive fields in V1.

OUTLINE Are there efficient algorithms for dictionary learning?
Introduction: Origins of Sparse Recovery; A Stochastic Model; Our Results
Provable Algorithms via Overlapping Clustering: Uncertainty Principles; Reformulation as Overlapping Clustering
Analyzing Alternating Minimization: Gradient Descent on Non-Convex Functions

ORIGINS OF SPARSE RECOVERY
Donoho-Stark, Donoho-Huo, Gribonval-Nielsen, Donoho-Elad: given an n × m dictionary A and Ax = b, where x has at most k non-zeros.
Incoherence: A is μ-incoherent if max_{i≠j} |⟨A_i, A_j⟩| ≤ μ/√n (for spikes-and-sines, μ = 1).
If k ≤ √n/(2μ), then x is the sparsest solution to the linear system, and it can be found with L1-minimization.
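
To make the incoherence measure concrete, here is a small numpy check (my addition, not from the talk) using a spikes-plus-Hadamard dictionary as a real-valued stand-in for spikes-and-sines: the scaled coherence √n · max_{i≠j} |⟨A_i, A_j⟩| comes out to μ = 1.

```python
import numpy as np
from scipy.linalg import hadamard

def coherence(A):
    """Return mu = sqrt(n) * max_{i != j} |<A_i, A_j>| for a column-normalized A."""
    A = A / np.linalg.norm(A, axis=0)          # normalize columns
    G = np.abs(A.T @ A)                        # all pairwise inner products
    np.fill_diagonal(G, 0.0)                   # ignore the i = j entries
    return np.sqrt(A.shape[0]) * G.max()

n = 64                                                   # power of 2 for hadamard()
A = np.hstack([np.eye(n), hadamard(n) / np.sqrt(n)])     # "spikes" + Hadamard basis
print(coherence(A))                                      # approx. 1.0, i.e. mu = 1
```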

THE FULL RANK CASE Are there efficient algorithms for dictionary learning?
Case #1: A has full column rank.
Theorem [Spielman, Wang, Wright 13]: There is a polynomial-time algorithm to exactly learn A when it has full column rank, for k ≤ √n (hence m ≤ n).
Approach: find the rows of A^{-1} using L1-minimization.
Stochastic Model: for an unknown dictionary A, generate x with a support of size k chosen uniformly at random, choose the non-zero values independently, and observe b = Ax.
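
A minimal numpy sketch of this stochastic model (the Gaussian choice for the non-zero values and all names are my own assumptions; the model only specifies a uniformly random size-k support and independent non-zeros):

```python
import numpy as np

def sample_from_model(A, k, p, rng=np.random.default_rng(0)):
    """Draw p samples b = A x, each x with a uniformly random support of size k
    and independent non-zero values (standard Gaussians here, an assumption)."""
    n, m = A.shape
    X = np.zeros((m, p))
    for j in range(p):
        support = rng.choice(m, size=k, replace=False)   # support of size k, u.a.r.
        X[support, j] = rng.standard_normal(k)           # independent non-zero values
    B = A @ X                                            # observed samples
    return B, X
```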

Notation: AX = B, where the columns of B and X are the samples and their representations, respectively.
Claim: row-span(B) = row-span(X).
Claim: The sparsest vectors in row-span(X) (equivalently, row-span(B)) are the rows of X.
Can we find the sparsest vector in row-span(X)?
Approach #1 (P0): min ||w^T B||_0 s.t. w ≠ 0. This is NP-hard.
Approach #2, L1-relaxation (P1): min ||w^T B||_1 s.t. w^T r = 1, where we will set r later.

(P1): min ||w^T B||_1 s.t. w^T r = 1
Consider the bijection z = A^T w, and set r = Ac. We get
(P1): min ||w^T AX||_1 s.t. w^T Ac = 1,
which is equivalent to
(Q1): min ||z^T X||_1 s.t. z^T c = 1.
Set r = a column of B; then c = A^{-1} r = the corresponding column of X.
Claim: If c has a strictly largest coordinate in absolute value (|c_i| > |c_j| for all j ≠ i), then whp the solution to (Q1) is e_i.
Claim: Then the solution w to (P1) gives w^T B = the i-th row of X.

Approach #2, L1-relaxation (P1): min ||w^T B||_1 s.t. w^T r = 1.
Hence, varying r over the columns of B, we can find the rows of X, and then solve for A.
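
To make the L1 step concrete: (P1) becomes a linear program after introducing slack variables t with |B^T w| ≤ t coordinate-wise. The sketch below (my reconstruction of the idea, not the authors' code) solves it with scipy's LP solver; with r set to a column of B, w^T B then recovers a row of X up to scaling.

```python
import numpy as np
from scipy.optimize import linprog

def solve_p1(B, r):
    """(P1): minimize ||w^T B||_1 subject to w^T r = 1, written as an LP.

    Variables (w, t): minimize sum(t) with  -t <= B^T w <= t  and  r^T w = 1.
    """
    n, p = B.shape
    c = np.concatenate([np.zeros(n), np.ones(p)])        # objective: sum of slacks t
    A_ub = np.block([[B.T, -np.eye(p)],                   #  B^T w - t <= 0
                     [-B.T, -np.eye(p)]])                 # -B^T w - t <= 0
    b_ub = np.zeros(2 * p)
    A_eq = np.concatenate([r, np.zeros(p)])[None, :]      # r^T w = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] * n + [(0, None)] * p)
    w = res.x[:n]
    return w, w @ B        # w @ B should be (a scaling of) one row of X
```

Repeating this with different columns of B as r exposes different rows of X, which is how the full-rank algorithm assembles X and then A.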

THE OVERCOMPLETE CASE What about overcomplete dictionaries? (They are more expressive.)
Case #2: A is incoherent.
Theorem [Arora, Ge, Moitra 13]: There is an algorithm that learns A within ε when A is n × m and μ-incoherent, for k ≤ min(√n/(μ log n), m^(1/2-η)). The running time and sample complexity are poly(n, m, log 1/ε).
Approach: learn the supports of the representations X first, by solving an overlapping clustering problem.
Theorem [Agarwal et al 13]: There is a polynomial-time algorithm that learns A when it is μ-incoherent, for k ≤ n^(1/4)/μ.

THE MODEL What about overcomplete dictionaries? (They are more expressive.)
Case #2: A is incoherent.
Theorem [Barak, Kelner, Steurer 14]: There is a quasi-polynomial-time algorithm that learns A to within any constant accuracy when it is μ-incoherent, for k ≤ n^(1-η), using the sum-of-squares hierarchy.
Approach: find a y that approximately maximizes E[(b^T y)^4], via a poly-logarithmic number of rounds; such a y is close to a column of A.
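
The actual Barak-Kelner-Steurer algorithm runs the sum-of-squares hierarchy; the sketch below only illustrates the objective being relaxed, by naive projected gradient ascent on the empirical fourth moment (1/p) Σ_j (b_j^T y)^4 over the unit sphere. It is essentially tensor power iteration, not the provable algorithm of the theorem.

```python
import numpy as np

def fourth_moment_ascent(B, iters=200, rng=np.random.default_rng(0)):
    """Heuristic: ascend the empirical objective (1/p) * sum_j <b_j, y>^4 on the sphere.

    The gradient is (4/p) * sum_j <b_j, y>^3 b_j; we take the normalized gradient
    as the next iterate, i.e. a power-iteration-style update.
    """
    n, p = B.shape
    y = rng.standard_normal(n)
    y /= np.linalg.norm(y)
    for _ in range(iters):
        g = B @ ((B.T @ y) ** 3) / p       # proportional to the gradient of the objective
        y = g / np.linalg.norm(g)          # project back onto the unit sphere
    return y                               # hoped to be near some column of A
```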

OUTLINE Are there efficient algorithms for dictionary learning?
Introduction: Origins of Sparse Recovery; A Stochastic Model; Our Results
Provable Algorithms via Overlapping Clustering: Uncertainty Principles; Reformulation as Overlapping Clustering
Analyzing Alternating Minimization: Gradient Descent on Non-Convex Functions

UNCERTAINTY PRINCIPLES
Claim: Given A, b and k, it is NP-hard to decide whether there is a k-sparse x such that Ax = b.
Why is this easier for incoherent dictionaries?
Uncertainty Principle: If A is μ-incoherent, then ⟨Ax, Ay⟩ ≈ ⟨x, y⟩ provided that x and y are k-sparse, for k ≤ √n/(2μ).
Proof: A^T A restricted to the supports of x and y is k × k, with (A^T A)_{ij} = 1 if i = j and |(A^T A)_{ij}| ≤ μ/√n if i ≠ j; then use Gershgorin's Disk Theorem.
This principle can be used to establish uniqueness for sparse recovery, and facts like: b cannot be sparse in both the standard and the Fourier basis.
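
A quick numerical illustration of the uncertainty principle (my sketch): for a random column-normalized dictionary, which is incoherent whp, and two k-sparse vectors with overlapping supports, ⟨Ax, Ay⟩ stays close to ⟨x, y⟩, as Gershgorin's theorem predicts since the restricted A^T A is close to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 1000, 2000, 5

A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)               # random unit columns: incoherent whp

supp = rng.choice(m, 2 * k, replace=False)   # two overlapping supports of size k
x = np.zeros(m)
y = np.zeros(m)
x[supp[:k]] = rng.standard_normal(k)
y[supp[k // 2 : k // 2 + k]] = rng.standard_normal(k)

print(np.dot(A @ x, A @ y))                  # approximately equal ...
print(np.dot(x, y))                          # ... to this, up to a small error
```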

A PAIR-WISE TEST
Given Ax = b and Ax' = b', do x and x' have intersecting supports?
If ⟨x, x'⟩ is non-zero, the supports intersect; if it is zero, they may or may not.
Uncertainty Principle: ⟨x, x'⟩ ≈ ⟨Ax, Ax'⟩ for k-sparse x, x' and incoherent A. So if ⟨Ax, Ax'⟩ is non-zero then, whp, the supports intersect; if it is zero, maybe.
Approach: Build a graph G on the p samples, with an edge between b and b' if and only if |b^T b'| > 1/2.
For the purposes of this talk: the probability of an edge between b and b' is 1/2 iff supp(x) and supp(x') intersect.
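
A direct translation of this step into numpy (my sketch; the 1/2 threshold presumes the same normalization as on the slide):

```python
import numpy as np

def build_intersection_graph(B, threshold=0.5):
    """Adjacency matrix: edge between samples i and j iff |<b_i, b_j>| > threshold.

    By the uncertainty principle <b_i, b_j> ~ <x_i, x_j>, so an edge indicates
    (whp) that the two supports intersect.
    """
    G = np.abs(B.T @ B) > threshold
    np.fill_diagonal(G, False)               # no self-loops
    return G
```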

OVERLAPPING CLUSTERING
Let C_i = { b : x_i ≠ 0 } (these clusters overlap). Can we find the clusters efficiently?
Challenge: Given (x, x', x'') where every pair belongs to a common cluster, do all three belong to a common cluster too?

A TRIPLE TEST
Key Idea: Use new samples y.
Case #1: all three supports have a common intersection. The probability that y intersects all three is at least k/m: the new sample y only needs to contain one element, from the common intersection.
Case #2: no common intersection, with |supp(x) ∩ supp(x')| ≤ C, etc. The probability that y intersects all three is at most O(Ck^3/m^2): the new sample y needs to contain at least two elements from their joint union.
Triple Test: Given (x, x', x'') where all the pairs intersect: if there are at least T samples y such that (x, x', x'', y) all pairwise intersect, ACCEPT; else REJECT.
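
A sketch of the triple test exactly as stated, reusing the pairwise-intersection graph G from the previous sketch (the threshold T and all variable names are placeholders):

```python
import numpy as np

def triple_test(G, i, j, l, T):
    """Given the pairwise-intersection graph G and three pairwise-connected samples
    (i, j, l), ACCEPT iff at least T other samples y are connected to all three."""
    witnesses = G[i] & G[j] & G[l]            # samples y adjacent to i, j and l
    witnesses[[i, j, l]] = False              # do not count the triple itself
    return witnesses.sum() >= T
```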

FINDING ALL THE CLUSTERS
We can build a clustering algorithm on this primitive: for each pair (x, x'), find all x'' that pass the triple test.
Claim: This set is the union of the clusters corresponding to supp(x) ∩ supp(x').
Claim: For every cluster i, there is some pair x, x' that uniquely identifies it, i.e. supp(x) ∩ supp(x') = {i}.
Output the inclusion-wise minimal sets: these are the clusters!
Our full algorithm uses higher-order tests; the analysis goes through connections to the piercing number.
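
Putting the pieces together in a simplified sketch of the pair-based procedure (the talk's full algorithm uses higher-order tests, which are not reproduced here): for each connected pair, collect the samples that pass the triple test, then keep the inclusion-wise minimal sets.

```python
def candidate_sets(G, T):
    """For each connected pair (i, j), collect the samples passing the triple test
    with them; the inclusion-wise minimal sets are returned as the clusters."""
    p = G.shape[0]
    sets = []
    for i in range(p):
        for j in range(i + 1, p):
            if G[i, j]:
                S = frozenset(l for l in range(p)
                              if l not in (i, j) and G[i, l] and G[j, l]
                              and triple_test(G, i, j, l, T))
                if S:
                    sets.append(S)
    # keep only inclusion-wise minimal sets
    return {S for S in sets if not any(R < S for R in sets)}
```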

Many ways to get the dictionary from the clustering.
Approach #1: Relative Signs. Plan: refine C_i and find all the b's with x_i > 0.
Intuition: If supp(x) ∩ supp(x') = {i}, then we can find the relative sign of x_i and x'_i, and there are many such pairs: enough that whp we can find all the relative signs by transitivity.
Claim: E[b | Ax = b and x_i > 0] = A_i · E[x_i | x_i > 0]. Hence the empirical average of these samples converges to A_i (up to scaling).
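
Once a cluster has been refined to the samples believed to have x_i > 0, the claim E[b | x_i > 0] = A_i · E[x_i | x_i > 0] suggests the following estimator (my sketch; it recovers A_i only up to the positive scaling E[x_i | x_i > 0], hence the normalization):

```python
import numpy as np

def estimate_column(B, positive_cluster):
    """Average the samples b whose i-th coefficient is (believed to be) positive.

    By the claim above, this average converges to A_i * E[x_i | x_i > 0], so
    normalizing gives an estimate of the unit-norm column A_i.
    """
    avg = B[:, positive_cluster].mean(axis=1)
    return avg / np.linalg.norm(avg)
```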

Approach #2: SVD. Suppose we restrict to the samples b with x_i ≠ 0.
Intuition: E[bb^T | x_i ≠ 0] has large variance in the direction of A_i.
We also show that alternating minimization works once we are close enough (geometric convergence).
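
The SVD approach in numpy (my sketch, assuming the cluster C_i = {b : x_i ≠ 0} is already known): estimate A_i, up to sign, as the top eigenvector of the empirical second moment over the cluster.

```python
import numpy as np

def estimate_column_svd(B, cluster):
    """Estimate A_i (up to sign) as the top eigenvector of the empirical E[b b^T | x_i != 0]."""
    Bi = B[:, cluster]
    M = Bi @ Bi.T / Bi.shape[1]                # empirical second moment over the cluster
    eigvals, eigvecs = np.linalg.eigh(M)       # symmetric matrix, eigenvalues ascending
    return eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
```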

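For completeness, a generic alternating minimization loop of the kind the talk alludes to (a textbook-style sketch under my own choices of sparse-coding and update steps, not the authors' analyzed variant): alternately sparse-code the samples against the current dictionary by hard thresholding, then re-fit the dictionary by least squares.

```python
import numpy as np

def alternating_minimization(B, A0, k, iters=20):
    """Alternate a k-sparse coding step with a least-squares dictionary update,
    starting from an initial estimate A0 (e.g. from the clustering/SVD step)."""
    A = A0.copy()
    for _ in range(iters):
        # Sparse coding: keep the k largest correlations per sample (a crude decoder).
        C = A.T @ B
        X = np.zeros_like(C)
        idx = np.argsort(-np.abs(C), axis=0)[:k]
        np.put_along_axis(X, idx, np.take_along_axis(C, idx, axis=0), axis=0)
        # Dictionary update: least-squares fit of A to B given X, then renormalize columns.
        A = B @ np.linalg.pinv(X)
        A /= np.linalg.norm(A, axis=0) + 1e-12
    return A
```

The quality of the initialization matters: the geometric-convergence guarantee mentioned above only kicks in once the starting dictionary is already close to the truth.
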
OUTLINE Are there efficient algorithms for dictionary learning?
Introduction: Origins of Sparse Recovery; A Stochastic Model; Our Results
Provable Algorithms via Overlapping Clustering: Uncertainty Principles; Reformulation as Overlapping Clustering
Analyzing Alternating Minimization (out of time): Gradient Descent on Non-Convex Functions

A CONCEPTUAL OVERVIEW
[Concept map relating "Is Learning Computationally Easy?" to problems (Nonnegative Matrix Factorization, Topic Models, Tensor Decompositions, mixtures of Gaussians, Overcomplete Models, Dictionary Learning), themes (New Models, Better Estimators, Natural Algorithms, Method of Moments), and techniques (smoothed analysis, computational geometry, spectral methods, polynomial equations, convex optimization, experiments).]

Summary:
Provable algorithms for learning incoherent, overcomplete dictionaries
Connections to overlapping clustering
Analysis of alternating minimization / gradient descent on a non-convex objective
Why does it work even from a random initialization?
Any Questions?
