Active Clustering and Ranking
Transcription
1 Active Clustering and Ranking. Rob Nowak, University of Wisconsin-Madison. IMA Workshop on "High-Dimensional Phenomena" (9/26-30, 2011). With Gautam Dasarathy, Brian Eriksson (Madison/Boston), Kevin Jamieson, and Aarti Singh (CMU).
2-3 Clustering Using Pairwise Similarities. Clustering problem: consider a set of n objects x_1, ..., x_n. To cluster the objects into meaningful groups, we can ask questions of the form "how similar is x_i to x_j?" for i, j ∈ {1, ..., n}. The goal is to cluster the objects by asking as few questions as possible. Metric data: the similarity s_ij is inversely related to dist(i, j). Standard approaches (k-means, spectral clustering, etc.) assume the full set of (n choose 2) similarities is known, but in high-dimensional settings (n large) similarities may be missing and/or costly to collect or compute. However, the existence of clusters can induce redundancy in the similarities, so it may be possible to robustly cluster based on a small subset.
4-5 Ranking Based on Pairwise Comparisons. Ranking problem: consider a set of n objects x_1, ..., x_n ∈ R^d. The locations of x_1, ..., x_{n-1} are known, but the location of x_n is unknown. To gather information about x_n, we can only ask questions of the form "is object x_n closer to x_i than to x_j?" The goal is to rank x_1, ..., x_{n-1} by their distances to x_n while asking as few questions as possible. Ordinal data: 1(δ_i < δ_j), where δ_i is the distance from x_i to x_n. Standard ranking methods assume the full set of (n choose 2) comparisons is known, but in high-dimensional settings (n large) comparisons may be missing and/or costly to collect. However, many comparisons are redundant because the objects embed in R^d, so it may be possible to rank correctly based on a small subset.
6 Cost of Obtaining Similarities. Identifying Internet topology: similarity = covariance of delay measurements; this requires sending O(N^2) probe packets, a significant burden on network resources. Phylogenetics: similarity = genome sequence alignments; aligning sequences of length ~10000 makes comparisons computationally expensive. Identifying co-regulated genes: similarity = Pearson correlation of expression levels; testing ~10000 conditions/drugs carries a high cost of experimentation. Sonar: judging similarity places an excessive burden on human experts.
7 Hierarchical Clustering. Given N objects and pairwise similarities between them, arrange the objects into a hierarchy of clusters. Sample complexity: what is the minimum number of similarities needed to cluster?
8 Motivation: Internet Topology Inference. Correlation between traffic patterns observed at two points (e.g., SYN / SYN-ACK probes) can indicate the similarity between nodes, such as the number of shared links in their paths.
9-13 Passive vs. Active Clustering. Generate a hierarchical clustering of the objects. Passive clustering: given all, or a random subset of, pre-selected pairwise similarities. Active clustering: by sequentially selecting which pairwise similarities to measure, in an adaptive fashion. Example (built up over several slides): a similarity matrix whose entries range from very dissimilar to very similar; which entries do we actually need to ask for?
14-16 Cost of Passive Clustering. To cluster N objects, most clustering methods require all N(N-1)/2 pairwise similarities. Can random sub-sampling reduce this burden? Theorem 1: To recover clusters of size m, any clustering procedure based on random sub-sampling requires at least on the order of N(N-1)/m similarities. Small clusters are hard to find using randomly selected similarities. Related fact: clusters of size O(N) can be recovered from random samples of O(N log N) similarities (Balcan & Gupta '10, Shamir & Tishby '11), and any method needs at least N similarities (at least one per object).
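One heuristic way to see the N(N-1)/m scaling (a back-of-the-envelope argument, not the proof from the talk): a uniformly sampled pair lands inside a particular cluster of size m with probability about m^2/N^2, and each of the cluster's m members needs at least one observed intra-cluster similarity before the cluster can even be assembled; observing on the order of m intra-cluster similarities therefore demands roughly m · (N^2/m^2) = N^2/m random samples.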
17 Hierarchical Clustering. Given N objects and pairwise similarities between them, arrange the objects into a hierarchy of clusters (a.k.a. a cluster tree). Identifiability: is there an unambiguous way to hierarchically cluster?
18-20 Hierarchically Consistent Similarities. Do the similarities conform to the hierarchical clustering structure? Consistency assumption: the similarities are consistent with a hierarchical clustering if, for every cluster C, every outlier i ∉ C, and all j, k ∈ C, the intra-cluster similarity exceeds the inter-cluster similarities: s_jk > max(s_ij, s_ik). This makes the outlier test well defined:
outlier(i, j, k) = i if max(s_ij, s_ik) < s_jk; j if max(s_ij, s_jk) < s_ik; k if max(s_ik, s_jk) < s_ij,
i.e., it reports which one of the three objects is least similar to the other two (equivalently, which two of the three are most similar).
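As a concrete illustration (a minimal sketch, not code from the talk), here is the outlier test in Python, assuming the similarities are stored in a symmetric NumPy array s:

```python
import numpy as np

def outlier(s, i, j, k):
    """Return whichever of i, j, k is least similar to the other two.

    s is a symmetric matrix of pairwise similarities; under the
    consistency assumption the two most-similar objects share the
    deepest cluster, and the remaining one is the outlier.
    """
    if max(s[i, j], s[i, k]) < s[j, k]:
        return i
    if max(s[i, j], s[j, k]) < s[i, k]:
        return j
    if max(s[i, k], s[j, k]) < s[i, j]:
        return k
    return None  # ties or inconsistent similarities: no clear outlier
```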
21 Active Clustering by Outlier Testing. Methodology example: start from an initial tree structure.
22 A new object arrives.
23 Its true placement in the tree is shown (but unknown to the algorithm).
24 Find an interior node v such that the tree is roughly divided in two.
25 Choose two objects whose nearest common ancestor is v.
26 Find the outlier of the three objects (those two plus the new object).
27 The outlier is found using 3 pairwise similarities.
28 This resolves which subtree the new object is in.
29 This process is repeated...
30 ...until the true location for the new object is found.
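To make the walkthrough concrete, a simplified Python sketch of one placement step (my rendering, not code from the talk: it descends from the root rather than choosing balanced separators, takes the first leaf of each subtree as its representative, and ignores ties; it reuses the outlier() helper sketched above):

```python
class Node:
    """A cluster-tree node: leaves carry an object index, interior
    nodes carry two children."""
    def __init__(self, obj=None, left=None, right=None):
        self.obj, self.left, self.right = obj, left, right

def some_leaf(node):
    """Any leaf object under this node (here: the leftmost)."""
    while node.obj is None:
        node = node.left
    return node.obj

def place(root, new, s):
    """Locate where object `new` belongs by repeated outlier tests,
    3 similarities per test, roughly log N tests per object."""
    node = root
    while node.obj is None:
        i, j = some_leaf(node.left), some_leaf(node.right)
        out = outlier(s, i, j, new)
        if out == j:          # i and new are the most similar pair
            node = node.left
        elif out == i:        # j and new are the most similar pair
            node = node.right
        else:                 # new is the outlier: it splits off here
            break
    return node  # attach `new` as a sibling of the returned node
```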
31-33 Active Clustering: Efficient Hierarchical Clustering. A binary search procedure based on the outlier test above: each object can be placed using log N outlier tests, at 3 similarities per test.
Theorem 2: Under the consistency assumption, the hierarchical clustering of N objects can be recovered using at most 3N log N sequentially selected similarities; this is within a constant factor of the information-theoretic lower bound.
Remark: The binary search strategy is closely related to computationally efficient schemes for learning tree-structured graphical models proposed in Pearl and Tarsi, "Structuring Causal Trees," J. Complexity, 1986.
34 Experimental Results with Synthetic Internet Topology (network core). Table: number of leaf nodes N; similarities required by agglomerative clustering; similarities required by the outlier-based method; savings over exhaustive approaches.
35 Sensitivity to Errors and Inconsistencies. The previous technique is very sensitive to noise: a single inconsistent similarity, one for which s_jk > max(s_ij, s_ik) even though j and k do not share a cluster excluding i, yields a wrong outlier, which can lead to cascading failures (the estimated location in the tree ends up far from the true location).
36-37 Robust Active Clustering. To overcome this fragility, we design a top-down recursive splitting approach and use voting to boost our confidence in each decision we make. Goal: in each step, split a single cluster into 2 sub-clusters, efficiently and accurately.
38-41 Cluster Splitting Procedure. Pick a seed object from the cluster to be split, and ask, for each other object x, whether x belongs with the seed. To answer, choose m test pairs uniformly at random; for each test pair (a, b), run outlier(seed, a, b) and outlier(x, a, b). If both tests name the same member of the pair (a or b) as the outlier, then x and the seed are probably clustered together.
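A schematic Python sketch of this splitting step (my illustration; the event counted and the vote threshold follow the P(A | clustered) analysis on the next slides, and m and thresh are parameters the theory prescribes; it reuses the outlier() helper sketched earlier):

```python
import random

def same_cluster_vote(s, seed, x, objects, m, thresh):
    """Vote over m random test pairs on whether x shares a cluster
    with seed, counting the paired-outlier agreement event A."""
    others = [o for o in objects if o not in (seed, x)]
    hits = 0
    for _ in range(m):
        a, b = random.sample(others, 2)
        # Event A: both tests name the same member of the test pair.
        out_seed = outlier(s, seed, a, b)
        out_x = outlier(s, x, a, b)
        if out_seed == out_x and out_seed in (a, b):
            hits += 1
    return hits / m > thresh

def split_cluster(s, objects, m, thresh):
    """One top-down splitting step: everything voted 'with the seed'
    forms one sub-cluster, the rest form the other."""
    seed = random.choice(objects)
    with_seed = [seed] + [o for o in objects if o != seed and
                          same_cluster_vote(s, seed, o, objects, m, thresh)]
    rest = [o for o in objects if o not in with_seed]
    return with_seed, rest
```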
42-49 Why? Consider outlier tests without errors. Enumerating the possible outcomes of the paired tests outlier(seed, a, b) and outlier(x, a, b) over the possible placements of the test pair (a, b): some outcomes are uninformative (marked '?'), but the outcome in which both tests name the same member of the test pair as the outlier is impossible if x and the seed are not clustered together.
50-52 Clustered vs. Non-Clustered. Let A denote the event that both tests name the same member of the test pair as the outlier. If the outlier tests are perfectly correct, then P(A | clustered) = 2η(1 − η), where η is the fractional size of the seed's cluster, and P(A | not clustered) = 0. If the outlier tests are incorrect with probability p < 1/2, then P(A | clustered) ≥ 2(1 − p)η(1 − η) and P(A | not clustered) ≤ p.
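A quick numerical illustration (my numbers, for intuition): take η = 1/2 and p = 0.1. Then P(A | clustered) ≥ 2(0.9)(0.5)(0.5) = 0.45 while P(A | not clustered) ≤ 0.1, so the empirical frequency of A over the m random test pairs concentrates near well-separated values in the two cases, and thresholding the vote (say at 0.275, halfway between the bounds) errs with probability exponentially small in m by a Chernoff bound.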
53 Sample Complexity of Robust Active Clustering (m test pairs per decision, seeded splits). Theorem 3: If outlier tests are inconsistent with probability p < 2η(1 − η)/(1 + 2η(1 − η)), then we correctly hierarchically cluster N objects (down to clusters of size log N) using O(N log^2 N) sequentially and adaptively selected pairwise similarities. The extra log N factor is due to voting over the test pairs.
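A rough accounting of where the bound comes from (my sketch, not the slides' proof): each of the O(log N) levels of the top-down recursion touches every object once, and deciding each object's side of a split takes m = O(log N) voted outlier tests at 3 similarities each, giving O(N log N) similarities per level and O(N log^2 N) overall.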
54 Robust Active Clustering Experiments. Synthetic experiments on a balanced binary tree with N = 512 objects. Table: resolvability (the smallest correctly recovered cluster size; smaller is better) as a function of the probability of inconsistency (5% and up), for agglomerative clustering, robust active clustering with 38% of similarities measured, and robust active clustering with 65% measured.
55 Robust Active Clustering Experiments. Gene microarray experiments, N = 512 genes: bottom-up agglomerative clustering versus top-down active clustering (using 43% of the pairwise similarities).
56 Summary. Randomly selected similarities: on the order of N^2/m, where m is the smallest cluster size. Adaptively selected similarities: with consistent similarities, all clusters of any size are recovered; with inconsistent similarities, all clusters of size > log N are recovered from O(N log^2 N) similarities. Reference: "Active Clustering: Robust and Efficient Hierarchical Clustering using Adaptively Selected Similarities," Artificial Intelligence and Statistics (AISTATS), 2011.
57 Ranking Based on Pairwise Comparisons (problem setup as on slides 4-5). Rank x_1, ..., x_{n-1} by their distances to the unknown point x_n ∈ R^d, asking as few comparisons of the form "is x_n closer to x_i than to x_j?" as possible; because the objects embed in R^d, many of the (n choose 2) comparisons are redundant, so a small subset may suffice.
58-63 Bartender: What's your favorite beer, sir?
Al: Hmm... actually I'm more of a wine man.
Bartender: Try these two samples. Do you prefer A or B?
Al: B.
Bartender: OK, try these two: C or D?...
64-67 Ranking Relative to Distance. Al's latent preferences are a point r in "beer space" (e.g., hoppiness, lightness, sweetness, ...), among beers A through G. Each position of r induces a ranking of the beers by distance to r: for one position the ranking is C < A < B < E < G < D < F; for another, E < B < F < G < C < A < D; for a third, D < G < C < E < A < B < F.
68 Goal: determine the ranking by asking comparisons like "is r closer to A or B?" Weakness of randomized schemes: if comparisons are selected at random, then almost all (n choose 2) comparisons are needed to rank.
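To see why random selection fails (a one-line argument consistent with the claim above): two rankings that differ only by swapping a single adjacent pair can be told apart only by the one comparison involving that pair, and a uniformly random query hits any specific pair with probability 1/(n choose 2), so pinning down the exact ranking essentially requires exhausting the comparisons.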
69-74 Ranking with Adaptively Selected Queries. Insert H into D < G < C < E < A < B < F by binary search over the list: first learn {H < E}, then {G < H}, then {H < C}, which pins down D < G < H < C < E < A < B < F. Binary insertion takes log_2 k comparisons to insert an item into a list of k objects, so n log_2 n comparisons suffice to rank n objects... but does the embedding dimension d affect the sample complexity?
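A minimal Python sketch of this binary-insertion ranking (an illustration; the helper name `closer` and the toy data are mine, with `closer` standing in for the comparison oracle):

```python
import numpy as np

def rank_by_insertion(items, closer):
    """Rank items by binary insertion; closer(i, j) answers the query
    "is the hidden reference closer to i than to j?". Uses about
    log2(k) comparisons per insertion, ~ n log2 n in total."""
    ranked = []
    for x in items:
        lo, hi = 0, len(ranked)
        while lo < hi:                 # binary search for x's slot
            mid = (lo + hi) // 2
            if closer(x, ranked[mid]):
                hi = mid               # x comes before ranked[mid]
            else:
                lo = mid + 1           # x comes after ranked[mid]
        ranked.insert(lo, x)
    return ranked

# Toy usage: points in R^2 ranked by distance to a reference r.
pts = np.random.randn(8, 2)
r = np.random.randn(2)
order = rank_by_insertion(
    list(range(8)),
    lambda i, j: np.linalg.norm(pts[i] - r) < np.linalg.norm(pts[j] - r))
```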
75-76 Ranking by Exploiting Low-Dimensional Geometry. Many comparisons are redundant because the objects embed in R^d, so it may be possible to rank correctly based on a small subset. The binary information we can gather: q_{i,j} = 1(x_n is closer to x_i than to x_j).
Sequential data selection. Input: x_1, ..., x_{n-1} ∈ R^d, and x_n at an unknown position in R^d. Initialize: arrange x_1, ..., x_{n-1} in uniformly random order. For k = 2, ..., n-1 and i = 1, ..., k-1: if q_{i,k} is ambiguous given {q_{i,j}}_{i,j<k}, ask for the pairwise comparison; else impute q_{i,k} from {q_{i,j}}_{i,j<k}. Output: a ranking of x_1, ..., x_{n-1} consistent with all pairwise comparisons.
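One concrete way to implement the ambiguity test (my sketch under the geometric picture developed on the next slides, not the authors' code): each answered comparison restricts x_n to a halfspace bounded by the bisecting hyperplane of x_i and x_j, and a new query is ambiguous exactly when both of its possible answers leave the constraint region nonempty, which is a pair of linear-programming feasibility checks:

```python
import numpy as np
from scipy.optimize import linprog

def halfspace(xi, xj):
    """Constraint (a, b) with a.r <= b encoding "r is closer to xi than xj".
    Derivation: |r-xi|^2 < |r-xj|^2  <=>  2(xj-xi).r < |xj|^2 - |xi|^2.
    (Strictness is dropped; the boundary has measure zero.)"""
    return 2 * (xj - xi), float(xj @ xj - xi @ xi)

def feasible(constraints, d):
    """Is the region {r : a.r <= b for all (a, b)} nonempty?
    Checked as an LP with a zero objective."""
    A = np.array([a for a, _ in constraints])
    b = np.array([bb for _, bb in constraints])
    res = linprog(np.zeros(d), A_ub=A, b_ub=b, bounds=[(None, None)] * d)
    return res.status == 0

def ambiguous(answered, xi, xj, d):
    """A query (xi, xj) is ambiguous iff both possible answers are
    consistent with the region carved out by previously answered queries."""
    a, b = halfspace(xi, xj)
    return (feasible(answered + [(a, b)], d) and
            feasible(answered + [(-a, -b)], d))
```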
77-78 Ranking and Geometry. Definition: the answers to previous queries induce a region of ambiguity (the set of positions still possible for x_n). Any query whose bisecting hyperplane intersects this region is ambiguous; otherwise it is unambiguous.
79-85 Counting argument: the bisecting hyperplanes of k points in R^d partition space into roughly k^{2d}/d! d-cells, of which a new hyperplane intersects only about k^{2(d-1)}/(d-1)! (Buck 1943, Coombs 1960, Cover 1965). Hence P(a query is ambiguous) ≈ d/k^2, so at stage k the expected number of ambiguous queries is ≈ d/k, and E[# requested] = Σ_{k=2}^{n} d/k ≈ d log n (Jamieson & Nowak 2011).
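For scale (illustrative numbers, constants suppressed): with n = 100 objects in d = 2 dimensions there are (100 choose 2) = 4950 possible comparisons; plain binary insertion asks about 100 log_2 100 ≈ 664 of them, while the geometric scheme requests only on the order of d log n ≈ 2 ln 100 ≈ 9, imputing the rest from the answers it already has.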
86 Summary: number of comparisons needed to rank n objects. Random selection: O(n^2). Sequential without geometry: O(n log n). Exploiting geometry: O(d log n). Noise-tolerant: O(d log^2 n). Reference: "Active Ranking using Pairwise Comparisons," NIPS 11. code:
87 Sonar Example. Sonar echo audio signals bounced off {50 targets, 50 rocks}; S_ij = human-judged similarity between signals i and j. Learning task: leave one signal out of the set and rank the other 99 using comparisons q_{i,j} = 1(S_{i,•} < S_{j,•}), where • is the held-out signal. Compute a d-dimensional embedding from the similarity matrix using MDS. Note that S_{i,•} < S_{j,•} does not necessarily imply ||x_i − r|| < ||x_j − r||, because the embedding is only approximate. Table: for embedding dimensions 2 and 3, the percentage of queries we requested, the best achievable average Kendall tau error d(y, ỹ), and our algorithm's error d(y, ŷ).