Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection
1 Scalable PDEs p.1/107. Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection. Dan Pelleg. Committee: Andrew Moore (chair), Manuela Veloso, Geoff Gordon, Nir Friedman (the Hebrew University).
2 Clustering Large Data Sets Quickly. [Scatter plot of vehicle data; axes: weight, length; legend: axles.]
3 Sloan Digital Sky Survey, as an example of data collection, storage, and sharing. Goal: map, in detail, one-quarter of the entire sky. 5 years to complete. 200 million objects in catalog. 25 TB raw data, 5 TB catalog data. Access over the web.
4 SkyServer. Supported activities on the SDSS SkyServer: browse; learn; search by coordinates; send SQL queries; APIs for direct integration.
5 Advancing SkyServers. Make it easier to ask the right question. Make it easier to understand the answer.
6 Requirements from next-generation data analysis tools: fast; comprehensible output; turn-key.
7 Focus on clustering. Very general, with lots of applications. In particular, mixture-model-based clustering.
8 Talk outline: K-means and X-means (fast spatial clustering); mixture of rectangles (a highly legible model); anomaly hunting (sub-linear component learner, active learner, user interface).
9 K-means. [Figure: data points in the plane.]
10 K-means. During the K-means algorithm, we maintain a set of centroids.
11 K-means. In every iteration, each data point is associated with its closest centroid.
12 K-means. At the end of an iteration, we move each centroid to the center of mass of all points associated with it.
13-15 K-means. [Animation frames: centroids move to the centers of mass of their assigned points over successive iterations.]
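The loop on the slides above can be sketched as plain Lloyd's algorithm. This is a minimal NumPy version for illustration; the random initialization and the `kmeans` name are my own, not the thesis code:

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest
    centroid, then move each centroid to the center of mass of the
    points associated with it."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    owner = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # distance from every point to every centroid
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        owner = d.argmin(axis=1)
        new = np.array([points[owner == j].mean(axis=0) if np.any(owner == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, owner
```

Each iteration does exactly the two slide steps: assignment to the closest centroid, then the center-of-mass update.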
16 Cost of K-means. Cost per iteration: #records × #centroids distance computations.
17-23 A kd-tree. [Animation frames: a point set is recursively split by axis-aligned cuts into a binary tree of bounding rectangles.]
24 A kd-tree. A binary tree to store data points. Each node stores statistics about all points contained in it. Not the only structure meeting these conditions.
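A minimal sketch of such a structure, assuming the cached per-node statistics are the point count and the vector sum (enough for the center-of-mass trick used later); the class and field names are illustrative, not from the original code:

```python
import numpy as np

class KDNode:
    """Axis-aligned kd-tree node caching summary statistics
    (point count and vector sum) for all points below it."""
    def __init__(self, points, depth=0, leaf_size=8):
        self.count = len(points)
        self.vector_sum = points.sum(axis=0)
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)  # bounding box
        self.left = self.right = None
        if len(points) > leaf_size:
            axis = depth % points.shape[1]      # cycle through dimensions
            order = points[:, axis].argsort()
            mid = len(points) // 2              # split at the median
            self.left = KDNode(points[order[:mid]], depth + 1, leaf_size)
            self.right = KDNode(points[order[mid:]], depth + 1, leaf_size)
```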
25 K-means. [Figure: centroids overlaid on the kd-tree decomposition.]
26 Center-of-mass calculation. Suppose Q is the set of all points that belong to some centroid C. The new position of C is: C ← (1/|Q|) Σ_{x∈Q} x. Let {Q_p} be a partition of Q. Then we can write the new position as: C ← (1/|Q|) Σ_p Σ_{x∈Q_p} x. This helps if the sums of each Q_p are known. They are known for kd-nodes.
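Given nodes that cache (count, vector-sum) pairs, the partitioned center-of-mass formula is one line. A sketch with a hypothetical `parts` list of per-node statistics:

```python
import numpy as np

def centroid_from_parts(parts):
    """New centroid position from a partition {Q_p} of the owned points,
    using only each part's cached (count, vector_sum); no individual
    point is touched. `parts` is a list of (count, sum_vector) pairs."""
    total = sum(c for c, _ in parts)
    return sum(s for _, s in parts) / total
```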
27 K-means. [Figure: centroid ownership of kd-nodes.]
28-30 A kd-node owned by a centroid. The boundary line between centroids G and R does not intersect the rectangle H. The point in H which is closest to R is on the same side of the boundary as G. Scanning every point in the node is not needed.
31-35 A kd-node not owned by a centroid. The boundary line between centroids G and R does intersect the rectangle H. The point in H which is closest to G is not on the same side of the boundary as R. We try our luck with the child rectangles.
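One standard way to implement the owned/not-owned test is the bisector check sketched below: a box is entirely owned by centroid g against rival r if even the box corner most favorable to r is still closer to g. This is a common formulation of the pruning test, not necessarily the exact code from the thesis:

```python
import numpy as np

def g_dominates(lo, hi, g, r):
    """True if every point of the box [lo, hi] is at least as close to
    centroid g as to rival r. Since (r - g) . x is linear in x, it
    suffices to test the box corner extreme in the direction r - g,
    i.e. the corner most favorable to r."""
    p = np.where(r > g, hi, lo)
    return np.linalg.norm(p - g) <= np.linalg.norm(p - r)
```

If the test passes for every rival, the whole node can be credited to g using its cached statistics; otherwise we recurse into the children, as on the slides.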
36 Run time. [Plot: time vs. number of points, 2-D data (gpetro), for 5, 50, and 500 clusters.]
37 K-means: summary. Popular and trusted statistical method. Very fast algorithm; exact, not an approximation. Not restricted to kd-trees. Still requires K from the user.
38 X-means. [Title slide.]
39 X-means. The number of clusters K is not always known in advance. Estimate it from data: measure the goodness of fit, and penalize complex models. Do this on a local scale.
40 Local Splits. Start with a small value for K. Run K-means to convergence.
41 This defines regions of points which belong to a specific class.
42 In each region, run 2-means independently.
43 [Figure: each region split into two tentative children.]
44 For each region, compute the contribution of splitting the class in two: BIC(k=1)=2471 vs. BIC(k=2)=3088; BIC(k=1)=2018 vs. BIC(k=2)=1859; BIC(k=1)=1935 vs. BIC(k=2)=1784.
45 Commit the split only if the score goes up.
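A simplified sketch of the split decision, using a single spherical Gaussian per region and the standard BIC penalty; the actual X-means scoring function differs in detail:

```python
import numpy as np

def gauss_loglik(points):
    """Log-likelihood of points under one spherical Gaussian (MLE fit)."""
    n, d = points.shape
    var = max(((points - points.mean(axis=0)) ** 2).sum() / (n * d), 1e-12)
    return -0.5 * n * d * (np.log(2 * np.pi * var) + 1)

def should_split(left, right):
    """Commit a local 2-means split only if the two-component BIC
    beats the one-component BIC (stand-in for the X-means score)."""
    parent = np.vstack([left, right])
    n, d = parent.shape
    bic1 = gauss_loglik(parent) - 0.5 * (d + 1) * np.log(n)
    loglik2 = (gauss_loglik(left) + gauss_loglik(right)
               + len(left) * np.log(len(left) / n)     # mixing weights
               + len(right) * np.log(len(right) / n))
    bic2 = loglik2 - 0.5 * (2 * (d + 1) + 1) * np.log(n)
    return bic2 > bic1
```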
46 X-means: summary. Can accurately estimate the K in K-means. Naturally fits in the fast K-means framework. In a single step, chooses between 2^K options. Better, and faster, than looping over K.
47 K-means and X-means package. Code released in late 2000. Over 200 licenses granted. Users in: bioinformatics; music information retrieval; computer hardware and software analysis; many more. X-means scoring function independently analyzed and improved (Hamerly et al. 2003).
48-49 K-means and X-means users. [Figures: user affiliations.]
50 Mixtures of Rectangles.
51 Gaussian clusters. Domain: credit card approval. Take the following vector: [(AGE − 18)², (taxrate − 6)², (income − 10000)², (edunum − 8)²] and compute its dot product with the element-wise reciprocals 1/[4.9, 0.3, 730, 209]. If the result is small enough, approve.
52 My approach. If 18 ≤ AGE ≤ 46 and 5 ≤ taxrate ≤ 7, then approve.
53 2-D PDF. [Figure: a two-dimensional density modeled as a mixture of rectangles.]
54 Mixture of Dependency Trees.
55 Motivation. Given a data set, we want to understand it. A Bayes net fits the bill, but is expensive to find. Compromise: look for a simpler structure, a dependency tree.
56 [Figure: an example network over the variables Burglar, Thunder, Barking, Phone Call, Alarm.] Conditional densities are linear Gaussians: P(A | B) = N(c_A + m_A·b, σ_A²).
57-60 The Chow-Liu algorithm. [Figure: a data matrix with attributes A_1 ... A_M and records X_1 ... X_R.] Compute the mutual information I(i; j) between every pair of attributes (e.g. I(1; 4), I(1; 5), I(1; 3), I(4; 3)), weight the complete graph over the attributes with these values, and take its MST. Total cost: O(RM²) + cost of the MST algorithm.
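A small self-contained sketch of the algorithm for discrete data: fill in the O(RM²) pairwise mutual-information weights, then take the maximum-weight spanning tree (here with Kruskal and union-find; the function names are my own):

```python
import numpy as np
from itertools import combinations

def mutual_info(x, y):
    """Empirical mutual information between two discrete columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            px, py = np.mean(x == a), np.mean(y == b)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def chow_liu(data):
    """Chow-Liu dependency tree: maximum-weight spanning tree of the
    complete attribute graph, weighted by pairwise mutual information."""
    m = data.shape[1]
    edges = sorted(((mutual_info(data[:, i], data[:, j]), i, j)
                    for i, j in combinations(range(m), 2)), reverse=True)
    parent = list(range(m))            # union-find forest
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:              # Kruskal, heaviest MI first
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```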
61 MST: using the blue-edge rule. Given a cut, the lightest edge across it must be part of the MST.
62 MST: using the red-edge rule. Given a cycle, the heaviest edge in it must not be part of the MST.
63 Idea: repeatedly use the red-edge rule. Stop when all we have left is a tree; this tree must be the MST. Tarjan: bad idea.
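The repeated red-edge rule corresponds to the reverse-delete algorithm: scan edges heaviest-first and delete any edge that lies on a cycle. A small sketch, with edges as (weight, u, v) triples and a naive connectivity check (so this is the slow version the slide warns about):

```python
def mst_red_rule(n, edges):
    """Reverse-delete: visit edges heaviest-first; an edge whose removal
    keeps the graph connected lies on a cycle, so the red-edge rule
    deletes it. What remains is the MST."""
    kept = set(range(len(edges)))
    order = sorted(kept, key=lambda k: edges[k][0], reverse=True)
    def connected(active):
        adj = {v: [] for v in range(n)}
        for k in active:
            _, u, v = edges[k]
            adj[u].append(v)
            adj[v].append(u)
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n
    for k in order:
        if connected(kept - {k}):
            kept.discard(k)            # heaviest edge on some cycle
    return [edges[k] for k in kept]
```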
64-80 Walkthrough. [Animation frames over a weighted graph: edges are marked as tree edges or non-tree edges; for each non-tree edge we ask "Can I eliminate this edge?" and, when the red-edge rule applies, mark it as an eliminated edge.]
81 Saving Work. We want to avoid scanning the full data set for a given edge. Scan just a sample, and derive a confidence interval using the CLT or Hoeffding bounds. Now we need to deal with intervals instead of point estimates.
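A sketch of the Hoeffding-bound interval for the mean of a bounded quantity estimated from n samples (parameter names are my own):

```python
import math

def hoeffding_interval(sample_mean, n, value_range, delta=0.05):
    """Hoeffding bound: with probability >= 1 - delta, the true mean of
    a quantity bounded in an interval of width `value_range` lies within
    +/- eps of the sample mean, eps = range * sqrt(ln(2/delta) / (2n)).
    No distributional assumptions, unlike the CLT interval."""
    eps = value_range * math.sqrt(math.log(2 / delta) / (2 * n))
    return sample_mean - eps, sample_mean + eps
```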
82-84 Comparing intervals. We compare two interval estimates [a, b] and [c, d]. Case 1: [a, b] lies entirely below [c, d]. If this happens, we save work.
85-87 Case 2: [c, d] lies entirely below [a, b]. Another lucky occurrence.
88-92 Case 3: the intervals overlap (a ≤ c ≤ b ≤ d). We have two options: work harder, or procrastinate.
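The three cases reduce to a tiny decision function over the two intervals (the return labels are my own):

```python
def compare_intervals(ab, cd):
    """Decide which of two interval-valued estimates is smaller.
    Returns 'first', 'second', or 'overlap' (meaning: work harder by
    sampling more, or procrastinate and defer the edge)."""
    a, b = ab
    c, d = cd
    if b <= c:
        return 'first'    # [a, b] entirely below [c, d]: decided
    if d <= a:
        return 'second'   # the symmetric lucky case
    return 'overlap'      # cannot decide yet
```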
93 So far we assumed that we can always eliminate an edge in the cycle. In fact, this is not necessary.
94-100 Walkthrough, alternative scenario. [Animation frames.] Not enough information to eliminate; leave the edge for later. Later, we examine this edge again: the tree path has changed, and now we can eliminate!
101-104 Experimental Results. How much work does it save? [Plot: cells scanned per edge vs. number of records, up into the millions.] Most of it.
105-107 Experimental Results. Does it scale with the number of attributes? [Plot: running time vs. number of attributes.] Yes!
108 Experimental Results. How good are the generated trees?
109-112 Evaluation. Compared methods: the exhaustive algorithm; my algorithm; a 35% subsample; an informed subsample.
113-115 Experimental Results. How good are the generated trees? [Plot: relative log-likelihood vs. number of records.] Better than those obtained by uniformly using the same fraction of data.
116-118 Experimental Results. Does it work for real data?

NAME             TYPE  DATA USAGE
CENSUS-HOUSE     N     1.0%
COLORHISTOGRAM   N     0.5%
COOCTEXTURE      N     4.6%
ABALONE          N     21.0%
COLORMOMENTS     N     0.6%
CENSUS-INCOME    C     0.05%
COIL             C     0.9%
IPUMS            C     0.06%
KDDCUP           C     0.02%
LETTER           N     1.5%
COVTYPE          C     0.009%
PHOTOZ           N     0.008%

Better 7/12 times, worse 4/12, one tie.
119 Anomaly Hunting.
120 Anomaly Hunting. We want to sift a large data set for its strangest objects. First attempt: build a statistical model from the data, and flag whatever does not fit it well.
121 Boring Anomalies. [Figure: examples of uninteresting flagged objects.]
122-130 The Oracle Framework. [Animation: the loop builds up step by step.] Take a random set of records; ask an expert to classify them; build a model from the data and labels; run all the data through the model; spot "important" records; hand those back to the expert, and repeat.
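The loop can be sketched generically, with the expert, the model fitter, and the importance score as plug-in callables. All names and interfaces here are hypothetical, not the thesis code:

```python
import random

def oracle_loop(records, oracle, fit, score, rounds=3, batch=5, seed=0):
    """Sketch of the oracle framework: label a random seed set, fit a
    model, score all records, then send the most 'important' (here:
    highest-scoring unlabeled) records back to the expert."""
    rng = random.Random(seed)
    labeled = {i: oracle(records[i])
               for i in rng.sample(range(len(records)), batch)}
    model = None
    for _ in range(rounds):
        model = fit([(records[i], y) for i, y in labeled.items()])
        unlabeled = [i for i in range(len(records)) if i not in labeled]
        unlabeled.sort(key=lambda i: score(model, records[i]), reverse=True)
        for i in unlabeled[:batch]:          # spot "important" records...
            labeled[i] = oracle(records[i])  # ...and ask the expert
    return model, labeled
```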
131 Anomaly Hunting. [Live demo: run the GUI.]
132 Interesting Anomalies. [Figure: examples of genuinely interesting flagged objects.]
133 Contributions. Fast K-means implementation [KDD99]. Extension to X-means [ICML00]. Widely used and cited [HPL124, NIPS03, ICME01, ASPL02, IEEE01]. Novel mixture model for comprehensibility [ICML01]. Probably-approximately-correct approach for dependency trees [NIPS02]. Active learning framework for general mixtures. User-centered anomaly hunting process [GLC03].
135 Why scientific? Assumptions on the data: mostly real-valued; not sparse; no or very few labels.
136 Thesis Statement. We can efficiently perform clustering on very large data sets.
More informationData Mining Techniques for Massive Spatial Databases. Daniel B. Neill Andrew Moore Ting Liu
Data Mining Techniques for Massive Spatial Databases Daniel B. Neill Andrew Moore Ting Liu What is data mining? Finding relevant patterns in data Datasets are often huge and highdimensional, e.g. astrophysical
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationSpyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems
Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems Andrew W Leung Ethan L Miller University of California, Santa Cruz Minglong Shao Timothy Bisson Shankar Pasupathy NetApp 7th USENIX
More informationPASCAL. A Parallel Algorithmic SCALable Framework for N-body Problems. Laleh Aghababaie Beni, Aparna Chandramowlishwaran. Euro-Par 2017.
PASCAL A Parallel Algorithmic SCALable Framework for N-body Problems Laleh Aghababaie Beni, Aparna Chandramowlishwaran Euro-Par 2017 Outline Introduction PASCAL Framework Space Partitioning Trees Tree
More informationParallel Physically Based Path-tracing and Shading Part 3 of 2. CIS565 Fall 2012 University of Pennsylvania by Yining Karl Li
Parallel Physically Based Path-tracing and Shading Part 3 of 2 CIS565 Fall 202 University of Pennsylvania by Yining Karl Li Jim Scott 2009 Spatial cceleration Structures: KD-Trees *Some portions of these
More informationNearest Neighbors Classifiers
Nearest Neighbors Classifiers Raúl Rojas Freie Universität Berlin July 2014 In pattern recognition we want to analyze data sets of many different types (pictures, vectors of health symptoms, audio streams,
More informationRecommender Systems New Approaches with Netflix Dataset
Recommender Systems New Approaches with Netflix Dataset Robert Bell Yehuda Koren AT&T Labs ICDM 2007 Presented by Matt Rodriguez Outline Overview of Recommender System Approaches which are Content based
More informationIntroduction to Machine Learning. Xiaojin Zhu
Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford
More informationCS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample
Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups
More informationDensity estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate
Density estimation In density estimation problems, we are given a random sample from an unknown density Our objective is to estimate? Applications Classification If we estimate the density for each class,
More informationChapter 5: Outlier Detection
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 5: Outlier Detection Lecture: Prof. Dr.
More informationINF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22
INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task
More informationIntroduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering
Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical
More informationClustering Lecture 3: Hierarchical Methods
Clustering Lecture 3: Hierarchical Methods Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced
More informationOracle9i Data Mining. Data Sheet August 2002
Oracle9i Data Mining Data Sheet August 2002 Oracle9i Data Mining enables companies to build integrated business intelligence applications. Using data mining functionality embedded in the Oracle9i Database,
More informationChapter 4: Text Clustering
4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can
More informationImage Segmentation. Shengnan Wang
Image Segmentation Shengnan Wang shengnan@cs.wisc.edu Contents I. Introduction to Segmentation II. Mean Shift Theory 1. What is Mean Shift? 2. Density Estimation Methods 3. Deriving the Mean Shift 4. Mean
More informationCLUSTERING. JELENA JOVANOVIĆ Web:
CLUSTERING JELENA JOVANOVIĆ Email: jeljov@gmail.com Web: http://jelenajovanovic.net OUTLINE What is clustering? Application domains K-Means clustering Understanding it through an example The K-Means algorithm
More informationColorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science.
Professor William Hoff Dept of Electrical Engineering &Computer Science http://inside.mines.edu/~whoff/ 1 Image Segmentation Some material for these slides comes from https://www.csd.uwo.ca/courses/cs4487a/
More informationk-means demo Administrative Machine learning: Unsupervised learning" Assignment 5 out
Machine learning: Unsupervised learning" David Kauchak cs Spring 0 adapted from: http://www.stanford.edu/class/cs76/handouts/lecture7-clustering.ppt http://www.youtube.com/watch?v=or_-y-eilqo Administrative
More informationSpatial Data Management
Spatial Data Management [R&G] Chapter 28 CS432 1 Types of Spatial Data Point Data Points in a multidimensional space E.g., Raster data such as satellite imagery, where each pixel stores a measured value
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised
More informationECE 5424: Introduction to Machine Learning
ECE 5424: Introduction to Machine Learning Topics: Unsupervised Learning: Kmeans, GMM, EM Readings: Barber 20.1-20.3 Stefan Lee Virginia Tech Tasks Supervised Learning x Classification y Discrete x Regression
More informationHierarchical Clustering
Hierarchical Clustering Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree-like diagram that records the sequences of merges
More informationClustering. Shishir K. Shah
Clustering Shishir K. Shah Acknowledgement: Notes by Profs. M. Pollefeys, R. Jin, B. Liu, Y. Ukrainitz, B. Sarel, D. Forsyth, M. Shah, K. Grauman, and S. K. Shah Clustering l Clustering is a technique
More informationMachine Learning for Signal Processing Clustering. Bhiksha Raj Class Oct 2016
Machine Learning for Signal Processing Clustering Bhiksha Raj Class 11. 13 Oct 2016 1 Statistical Modelling and Latent Structure Much of statistical modelling attempts to identify latent structure in the
More informationCS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 10: Learning with Partially Observed Data Theo Rekatsinas 1 Partially Observed GMs Speech recognition 2 Partially Observed GMs Evolution 3 Partially Observed
More informationLecture 12 Recognition. Davide Scaramuzza
Lecture 12 Recognition Davide Scaramuzza Oral exam dates UZH January 19-20 ETH 30.01 to 9.02 2017 (schedule handled by ETH) Exam location Davide Scaramuzza s office: Andreasstrasse 15, 2.10, 8050 Zurich
More informationSpatial Data Management
Spatial Data Management Chapter 28 Database management Systems, 3ed, R. Ramakrishnan and J. Gehrke 1 Types of Spatial Data Point Data Points in a multidimensional space E.g., Raster data such as satellite
More informationGaussian Mixture Models For Clustering Data. Soft Clustering and the EM Algorithm
Gaussian Mixture Models For Clustering Data Soft Clustering and the EM Algorithm K-Means Clustering Input: Observations: xx ii R dd ii {1,., NN} Number of Clusters: kk Output: Cluster Assignments. Cluster
More information