Scalable Bayes Clustering for Outlier Detection Under Informative Sampling

1 Scalable Bayes Clustering for Outlier Detection Under Informative Sampling
Based on the JMLR paper of T. D. Savitsky.
Terrance D. Savitsky, Office of Survey Methods Research.
FCSM, March 7-9.

2 Motivating Dataset
Monthly survey of U.S. business establishments.
Single-stage, fixed-size, stratified sampling design.
Strata-indexed inclusion probabilities are assigned by employment size (probability proportional to size, pps).
Establishments report employment changes in a $d \times 1$ vector ($d = 4$), $x$ = (Employment, Production Workers, Payroll, Weekly Hours).
The size variable is total employment, $z$, which is correlated with $x$.
7-day turnaround between submissions and publication.
Which establishment submissions contain reporting errors?

3 Estimating Population Model Under Informative Sample
Finite population $U = (1, \ldots, N)$ with data $X_U = (X_1, \ldots, X_N) \sim P_\theta$.
We don't fully observe the finite population.
Draw a sample, $S = (1, \ldots, n \le N)$.
Inclusion probabilities, $P(\delta_i = 1) := \pi_i$, are correlated with $X_U$, so $P_\theta(X_S) \ne P_\theta(X_U)$.
We want to estimate outliers from $P_\theta(X_U)$ using $X_S$.
Use sampling weights $w_i \propto 1/\pi_i$.
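The effect of an informative sample, and of the correction $w_i \propto 1/\pi_i$, can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the log-normal population, the inclusion rule `0.02 + 0.05 * x`, and independent (Poisson-type) selection are all hypothetical choices made here for brevity, whereas the survey in the slides uses a fixed-size stratified design. The function names are likewise illustrative.

```python
import random

def inverse_probability_weights(pi):
    """Return w_i = 1 / pi_i for inclusion probabilities pi_i in (0, 1]."""
    if any(p <= 0.0 or p > 1.0 for p in pi):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    return [1.0 / p for p in pi]

def weighted_mean(x, w):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

rng = random.Random(0)

# Hypothetical skewed finite population of N units.
N = 20000
x_pop = [rng.lognormvariate(0.0, 1.0) for _ in range(N)]

# Assumed informative design: larger units are more likely to be sampled.
pi_pop = [min(1.0, 0.02 + 0.05 * xi) for xi in x_pop]

# Independent selection of each unit (a simplification of the real design).
sampled = [i for i in range(N) if rng.random() < pi_pop[i]]
x_s = [x_pop[i] for i in sampled]
w_s = inverse_probability_weights([pi_pop[i] for i in sampled])

pop_mean = sum(x_pop) / N
naive_mean = sum(x_s) / len(x_s)       # ignores the design
pseudo_mean = weighted_mean(x_s, w_s)  # uses w_i = 1 / pi_i

print(pop_mean, naive_mean, pseudo_mean)
```

Under these assumptions the unweighted sample mean over-represents large units, while the weighted mean is much closer to the population value.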

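4 Mixture / Cluster Model for Outlier Detection
Mixture of Gaussians.
$s_i \in (1, \ldots, K_{\max})$ indexes cluster memberships for $i \in (1, \ldots, n)$.
$(\tau_1, \ldots, \tau_{K_{\max}})$ are the cluster assignment probabilities.
$\alpha$ governs the number of $\tau_p > 0$; the prior becomes a Dirichlet process mixing measure in the limit $K_{\max} \to \infty$.
$$\underset{d \times 1}{x_i} \mid s_i, M = (\mu_1, \ldots, \mu_{K_{\max}}), \sigma^2, w_i \overset{\text{ind}}{\sim} N_d\left(\mu_{s_i}, \sigma^2 I_d\right)^{w_i}$$
$$s_i \mid \tau \overset{\text{iid}}{\sim} \mathcal{M}\left(1, \tau_1, \ldots, \tau_{K_{\max}}\right)$$
$$\mu_p \mid G_0 \overset{\text{iid}}{\sim} G_0 := N_d\left(0, \rho^2 I_d\right)$$
$$\tau_1, \ldots, \tau_{K_{\max}} \sim \mathcal{D}\left(\alpha / K_{\max}, \ldots, \alpha / K_{\max}\right)$$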

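5 Sampling-weighted Pseudo Posterior
Pseudo posterior $\propto$ weighted likelihood $\times$ priors.
Marginalize out $\tau$ from the joint prior, $f(s, \tau \mid \alpha) = f(s \mid \tau) f(\tau \mid \alpha)$.
$M = (\mu_1, \ldots, \mu_K)$ is the $K \times d$ matrix of cluster centers.
$n_p = \sum_{i=1}^{n} 1(s_i = p)$ is the number of establishments assigned to cluster $p$.
$$f(s, M \mid X, w) \propto f(X, s, M \mid w) = \prod_{p=1}^{K} \prod_{i: s_i = p} N_d\left(x_i \mid \mu_p, \sigma^2 I_d\right)^{w_i} \times \alpha^{K} \frac{\Gamma(\alpha + 1)}{\Gamma(\alpha + n)} \prod_{p=1}^{K} (n_p - 1)! \times \prod_{p=1}^{K} N_d\left(\mu_p \mid 0, \rho^2 I_d\right).$$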

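6 Approximate MAP as $\sigma^2 \to 0$
If $\alpha$ is held fixed, each observation is assigned to its own cluster as $\sigma^2 \to 0$.
Define a constant $\lambda$ and set $\alpha = \exp\left(-\lambda / (2\sigma^2)\right)$.
This produces $\alpha \to 0$ as $\sigma^2 \to 0$.
The hyperparameter $\lambda$ controls the size of the partition as $\sigma^2 \to 0$.
$$-2\sigma^2 \log f(X, s, M \mid w) = \sum_{p=1}^{K} \sum_{i: s_i = p} \left[ 2\sigma^2 O\left(\log \sigma^2\right) + w_i \lVert x_i - \mu_p \rVert^2 \right] + K\lambda + 2\sigma^2 O(1) + 2\sigma^2 O(1).$$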

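7 Approximate MAP Optimization
$$\underset{K, s, M}{\operatorname{argmin}} \; \sum_{p=1}^{K} \sum_{i: s_i = p} w_i \lVert x_i - \mu_p \rVert^2 + K\lambda$$
This gives a Bayesian motivation for K-means clustering.
A higher value for $\lambda$ reduces the number of estimated clusters.
The goal is to minimize this energy expression.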
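One way to minimize an energy of this form is a weighted, penalized K-means loop in the style of DP-means. The sketch below is a minimal illustration, not the implementation behind the slides: it assumes a single starting cluster at the weighted mean, opens a new cluster at $x_i$ whenever the weighted squared distance to every current center exceeds $\lambda$, and recomputes each center as the weighted mean of its members. All function names and the toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def weighted_center(points, weights):
    """Weighted mean of a non-empty list of vectors."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[k] for p, w in zip(points, weights)) / total
            for k in range(dim)]

def energy(X, w, s, M, lam):
    """Weighted within-cluster cost plus lam times the number of clusters."""
    cost = sum(wi * sq_dist(xi, M[si]) for xi, wi, si in zip(X, w, s))
    return cost + lam * len(M)

def weighted_dp_means(X, w, lam, max_iter=50):
    """Alternate assignment and center updates until assignments stop changing."""
    M = [weighted_center(X, w)]  # assumed start: one global cluster
    s = [0] * len(X)
    for _ in range(max_iter):
        previous = list(s)
        for i, (xi, wi) in enumerate(zip(X, w)):
            costs = [wi * sq_dist(xi, mu) for mu in M]
            p = min(range(len(M)), key=costs.__getitem__)
            if costs[p] > lam:       # cheaper to pay lam for a new cluster
                M.append(list(xi))
                p = len(M) - 1
            s[i] = p
        # Drop empty clusters, relabel to 0..K-1, and recompute centers.
        used = sorted(set(s))
        relabel = {old: new for new, old in enumerate(used)}
        s = [relabel[si] for si in s]
        M = [weighted_center([x for x, si in zip(X, s) if si == p],
                             [wi for wi, si in zip(w, s) if si == p])
             for p in range(len(used))]
        if s == previous:
            break
    return s, M

# Toy data: two tight groups and one distant point.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
     [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
     [20.0, 20.0]]
w = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 1.0]
lam = 4.0

s, M = weighted_dp_means(X, w, lam)
one_cluster_energy = energy(X, w, [0] * len(X), [weighted_center(X, w)], lam)
fitted_energy = energy(X, w, s, M, lam)
print(s, len(M), fitted_energy)
```

In this toy example the distant point ends up alone in its own small cluster, which is the pattern an outlier-nomination rule would look for.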

8 Add Merge Step to Algorithm
Test all pairs of clusters and merge those that reduce the energy.
Collapse 2 clusters by assigning the establishments from both to a single cluster.
Recompute the cluster center, $\mu_p$.
Merging encourages fewer clusters, which supports outlier detection.
Merging reduces sensitivity to initial values.
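A merge step of this kind can be sketched as follows. This is an assumption-laden illustration rather than the algorithm as implemented by the author: it scores every pair of clusters by the change in energy (the merged weighted within-cluster cost, minus the two separate costs, minus one penalty $\lambda$ saved), applies the single best merge, and repeats until no pair lowers the energy. The greedy best-pair-first order and the helper names are choices made here.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def weighted_center(points, weights):
    """Weighted mean of a non-empty list of vectors."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[k] for p, w in zip(points, weights)) / total
            for k in range(dim)]

def within_cost(X, w, idx):
    """Weighted squared distance of members idx to their weighted center."""
    pts = [X[i] for i in idx]
    wts = [w[i] for i in idx]
    mu = weighted_center(pts, wts)
    return sum(wi * sq_dist(p, mu) for p, wi in zip(pts, wts))

def merge_step(X, w, s, lam):
    """Greedily merge cluster pairs while a merge reduces the energy."""
    s = list(s)
    while True:
        groups = {}
        for i, si in enumerate(s):
            groups.setdefault(si, []).append(i)
        labels = sorted(groups)
        best = None
        for pos, a in enumerate(labels):
            for b in labels[pos + 1:]:
                # Energy change if a and b are merged: one fewer lam penalty,
                # but a (weakly) larger within-cluster cost.
                delta = (within_cost(X, w, groups[a] + groups[b])
                         - within_cost(X, w, groups[a])
                         - within_cost(X, w, groups[b])
                         - lam)
                if delta < 0 and (best is None or delta < best[0]):
                    best = (delta, a, b)
        if best is None:
            return s
        _, a, b = best
        s = [a if si == b else si for si in s]  # collapse b into a

# Toy data: clusters 0 and 1 are nearly identical; cluster 2 is far away.
X = [[0.0, 0.0], [0.2, 0.0], [0.4, 0.0], [0.6, 0.0], [10.0, 0.0]]
w = [1.0] * 5
merged = merge_step(X, w, [0, 0, 1, 1, 2], lam=1.0)
print(merged)
```

Here the two neighboring clusters collapse into one because the saved penalty outweighs the added within-cluster cost, while the distant point stays separate.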

9 Weighted Hierarchical Clustering - Set-up
Establishments, $i = 1, \ldots, n$, are binned into $j = 1, \ldots, J$ industry groups.
Estimate a local clustering of up to $L_{\max}$ possible clusters in each industry, $j$.
Each local cluster, $c$, in industry, $j$, is connected to a global cluster center, $\mu_p$, for $p \in (1, \ldots, K_{\max})$ possible global clusters.
Local clusters across industries may share a common global cluster.
$s_i^j$ is the global cluster assignment for establishment, $i$, in industry, $j$.

10 Hierarchical Clustering Optimization
$$\underset{K, s, M}{\operatorname{argmin}} \; \sum_{p=1}^{K} \sum_{j=1}^{J} \sum_{i: s_i^j = p} w_i^j \lVert x_i^j - \mu_p \rVert^2 + K\lambda_K + L\lambda_L$$
$L = \sum_{j=1}^{J} L_j$ denotes the total number of local clusters.
$L_j$ denotes the number of local clusters estimated for data set, $j = 1, \ldots, J$.
$K$ denotes the number of estimated global clusters.
$\lambda_K$ denotes the penalty on the number of global clusters estimated.
$\lambda_L$ denotes the penalty on the number of local clusters estimated.
$w_i^j$ is the sampling weight for establishment, $i$, in industry, $j$.
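To make the two penalties concrete, the hierarchical energy can be evaluated for a given clustering. The sketch below assumes a particular data layout that is not specified in the slides: each industry is a dictionary with observations `X`, weights `w`, local cluster labels `local`, and a list `link` that maps each local cluster to a global cluster index. The number of local clusters $L_j$ is taken to be `len(link)`, and $K$ is the number of rows in `M`.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def hierarchical_energy(industries, M, lam_K, lam_L):
    """Weighted cost to linked global centers + K * lam_K + L * lam_L."""
    cost = 0.0
    total_local = 0
    for ind in industries:
        total_local += len(ind["link"])            # L_j local clusters
        for xi, wi, c in zip(ind["X"], ind["w"], ind["local"]):
            p = ind["link"][c]                     # global assignment s_i^j
            cost += wi * sq_dist(xi, M[p])
    return cost + len(M) * lam_K + total_local * lam_L

# Two hypothetical industries sharing K = 2 global centers.
M = [[0.0], [4.0]]
industries = [
    {"X": [[0.0], [1.0]], "w": [1.0, 2.0],
     "local": [0, 0], "link": [0]},               # one local cluster
    {"X": [[0.0], [4.0], [5.0]], "w": [1.0, 1.0, 3.0],
     "local": [0, 1, 1], "link": [0, 1]},         # two local clusters
]

value = hierarchical_energy(industries, M, lam_K=10.0, lam_L=1.0)
print(value)
```

In this example the weighted cost is 2 + 3 = 5, the global penalty is 2 × 10, and the local penalty is 3 × 1, so the two penalty terms dominate the fit term.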

11 Selecting Penalty Parameters, $(\lambda_K, \lambda_L)$
Synthetic data with $L_j = 5$ local clusters for each of $j = 1, \ldots, (J = 3)$ industries, sharing $K = 7$ global clusters.
$X_j$ is $N_j \times d$ ($d = 15$), with $(N_j = 15000, n_j = 2500)$ establishments in the (population / sample).
Establishments are randomly allocated to the $L_j = 5$ local clusters in a skewed distribution, (0.6, 0.25, 0.1, 0.025, 0.025).
Evenly divide the data into training and test sets.
Estimate the clustering on the training data and compute the energy on the test data.

12 Energy steadily decreases with lower $(\lambda_K, \lambda_L)$
Estimate the clustering on the training data and compute the energy on the test data.
[Figure: Energy - Test Set, plotted against Lambda_l and Lambda_g]

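13 Use Calinski-Harabasz (C) criterion
Cohesion within each cluster, WGSS:
$$\text{WGSS} = \sum_{p=1}^{K} \sum_{i: s_i = p} w_i \lVert x_i - \mu_p \rVert^2$$
Separation between clusters, BGSS:
$$\text{BGSS} = \sum_{p=1}^{K} n_p \lVert \mu_p - \mu_G \rVert^2, \qquad \mu_G = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
$$C = \frac{n - K}{K - 1} \cdot \frac{\text{BGSS}}{\text{WGSS}}$$
$K$ is the number of global clusters.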
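The criterion is straightforward to compute once a clustering is available. The sketch below follows the definitions on this slide (weighted WGSS, BGSS with cluster counts $n_p$, and a weighted grand mean $\mu_G$); the function name, argument layout, and toy data are illustrative assumptions, and the cluster centers `M` are passed in rather than recomputed.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def calinski_harabasz(X, w, s, M):
    """C = (n - K) / (K - 1) * BGSS / WGSS with sampling-weighted WGSS."""
    n, K = len(X), len(M)
    if K < 2 or n <= K:
        raise ValueError("need at least 2 clusters and more points than clusters")
    dim = len(X[0])
    total_w = sum(w)
    # Weighted grand mean mu_G.
    mu_G = [sum(wi * xi[k] for xi, wi in zip(X, w)) / total_w
            for k in range(dim)]
    # Cohesion: weighted squared distances to the assigned centers.
    wgss = sum(wi * sq_dist(xi, M[si]) for xi, wi, si in zip(X, w, s))
    if wgss <= 0.0:
        raise ValueError("WGSS must be positive")
    # Separation: cluster counts n_p times squared distance to mu_G.
    counts = [sum(1 for si in s if si == p) for p in range(K)]
    bgss = sum(counts[p] * sq_dist(M[p], mu_G) for p in range(K))
    return (n - K) / (K - 1) * bgss / wgss

# Toy example: two clusters on a line, with unequal sampling weights.
X = [[0.0], [2.0], [10.0], [12.0]]
w = [2.0, 2.0, 1.0, 1.0]
s = [0, 0, 1, 1]
M = [[1.0], [11.0]]

C = calinski_harabasz(X, w, s, M)
print(C)
```

For the toy data, WGSS = 6, $\mu_G$ = 13/3, and BGSS = 1000/9, so C = 2 × (1000/9) / 6 = 1000/27. Evaluating C over a grid of penalty values and keeping the maximizer is one way to use it for selection.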

14 C finds an optimum
The chosen values are $(\lambda_L = 1232, \lambda_K = 2254)$.
[Figure: Calinski_Harabasz, plotted against Lambda_g and Lambda_l]

15 Correct Clusterings Estimated
Each panel presents a local clustering for industry, $j \in (1, \ldots, (J = 3))$.
We see $L_j = 5$ with the correct skewed allocation, sharing $K = 7$ global clusters.
[Figure: one panel per industry data set; axes: Number of Observations, Global Cluster]

16 Merges Increase at lower values for $(\lambda_K, \lambda_L)$
Higher number of merges for lower values of $(\lambda_K, \lambda_L)$.
[Figure: num_merges, plotted against Lambda_g and Lambda_l]

17 Outlier Detection Simulation Study Design
$J = 8$ local populations, $X_j$, with $N_j$ = [...].
$L_j = 2$ local clusters, one an outlier, sharing $K = 5$ global clusters.
Global cluster centers are $(d = 15) \times 1$ vectors:
$\mu_1 = (1, 1.5, 2.0, \ldots, 7.5, 8)$
$\mu_2 = (8, 7.5, \ldots, 1)$
$\mu_3 = (1, \ldots, 7, 8, 7, \ldots, 1)$
$\mu_4$ = sampling from $(1, \ldots, 8)$ with replacement, $d = 15$ times
$\mu_5$ = sampling from $(2, \ldots, 6)$ with replacement, $d = 15$ times
The outlier cluster, with mean $\mu_5$, is assigned 150 observations.
Stratified design of $H = 10$ strata; assign $\pi_h^j \propto$ variance of $X_h^j$.
$B = 100$ Monte Carlo draws.

18 Outlier Detection Accuracy
True positive = # of true outliers discovered / total # of true outliers.
False positive = # of false discoveries / total # nominated.
True positives measure effectiveness; false positives measure efficiency.
[Figure: panels true_pos and false_pos; Outlier Cluster Assignment Statistic with 95% CI by Estimation Type (hier, global, mbc, t-kmeans, global_ignore)]
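The two accuracy statistics can be computed directly from sets of unit identifiers. The sketch below is a minimal illustration with hypothetical names; it treats the units nominated as outliers and the true outliers as Python sets, and it assumes both sets are non-empty.

```python
def outlier_rates(true_outliers, nominated):
    """Return (true_pos, false_pos) as defined on this slide.

    true_pos  = # true outliers discovered / total # true outliers
    false_pos = # false discoveries / total # nominated
    """
    true_outliers = set(true_outliers)
    nominated = set(nominated)
    if not true_outliers or not nominated:
        raise ValueError("both sets must be non-empty")
    discovered = true_outliers & nominated
    false_discoveries = nominated - true_outliers
    true_pos = len(discovered) / len(true_outliers)
    false_pos = len(false_discoveries) / len(nominated)
    return true_pos, false_pos

# Hypothetical example: 4 true outliers, 3 nominations, 2 of them correct.
true_pos, false_pos = outlier_rates({1, 2, 3, 4}, {3, 4, 5})
print(true_pos, false_pos)
```

Note that the false positive statistic here is a share of nominations, not of all non-outlying units, so the two denominators differ.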

19 Estimation Bias of Outlier Center, $\mu_5$
Shown for each of the $d = 15$ dimensions.
The dashed line presents the true values.
[Figure: Outlier Cluster Centers with confidence intervals by Estimation Type (hier, global, mbc, gl_ignore)]

20 Take Aways
Fast hierarchical clustering captures dependencies among industry clusterings.
Incorporating sampling weights better detects outliers from the population.
Implemented in the growclusters package for R.


Reference: Terrance D. Savitsky. Scalable Approximate Bayesian Inference for Outlier Detection under Informative Sampling. Journal of Machine Learning Research 17 (2016) 1-49. Submitted 3/15; Revised 1/16; Published 12/16.