Model-Based Clustering for Online Crisis Identification in Distributed Computing
1 Model-Based Clustering for Crisis Identification in Distributed Computing. Dawn Woodard, Operations Research and Information Engineering, Cornell University; with Moises Goldszmidt, Microsoft Research.
2 Outline: Background and Overview; Model; Computation; Decision Making.
4 Distributed Computing: Commercial distributed computing providers offer remotely-hosted computing services, e.g. Microsoft's Exchange Hosted Services (EHS): 24/7 processing incl. spam filtering and encryption.
5 Distributed Computing: This processing is performed by farming out to many servers, often tens of thousands of servers in multiple locations. [Diagram: Client - Provider - Servers 1-3.]
6 Distributed Computing [figure]
7 Distributed Computing: Can have occasional severe violations of performance goals ("crises"), e.g. due to: servers becoming overloaded in periods of high demand; performance problems in lower-level computing centers on which the servers rely (e.g. for performing authentication). If the problem lasts for more than a few minutes, the provider must pay cash penalties to clients and risks losing contracts.
8 Distributed Computing [Figure: % of servers violating a performance goal (KPI) over a 10-day period in EHS. Exceeding the dotted line constitutes a crisis.]
9 Distributed Computing: Need to rapidly recognize the recurrence of a problem; if an effective intervention is known for this problem, it can be applied. Due to the large scale and interdependence, manual problem diagnosis is difficult and slow. Have a set of status measurements for each server, e.g. for EHS: CPU utilization; memory utilization; for each spam filter, the length of the queue and the throughput; ...
10 Distributed Computing Goal: Match a currently occurring (i.e., incompletely observed) crisis to previous crises of mixed known and unknown cause. I.e., are any previous crises of the same type as the new crisis? Which ones? This is an online clustering problem with: partial labeling; incomplete data for the new crisis. We use model-based clustering based on a Dirichlet process mixture (e.g. Escobar & West 1995); the evolution of each crisis is modeled as a time series.
11 Cost-Optimal Decision Making: Wish to perform optimal (expected-cost-minimizing) decision making during a crisis... while accounting for uncertainty in the crisis type assignments and the parameters of those types. This requires fully Bayesian inference.
12 Fully Bayesian Inference: We apply fully Bayesian inference (via MCMC) in the long periods between crises. Due to posterior multimodality, we combine a collapsed-space split-merge method with parallel tempering. As a new crisis begins, we update rapidly using an approximation.
13 Related Work: Ours is the first instance of fully Bayesian online clustering. Online model-based clustering was performed by Zhang, Ghahramani, and Yang (2004) for documents; they obtain a single cluster assignment based on the posterior, which is insufficient for optimal decision making. Fully Bayesian clustering: Bensmail, Celeux, Raftery, and Robert (1997); Pritchard, Stephens, and Donnelly (2000); Lau and Green (2007). There are many examples of fully Bayesian mixture modeling.
15 Data [Figure: medians of 3 metrics across servers, for a 10-day period (EHS).]
16 Data [Figure: the same series with crises highlighted; color indicates their known type.]
17 Data: The medians of the metrics are very informative as to crisis type, specifically whether the median is low, normal, or high. We fit our models to the median values of the metrics, discretized into 1: low, 2: normal, and 3: high.
18 Crisis: Time series model for crisis evolution. Y_{ilj}: value of metric j in the l-th time period after the start of crisis i. Assume that metrics are independent conditional on the crisis type. For crisis type k, Y_{i1j} is drawn from a discrete distribution with probability vector γ^{(jk)}, and Y_{ilj} evolves according to a Markov chain with transition matrix T^{(jk)}.
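The crisis model above is easy to simulate: draw the first period from γ^{(jk)}, then step through the Markov chain T^{(jk)}. A minimal sketch, where the specific γ and T values are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for one metric j under one crisis type k;
# states are 1 = low, 2 = normal, 3 = high (stored internally as 0..2).
gamma = np.array([0.1, 0.2, 0.7])            # initial probabilities gamma^(jk)
T = np.array([[0.80, 0.15, 0.05],            # transition matrix T^(jk)
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])

def simulate_metric(gamma, T, n_periods, rng):
    """Draw Y_i1j from gamma, then evolve Y_ilj by the Markov chain T."""
    states = [rng.choice(3, p=gamma)]
    for _ in range(n_periods - 1):
        states.append(rng.choice(3, p=T[states[-1]]))
    return np.array(states) + 1              # report in the paper's 1..3 coding

y = simulate_metric(gamma, T, n_periods=10, rng=rng)
```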
19 Crisis: Complete-data likelihood function:

π(D | {Z_i}_{i=1}^I, {γ^{(jk)}, T^{(jk)}}_{j,k}) = ∏_{i,j} [ ∏_t (γ_t^{(j,Z_i)})^{1(Y_{i1j}=t)} · ∏_{s,t} (T_{st}^{(j,Z_i)})^{n_{ijst}} ]   (1)

conditioning on the unknown type indicators Z_i of each crisis i = 1, ..., I, where n_{ijst} is the number of transitions of the j-th metric from state s to state t during crisis i.
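Equation (1) factorizes over metrics and crises, so each factor's log can be accumulated term by term from the observed trajectory. A sketch (the function name is ours) for one metric of one crisis:

```python
import numpy as np

def metric_loglik(y, gamma, T):
    """Log-likelihood of one metric's discretized trajectory y (values in
    1..3) for one crisis, under the initial probabilities gamma and the
    transition matrix T of its type: log gamma_{y_1} + sum over transitions
    of log T_{s,t}."""
    y = np.asarray(y) - 1                      # shift to 0-based states
    ll = np.log(gamma[y[0]])                   # initial-state term
    for s, t in zip(y[:-1], y[1:]):            # one term per transition
        ll += np.log(T[s, t])
    return ll

gamma = np.array([0.5, 0.3, 0.2])              # toy values
T = np.full((3, 3), 1.0 / 3.0)
ll = metric_loglik([1, 2, 2], gamma, T)        # log 0.5 + 2 * log(1/3)
```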
20 Cluster: Dirichlet process mixture (DPM) prior. Natural for online clustering: allows the number of clusters to increase with the number of crises; crises are exchangeable. Parameterized by: α, which controls the expected number of clusters occurring in a fixed number of crises; and G_0, the prior G_0(d{γ^{(jk)}, T^{(jk)}}_j) for the parameters associated with each cluster k.
21 Cluster: The DPM prior for the cluster indicators {Z_i}_{i=1}^I and the cluster parameters γ^{(jk)}, T^{(jk)}:

π({Z_i}_{i=1}^I) = ∏_{i=1}^I π(Z_i | {Z_{i'}}_{i'<i})
                 = ∏_{i=1}^I [ (α / (α+i-1)) 1(Z_i = m_{i-1}+1) + (1 / (α+i-1)) ∑_{i'<i} 1(Z_i = Z_{i'}) ]   (2)

where m_i = max{Z_{i'} : i' ≤ i} for i > 0 and m_0 = 0, and

π(d{γ^{(jk)}, T^{(jk)}}_{j,k} | {Z_i}_{i=1}^I) = ∏_{k=1}^{m_I} G_0(d{γ^{(jk)}, T^{(jk)}}_j).   (3)
22 Cluster: Also called the "Chinese Restaurant Process":

π(Z_i = k | {Z_{i'}}_{i'<i}) ∝ α if k is a new type; ∑_{i'<i} 1(Z_{i'} = k) otherwise.

Each observation i is a new guest who either sits at an occupied table with probability proportional to the number of guests at that table, or sits at an empty table.
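The sequential scheme above can be simulated directly as a sanity check on the prior; a sketch (function name ours):

```python
import numpy as np

def crp_sample(n, alpha, rng):
    """Sample cluster indicators Z_1..Z_n sequentially from the Chinese
    restaurant process: observation i joins existing cluster k with prob.
    proportional to its current size, or opens a new cluster with prob.
    proportional to alpha."""
    z = [1]                                              # first guest, table 1
    for i in range(2, n + 1):
        counts = np.bincount(z)[1:]                      # cluster sizes
        probs = np.append(counts, alpha) / (alpha + i - 1)
        z.append(int(rng.choice(len(probs), p=probs)) + 1)
    return z

rng = np.random.default_rng(1)
z = crp_sample(27, alpha=9.0, rng=rng)
```

The last entry of `probs` is the new-table probability, so labels stay contiguous (a new cluster always gets label m+1).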
23 Cluster: Now we can evaluate the posterior density (up to a normalizing constant):

π({Z_i}_{i=1}^I, {γ^{(jk)}, T^{(jk)}}_{j,k} | D) ∝ π({Z_i}_{i=1}^I) · π({γ^{(jk)}, T^{(jk)}}_{j,k} | {Z_i}_{i=1}^I) · π(D | {Z_i}_{i=1}^I, {γ^{(jk)}, T^{(jk)}}_{j,k})
24 Cluster: Partially labeled case. We have given the prior for the case where none of the crisis types Z_i are known. If we know that Z_i = Z*_i for the crises i in some labeled subset, multiply (2) by ∏_i 1(Z_i = Z*_i) over that subset.
25 Cluster: G_0: independent Dirichlet priors for γ^{(jk)} for each j; independent product-Dirichlet priors for T^{(jk)} for each j.
28 Computation: The cluster parameters {γ^{(jk)}, T^{(jk)}}_{j,k} can be integrated analytically out of the posterior. Run a Markov chain with target distribution π({Z_i}_{i=1}^I | D). Jain and Neal (2004) use a Gibbs sampler with an additional split-merge move on clusters; we add parallel tempering (Geyer 1991).
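Parallel tempering runs chains targeting π^β for a ladder of inverse temperatures β and occasionally proposes exchanging the states of adjacent chains. The swap acceptance probability below is the standard Metropolis one (a generic sketch, not the authors' code):

```python
import math

def swap_accept_prob(logpi_a, logpi_b, beta_a, beta_b):
    """Metropolis acceptance probability for swapping the states of two
    tempered chains, where logpi_* is the (untempered) log target density
    at each chain's current state and beta_* its inverse temperature."""
    return min(1.0, math.exp((beta_a - beta_b) * (logpi_b - logpi_a)))

# A swap between states of equal density is always accepted:
p_equal = swap_accept_prob(-10.0, -10.0, 1.0, 0.5)
# Moving a lower-density state onto the colder (beta = 1) chain is penalized:
p_worse = swap_accept_prob(-5.0, -10.0, 1.0, 0.5)
```

Swaps let modes found by the hot (small-β, flattened) chains propagate to the cold chain that targets the true posterior, which is what mitigates the multimodality noted earlier.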
30 Inference: Wish to identify a crisis in real time. Have data D from previous crises and data D_new so far for the new crisis. E.g., wish to estimate π(Z_new = Z_i | D, D_new) for each previous crisis i = 1, ..., I, and π(Z_new ≠ Z_i ∀i | D, D_new).
31 Exact Inference, Method 1: Just apply the Markov chain method to the data from the I+1 crises. This gives posterior sample vectors ({Z_i^(l)}_{i=1}^I, Z_new^(l)) for l = 1, ..., L, and Monte Carlo estimates of the desired probabilities:

ˆπ(Z_new = Z_i | D, D_new) = (1/L) ∑_{l=1}^L 1(Z_new^(l) = Z_i^(l))
ˆπ(Z_new ≠ Z_i ∀i | D, D_new) = (1/L) ∑_{l=1}^L 1(Z_new^(l) ≠ Z_i^(l) ∀i)

But running the Markov chain is too slow for real-time decision making!
32 Approximate Inference: We give a method using the approximation:

π(Z_new = Z_i | D, D_new) = ∑_{{Z_i}} π(Z_new = Z_i | {Z_i}_{i=1}^I, D, D_new) π({Z_i}_{i=1}^I | D, D_new)
                          ≈ ∑_{{Z_i}} π(Z_new = Z_i | {Z_i}_{i=1}^I, D, D_new) π({Z_i}_{i=1}^I | D)

This assumes that D_new does not tell us much about the past crisis types.
33 Approximate Inference, Method 2:
1. After the end of each crisis, rerun the Markov chain, yielding sample vectors {Z_i^(l)}_{i=1}^I from the posterior π({Z_i}_{i=1}^I | D).
2. When a new crisis begins, use its data D_new to calculate the Monte Carlo estimates:

ˆπ(Z_new = Z_i | D, D_new) = (1/L) ∑_{l=1}^L π(Z_new = Z_i^(l) | {Z_{i'}^(l)}_{i'=1}^I, D, D_new)
ˆπ(Z_new ≠ Z_i ∀i | D, D_new) = (1/L) ∑_{l=1}^L π(Z_new ≠ Z_i^(l) ∀i | {Z_{i'}^(l)}_{i'=1}^I, D, D_new).
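Step 2 only averages exact predictive probabilities over the stored posterior samples. The sketch below illustrates the averaging with toy inputs: the sample matrix, α, and the log marginal likelihoods of D_new are all hypothetical stand-ins for quantities the crisis model would supply.

```python
import numpy as np

# Hypothetical posterior samples of the past-crisis types {Z_i}
# (L = 3 samples, I = 4 crises), produced offline as in step 1.
Z_samples = np.array([[1, 1, 2, 3],
                      [1, 1, 2, 2],
                      [1, 2, 2, 3]])
alpha = 9.0
# Toy log marginal likelihoods of D_new under existing clusters 1..3 and
# under a brand-new cluster (prior predictive under G_0).
loglik_clusters = np.array([-2.0, -5.0, -4.0])
loglik_new = -6.0

def pred_probs(z):
    """pi(Z_new = k | {Z_i} = z, D, D_new) for k = 1..m+1: CRP predictive
    weights (cluster sizes; alpha for a new cluster) times the marginal
    likelihood of D_new, normalized."""
    m = z.max()
    counts = np.bincount(z, minlength=m + 1)[1 : m + 1]
    w = np.append(counts * np.exp(loglik_clusters[:m]),
                  alpha * np.exp(loglik_new))
    return w / w.sum()

# Monte Carlo estimate of pi(Z_new = Z_i | D, D_new) for each past crisis i:
est = np.zeros(Z_samples.shape[1])
for z in Z_samples:
    p = pred_probs(z)
    est += p[z - 1]       # probability that Z_new matches crisis i's type
est /= len(Z_samples)
```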
34 Approximate Inference: Step 2 is O(LIJ), very fast.
36 Optimal Decision Making: Want expected-cost-minimizing decision making during a crisis. The total cost of the new crisis is a function C(φ, {Z_i}_{i=1}^I, Z_new) of: the intervention φ; the true type Z_new of the current crisis; and the vector of past crisis types {Z_i}_{i=1}^I, which gives the context for Z_new. Finding the expected cost of the crisis for intervention φ requires integrating C over the posterior distribution of ({Z_i}_{i=1}^I, Z_new). Can be done exactly using Method 1, or approximately using Method 2.
39 Simulate I crises from the model; compare MBC with distance-based clustering.
40 Accuracy Criteria:
1. Pairwise Sensitivity: for pairs of crises of the same type, the % assigned to the same cluster (for MBC: the % having probability > 0.5 of being in the same cluster).
2. Pairwise Specificity: for pairs of crises not of the same type, the % assigned to different clusters (for MBC: the % having probability ≤ 0.5 of being in the same cluster).
3. Error of No. of Crisis Types: the % error of the estimated number of crisis types; for MBC, the posterior mean is used to estimate the number of types.
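The two pairwise criteria are direct to compute from a soft clustering. A sketch (function and variable names ours), taking a matrix of posterior co-clustering probabilities:

```python
from itertools import combinations

def pairwise_scores(true_types, co_prob):
    """Pairwise sensitivity and specificity (%) for a soft clustering, where
    co_prob[i][j] is the posterior probability that crises i and j share a
    cluster; a pair counts as 'assigned together' when that prob. > 0.5."""
    same_hits = same_n = diff_hits = diff_n = 0
    for i, j in combinations(range(len(true_types)), 2):
        together = co_prob[i][j] > 0.5
        if true_types[i] == true_types[j]:
            same_n += 1
            same_hits += together          # same-type pair put together
        else:
            diff_n += 1
            diff_hits += not together      # different-type pair kept apart
    return 100.0 * same_hits / same_n, 100.0 * diff_hits / diff_n

types = [1, 1, 2]                          # toy true types
co = [[1.0, 0.9, 0.2],                     # toy co-clustering probabilities
      [0.9, 1.0, 0.6],
      [0.2, 0.6, 1.0]]
sens, spec = pairwise_scores(types, co)
```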
41 Results (standard errors in parentheses; entries lost in transcription marked —):

No. Crises | No. Metrics | Method | Pairwise Sensitivity | Pairwise Specificity | % Error No. Types
— | — | MBC | 94.6 (2.08) | 99.0 (0.50) | 9.3 (1.87)
— | — | K-Means | — (4.26) | 95.3 (0.57) | —
— | — | K-Means | — (5.39) | 77.9 (1.73) | —
— | — | MBC | 99.0 (1.00) | 99.4 (0.41) | 3.7 (0.95)
— | — | K-Means | — (4.76) | 97.0 (0.54) | —
— | — | K-Means | — (4.01) | 78.2 (2.13) | —
— | — | MBC | 91.9 (1.88) | 98.8 (0.40) | 7.4 (1.58)
— | — | K-Means | — (3.19) | 95.5 (0.54) | —
— | — | K-Means | — (4.01) | 82.9 (1.16) | —
— | — | MBC | 99.6 (0.23) | 99.9 (0.05) | 3.5 (1.13)
— | — | K-Means | — (3.76) | 95.8 (0.57) | —
— | — | K-Means | — (4.76) | 83.0 (1.83) | —
— | — | MBC | 97.6 (0.65) | 99.8 (0.08) | 6.4 (1.81)
— | — | K-Means | — (3.43) | 95.9 (0.48) | —
— | — | K-Means | — (3.93) | 83.9 (1.15) | —
— | — | MBC | 99.5 (0.24) | 99.9 (0.03) | 3.4 (0.67)
— | — | K-Means | — (4.07) | 97.8 (0.27) | —
— | — | K-Means | — (4.74) | 86.7 (1.48) | —
42 MBC does far better than K-means. More metrics → better accuracy of MBC. More crises → better accuracy of MBC.
44 Compare Method 1 ("MBC-EX") to Method 2 ("MBC").
45 Accuracy Criteria:
1. Full-data misclassification rate: % of crises with incorrect predicted type, using all of the data for the new crisis.
2. p-period misclassification rate: % of crises with incorrect predicted type, using the first p time periods of data for the new crisis.
3. Average time to correct identification: average number of time periods required to obtain the correct identification.

("Correct predicted type" means ˆπ(Z_new = Z_i | D, D_new) > 0.5 for some i ≤ I such that Z*_new = Z*_i, and otherwise ˆπ(Z_new ≠ Z_i ∀i | D, D_new) > 0.5.)
46 Accuracy (standard errors in parentheses; entries lost in transcription marked —):

No. Crises | No. Metrics | Method | Full-data Misclassification | 3-period Misclassification | Avg. Time to Identification
— | — | MBC | 6.7 (3.0) | 10.7 (4.5) | 1.31 (0.11)
— | — | MBC-EX | 8 (2.5) | 10.7 (4.5) | —
— | — | MBC | 6.7 (5.2) | 9.3 (6.2) | 1.13 (0.08)
— | — | MBC-EX | 5.3 (3.9) | 8.0 (4.9) | —
— | — | MBC | 13.6 (2.7) | 15.2 (2.7) | 1.33 (0.13)
— | — | MBC-EX | 9.6 (2.0) | 15.2 (3.4) | —
— | — | MBC | 2.4 (1.6) | 4.0 (1.8) | 1.15 (0.06)
— | — | MBC-EX | 3.2 (1.5) | 3.2 (1.5) | —
47 Classification accuracy is high (> 80%) for both MBC and MBC-EX, and MBC is not significantly worse than MBC-EX. The 3-period misclassification is not much greater than the full-data misclassification: very early identification!
49 Application to EHS: 27 crises in EHS during Jan-Apr. The causes of some of these were diagnosed later:

ID | Cause | No. of known crises
A | overloaded front-end | 2
B | overloaded back-end | 8
C | database configuration error | 1
D | configuration error | 1
E | performance issue | 1
F | middle-tier issue | 1
G | whole DC turned off and on | 1
H | workload spike | 1
I | request routing error | 1
51 Application to EHS: Apply the Markov chain method to the set of 27 crises without the labels; compare to those labels.
52 Application to EHS [Trace plots of parallel tempering Markov chain samples of Z_22 at several inverse temperatures β.] Geweke diagnostic p-value: 0.44. Gelman-Rubin scale factor:
53 Application to EHS: The posterior-mode cluster assignment has 58% probability. Sizes of clusters (counts lost in transcription marked —):

ID | Cause | No. of known crises | No. identified by MBC | No. MBC matching known
A | overloaded front-end | 2 | — | —
B | overloaded back-end | 8 | — | —
C | database configuration error | 1 | — | —
D | configuration error | 1 | — (labeled as A) | —
E | performance issue | 1 | — (labeled as B) | —
F | middle-tier issue | 1 | — (labeled as I) | —
G | whole DC turned off and on | 1 | — (labeled as B) | —
H | workload spike | 1 | — | —
I | request routing error | 1 | — | —
54 Application to EHS: The posterior-mode crisis labels mostly match the known clusters; the largest 5 clusters are correctly labeled. Four uncommon crisis types are clustered with more common types: crises having different causes can have the same patterns in their metrics. Need to add metrics that distinguish these types effectively.
56 Application to EHS: Evaluate online accuracy, treating the posterior mode from the offline context as the gold standard.
Original ordering: 1. Full-data misclassification: 7.4%. 2. 3-period misclassification: 14.8%. 3. Avg. time to correct identification: 1.81.
Permuting the crises: 1. Full-data misclassification: 5.9% (SE = 3.4%). 2. 3-period misclassification: 11.8% (SE = 3.2%). 3. Avg. time to correct identification: 1.56 (SE = 0.07).
58 Gave a method for fully Bayesian real-time crisis identification in distributed computing, and described how to use it to perform rapid expected-cost-minimizing crisis intervention. Very accurate on both simulated data and data from a production computing center. A copy of this paper and seminar are available at:
59 References
Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90.
Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics, Vol. 23: Proc. of the 23rd Symp. on the Interface, ed. E. Keramidas.
Jain, S. and Neal, R. M. (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13.
Lau, J. W. and Green, P. J. (2007). Bayesian model-based clustering procedures. Journal of Computational and Graphical Statistics, 16.
Zhang, J., Ghahramani, Z., and Yang, Y. (2004). A probabilistic model for online document clustering with application to novelty detection. In Advances in Neural Information Processing Systems, ed. Y. Weiss.
60 Prior Constants: Prior hyperparameters are chosen by combining information in the data with expert opinion. They reflect the fact that the server status measurements are chosen to be indicative of crisis type. Results are far better than under a default prior specification, which contradicts the data and the experts.
61 Prior Constants
α: The probability that 2 randomly chosen crises are of the same type is 1/(α + 1). EHS experts estimate this as 0.1, giving α = 9, i.e. roughly 13 expected types in 27 crises.
γ^{(jk)} ~ Dir(a^{(j)}). To choose a^{(j)}: the prior mean of γ^{(jk)} is taken as the empirical distribution of Y_{i1j} over i and j, with substantial probability that one of the entries of γ^{(jk)} is close to 1:
π((γ_1^{(jk)} > .85) OR (γ_2^{(jk)} > .95) OR (γ_3^{(jk)} > .85)) = 0.5
Analogous for T^{(jk)}.
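The choice of α follows directly from the matching probability, and the implied number of types can be checked against the CRP's expected cluster count (a quick sketch):

```python
# Under the DP prior, two random crises share a type with prob. 1/(alpha + 1);
# the experts' estimate of 0.1 pins down alpha.
p_same = 0.1
alpha = 1.0 / p_same - 1.0                       # alpha = 9

# Expected number of distinct types among n crises under the CRP:
# sum_{i=0}^{n-1} alpha / (alpha + i).
n = 27
expected_types = sum(alpha / (alpha + i) for i in range(n))
```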
62 Optimal Decision Making: Want expected-cost-minimizing decision making during a crisis. The total cost of the new crisis is a function C(φ, {Z_i}_{i=1}^I, Z_new) of: the intervention φ; the true type Z_new of the current crisis; and the vector of past crisis types {Z_i}_{i=1}^I, which gives the context for Z_new.
63 Optimal Decision Making: If we knew C, then given posterior sample vectors ({Z_i^(l)}_{i=1}^I, Z_new^(l)), l = 1, ..., L, from the exact Method 1, the expected cost can be estimated as:

Ê(C) = (1/L) ∑_{l=1}^L C(φ, ({Z_i^(l)}_{i=1}^I, Z_new^(l))).

Have a similar expression for approximate inferences from Method 2.
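Given posterior samples and a cost table, the expected-cost-minimizing intervention follows by averaging and taking an argmin. In the sketch below all numbers are hypothetical, and for brevity the cost depends only on Z_new rather than on the full past-type vector as in the slide:

```python
import numpy as np

# Hypothetical cost table: cost[phi, k-1] = cost of intervention phi when
# the new crisis truly has type k (3 interventions x 3 candidate types).
cost = np.array([[1.0, 10.0, 10.0],
                 [8.0,  2.0,  9.0],
                 [9.0,  9.0,  3.0]])

# Hypothetical posterior samples of Z_new (types coded 1..3), e.g. from
# Method 1 or Method 2.
z_new = np.array([1, 1, 2, 1, 3, 1, 1, 2])

expected = cost[:, z_new - 1].mean(axis=1)   # estimated E[C] per intervention
best_phi = int(expected.argmin())            # expected-cost-minimizing choice
```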
64 Optimal Decision Making: Don't know C in practice. For interventions φ taken during previous crises, C can be estimated from realized costs; otherwise C can be estimated from expert knowledge.
65 Optimal Decision Making: Since the goal is optimal intervention, and since this requires the entire posterior distribution over ({Z_i}_{i=1}^I, Z_new), we avoid choosing a best cluster assignment, instead focusing on the accuracy of the soft identification, i.e. the posterior distribution over ({Z_i}_{i=1}^I, Z_new).
66 K-means: Criteria for choosing the number of clusters do not work well in our context, so we apply K-means using the true number of clusters ("K-means 1") and half the true number of clusters ("K-means 2"). This is unrealistically optimistic... but K-means still does terribly.
More informationScalable Bayes Clustering for Outlier Detection Under Informative Sampling
Scalable Bayes Clustering for Outlier Detection Under Informative Sampling Based on JMLR paper of T. D. Savitsky Terrance D. Savitsky Office of Survey Methods Research FCSM - 2018 March 7-9, 2018 1 / 21
More informationNetwork Lasso: Clustering and Optimization in Large Graphs
Network Lasso: Clustering and Optimization in Large Graphs David Hallac, Jure Leskovec, Stephen Boyd Stanford University September 28, 2015 Convex optimization Convex optimization is everywhere Introduction
More informationStephen Scott.
1 / 33 sscott@cse.unl.edu 2 / 33 Start with a set of sequences In each column, residues are homolgous Residues occupy similar positions in 3D structure Residues diverge from a common ancestral residue
More informationDynamic Bayesian network (DBN)
Readings: K&F: 18.1, 18.2, 18.3, 18.4 ynamic Bayesian Networks Beyond 10708 Graphical Models 10708 Carlos Guestrin Carnegie Mellon University ecember 1 st, 2006 1 ynamic Bayesian network (BN) HMM defined
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning
More informationPost-Processing for MCMC
ost-rocessing for MCMC Edwin D. de Jong Marco A. Wiering Mădălina M. Drugan institute of information and computing sciences, utrecht university technical report UU-CS-23-2 www.cs.uu.nl ost-rocessing for
More informationGiRaF: a toolbox for Gibbs Random Fields analysis
GiRaF: a toolbox for Gibbs Random Fields analysis Julien Stoehr *1, Pierre Pudlo 2, and Nial Friel 1 1 University College Dublin 2 Aix-Marseille Université February 24, 2016 Abstract GiRaF package offers
More informationLecture 21 : A Hybrid: Deep Learning and Graphical Models
10-708: Probabilistic Graphical Models, Spring 2018 Lecture 21 : A Hybrid: Deep Learning and Graphical Models Lecturer: Kayhan Batmanghelich Scribes: Paul Liang, Anirudha Rayasam 1 Introduction and Motivation
More informationProject Report: "Bayesian Spam Filter"
Humboldt-Universität zu Berlin Lehrstuhl für Maschinelles Lernen Sommersemester 2016 Maschinelles Lernen 1 Project Report: "Bayesian E-Mail Spam Filter" The Bayesians Sabine Bertram, Carolina Gumuljo,
More informationHandling Data with Three Types of Missing Values:
Handling Data with Three Types of Missing Values: A Simulation Study Jennifer Boyko Advisor: Ofer Harel Department of Statistics University of Connecticut Storrs, CT May 21, 2013 Jennifer Boyko Handling
More informationUsing Machine Learning to Optimize Storage Systems
Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation
More informationBayesian Robust Inference of Differential Gene Expression The bridge package
Bayesian Robust Inference of Differential Gene Expression The bridge package Raphael Gottardo October 30, 2017 Contents Department Statistics, University of Washington http://www.rglab.org raph@stat.washington.edu
More informationThe Multi Stage Gibbs Sampling: Data Augmentation Dutch Example
The Multi Stage Gibbs Sampling: Data Augmentation Dutch Example Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601 Module 8 1 Example: Data augmentation / Auxiliary variables A commonly-used
More informationProbabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation
Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation Daniel Lowd January 14, 2004 1 Introduction Probabilistic models have shown increasing popularity
More informationBootstrapping Methods
Bootstrapping Methods example of a Monte Carlo method these are one Monte Carlo statistical method some Bayesian statistical methods are Monte Carlo we can also simulate models using Monte Carlo methods
More informationLearning stick-figure models using nonparametric Bayesian priors over trees
Learning stick-figure models using nonparametric Bayesian priors over trees Edward W. Meeds, David A. Ross, Richard S. Zemel, and Sam T. Roweis Department of Computer Science University of Toronto {ewm,
More informationCSCI 599 Class Presenta/on. Zach Levine. Markov Chain Monte Carlo (MCMC) HMM Parameter Es/mates
CSCI 599 Class Presenta/on Zach Levine Markov Chain Monte Carlo (MCMC) HMM Parameter Es/mates April 26 th, 2012 Topics Covered in this Presenta2on A (Brief) Review of HMMs HMM Parameter Learning Expecta2on-
More information10-701/15-781, Fall 2006, Final
-7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly
More informationThe Cross-Entropy Method for Mathematical Programming
The Cross-Entropy Method for Mathematical Programming Dirk P. Kroese Reuven Y. Rubinstein Department of Mathematics, The University of Queensland, Australia Faculty of Industrial Engineering and Management,
More informationModeling Dyadic Data with Binary Latent Factors
Modeling Dyadic Data with Binary Latent Factors Edward Meeds Department of Computer Science University of Toronto ewm@cs.toronto.edu Radford Neal Department of Computer Science University of Toronto radford@cs.toronto.edu
More informationMissing Data and Imputation
Missing Data and Imputation NINA ORWITZ OCTOBER 30 TH, 2017 Outline Types of missing data Simple methods for dealing with missing data Single and multiple imputation R example Missing data is a complex
More informationCHAPTER 1 INTRODUCTION
Introduction CHAPTER 1 INTRODUCTION Mplus is a statistical modeling program that provides researchers with a flexible tool to analyze their data. Mplus offers researchers a wide choice of models, estimators,
More informationHierarchical Mixture Models for Nested Data Structures
Hierarchical Mixture Models for Nested Data Structures Jeroen K. Vermunt 1 and Jay Magidson 2 1 Department of Methodology and Statistics, Tilburg University, PO Box 90153, 5000 LE Tilburg, Netherlands
More informationOverview. Monte Carlo Methods. Statistics & Bayesian Inference Lecture 3. Situation At End Of Last Week
Statistics & Bayesian Inference Lecture 3 Joe Zuntz Overview Overview & Motivation Metropolis Hastings Monte Carlo Methods Importance sampling Direct sampling Gibbs sampling Monte-Carlo Markov Chains Emcee
More informationBAYESIAN OUTPUT ANALYSIS PROGRAM (BOA) VERSION 1.0 USER S MANUAL
BAYESIAN OUTPUT ANALYSIS PROGRAM (BOA) VERSION 1.0 USER S MANUAL Brian J. Smith January 8, 2003 Contents 1 Getting Started 4 1.1 Hardware/Software Requirements.................... 4 1.2 Obtaining BOA..............................
More informationVariational Methods for Graphical Models
Chapter 2 Variational Methods for Graphical Models 2.1 Introduction The problem of probabb1istic inference in graphical models is the problem of computing a conditional probability distribution over the
More informationADAPTIVE METROPOLIS-HASTINGS SAMPLING, OR MONTE CARLO KERNEL ESTIMATION
ADAPTIVE METROPOLIS-HASTINGS SAMPLING, OR MONTE CARLO KERNEL ESTIMATION CHRISTOPHER A. SIMS Abstract. A new algorithm for sampling from an arbitrary pdf. 1. Introduction Consider the standard problem of
More informationGenerative and discriminative classification techniques
Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14
More informationSearch trees, tree, B+tree Marko Berezovský Radek Mařík PAL 2012
Search trees, 2-3-4 tree, B+tree Marko Berezovský Radek Mařík PL 2012 p 2
More informationPackage DPBBM. September 29, 2016
Type Package Title Dirichlet Process Beta-Binomial Mixture Version 0.2.5 Date 2016-09-21 Author Lin Zhang Package DPBBM September 29, 2016 Maintainer Lin Zhang Depends R (>= 3.1.0)
More informationKeeping flexible active contours on track using Metropolis updates
Keeping flexible active contours on track using Metropolis updates Trausti T. Kristjansson University of Waterloo ttkri stj @uwater l oo. ca Brendan J. Frey University of Waterloo f r ey@uwater l oo. ca
More informationReddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011
Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions
More informationAn Introduction to Markov Chain Monte Carlo
An Introduction to Markov Chain Monte Carlo Markov Chain Monte Carlo (MCMC) refers to a suite of processes for simulating a posterior distribution based on a random (ie. monte carlo) process. In other
More informationIntegrating Dirichlet Reputation into Usage Control
Integrating Dirichlet Reputation into Usage Control Li Yang and Alma Cemerlic University of Tennessee at Chattanooga Cyber Security and Information Intelligence Research Workshop 2009 Motivation of the
More informationA Dynamic Bayesian Network Click Model for Web Search Ranking
A Dynamic Bayesian Network Click Model for Web Search Ranking Olivier Chapelle and Anne Ya Zhang Apr 22, 2009 18th International World Wide Web Conference Introduction Motivation Clicks provide valuable
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Clustering and EM Barnabás Póczos & Aarti Singh Contents Clustering K-means Mixture of Gaussians Expectation Maximization Variational Methods 2 Clustering 3 K-
More informationA Capacity Planning Methodology for Distributed E-Commerce Applications
A Capacity Planning Methodology for Distributed E-Commerce Applications I. Introduction Most of today s e-commerce environments are based on distributed, multi-tiered, component-based architectures. The
More informationStatistical techniques for data analysis in Cosmology
Statistical techniques for data analysis in Cosmology arxiv:0712.3028; arxiv:0911.3105 Numerical recipes (the bible ) Licia Verde ICREA & ICC UB-IEEC http://icc.ub.edu/~liciaverde outline Lecture 1: Introduction
More informationModeling Criminal Careers as Departures From a Unimodal Population Age-Crime Curve: The Case of Marijuana Use
Modeling Criminal Careers as Departures From a Unimodal Population Curve: The Case of Marijuana Use Donatello Telesca, Elena A. Erosheva, Derek A. Kreader, & Ross Matsueda April 15, 2014 extends Telesca
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 17 EM CS/CNS/EE 155 Andreas Krause Announcements Project poster session on Thursday Dec 3, 4-6pm in Annenberg 2 nd floor atrium! Easels, poster boards and cookies
More informationA Bayesian approach to artificial neural network model selection
A Bayesian approach to artificial neural network model selection Kingston, G. B., H. R. Maier and M. F. Lambert Centre for Applied Modelling in Water Engineering, School of Civil and Environmental Engineering,
More information