An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures José Ramón Pasillas-Díaz, Sylvie Ratté Presenter: Christoforos Leventis 1
Basic concepts Outlier detection Construction of ensembles Two unsupervised approaches based on a weighted combination of outlier detection algorithms 2
Outlier detection The discovery of observations that deviate from normal behaviour A field that is rapidly evolving through time New algorithms are continually being designed to detect these rare but crucial events 3
Approaches to outlier detection
Supervised algorithms: use labels for both outliers and inliers (+) High accuracy (-) Labelled data are harder to obtain (-) Risk of training the model with mislabelled data
Semi-supervised algorithms: use labels for inliers only (+) Avoids the bias introduced by training the model with anomalous data
Unsupervised algorithms: no use of labels at all (+) Unlabelled data are easier to obtain 4
Outlier detection algorithms Local Outlier Factor (LOF) K-means Hierarchical clustering Modified box plot 5
Local Outlier Factor One of the best-performing outlier detection techniques Based on the density of a point's neighbourhood Example with number of nearest neighbours = 2 Manhattan distance example: X(a,b), Y(c,d) => |a-c| + |b-d| 6
Local Outlier Factor LOF(o) ~ 1 means Similar density as neighbors LOF(o) < 1 means Higher density than neighbors (Inlier) LOF(o) > 1 means Lower density than neighbors (Outlier) 7
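The LOF computation described above can be sketched in plain Python, using Manhattan distance and k = 2 as in the slide's example (`lof_scores` is an illustrative helper, not the authors' code):

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def lof_scores(points, k=2):
    n = len(points)
    dists = [[manhattan(points[i], points[j]) for j in range(n)] for i in range(n)]
    neighbours, kdist = [], []
    for i in range(n):
        order = sorted((dists[i][j], j) for j in range(n) if j != i)
        neighbours.append([j for _, j in order[:k]])
        kdist.append(order[k - 1][0])        # distance to the k-th neighbour

    def lrd(i):
        # local reachability density: inverse of the mean reachability distance
        reach = [max(kdist[j], dists[i][j]) for j in neighbours[i]]
        return k / sum(reach)

    # LOF: mean ratio of the neighbours' density to the point's own density
    return [sum(lrd(j) for j in neighbours[i]) / (k * lrd(i)) for i in range(n)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(pts)   # the isolated point (5,5) gets LOF >> 1
```

The four clustered points come out with LOF ~ 1 (similar density to their neighbours), while the isolated point scores well above 1, matching the interpretation on this slide.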
K-means Distance-based approach Divides data into groups depending on the closest centroid The outlierness of a point is equal to its distance to the closest centroid Example with 3 centroids (k=3): 1. K initial "means" are randomly generated within the data domain 2. K clusters are created by associating every observation with the nearest centroid 3. The centroid of each of the K clusters becomes the new mean 4. Steps 2 and 3 are repeated until convergence has been reached 8
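The four steps above can be sketched as follows; for reproducibility this sketch initialises the centroids with the first k points rather than randomly (an assumption, the slide specifies random initialisation), and the outlier score is the distance to the closest centroid:

```python
import math

def kmeans_outlier_scores(points, k=2, iters=20):
    centroids = list(points[:k])                     # step 1 (deterministic init)
    for _ in range(iters):
        # step 2: assign every observation to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # step 3: the mean of each cluster becomes the new centroid
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
        # step 4: repeating a fixed number of times stands in for convergence
    # outlierness = distance to the closest centroid
    return [min(math.dist(p, c) for c in centroids) for p in points]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 20)]
scores = kmeans_outlier_scores(pts, k=2)   # (5, 20) gets the largest score
```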
Hierarchical clustering Distance-based approach Divides data into clusters until the data cannot be divided any further A point is an outlier when it presents more resistance to being merged into a cluster Example: 9
Modified boxplot Simple statistics-based approach Example:
Data: {3, 12, 15, 16, 16, 17, 19, 34} Min = 3, Max = 34, Q2 (median) = 16, Q1 = 13.5, Q3 = 18
1. 1.5(IQR) rule = 1.5 * (Q3 - Q1) = 1.5 * (18 - 13.5) = 6.75
2. Lower fence = Q1 - 6.75 = 13.5 - 6.75 = 6.75
3. Upper fence = Q3 + 6.75 = 18 + 6.75 = 24.75 10
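The worked example above can be reproduced with a short sketch (quartiles computed as the medians of the lower and upper halves, matching the slide's values):

```python
def quartiles(data):
    s = sorted(data)
    half = len(s) // 2

    def median(x):
        m = len(x) // 2
        return x[m] if len(x) % 2 else (x[m - 1] + x[m]) / 2

    # Q1 / Q3 = medians of the lower / upper half of the sorted data
    return median(s[:half]), median(s), median(s[len(s) - half:])

def boxplot_outliers(data):
    q1, _, q3 = quartiles(data)
    fence = 1.5 * (q3 - q1)                    # the 1.5(IQR) rule
    return [x for x in data if x < q1 - fence or x > q3 + fence]

data = [3, 12, 15, 16, 16, 17, 19, 34]
boxplot_outliers(data)   # 3 and 34 fall outside the fences [6.75, 24.75]
```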
Categories of learning techniques Single Learning Ensemble Learning: Boosting, Stacking, Bagging, Feature Bagging Breadth First, Feature Bagging Cumulative Sum 11
Single learner The output Y contains errors Each algorithm has its own bias Some algorithms tend to overfit 12
Ensemble learning Factors for ensembling: Accuracy (quality of the output) Diversity (distinct, complementary results) A mix of algorithms whose errors are not identical 13
Why are ensembles important? Turn weak learners into a strong learner Lower error than any individual method by itself Increased detection rate Less overfitting than any individual method by itself Combining the scores together reduces the bias 14
Boosting 15
Stacking Meta model 16
Bagging 17
Feature Bagging Breadth First Sorts the outlier scores from all iterations of FB Takes the index of the highest score and appends it to a vector Final output: IndFINAL contains the indices of the data records ASFINAL contains the probabilities of being an outlier Sensitive to the order of the outlier detection algorithms 18
Feature Bagging Cumulative Sum Creates a vector in which each entry is the sum of all the scores that correspond to one observation (NC) Sorts the vectors of each algorithm Finally uses the ranking to identify the outliers e.g. sum(NC1) = AS1,1 + AS2,4 + ... + ASt,2 NC1 is ranked 1st outlier by algorithm 1, 4th outlier by algorithm 2, 2nd outlier by algorithm t NC2 is ranked 4th outlier by algorithm 1 19
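The cumulative-sum combination can be sketched as follows (a minimal illustration of the idea; `score_matrix` stands in for the AS score vectors of the slide):

```python
def cumulative_sum_ranking(score_matrix):
    # score_matrix[t][i] = outlier score of observation i from algorithm t
    n = len(score_matrix[0])
    # NC: per-observation sum of the scores across all algorithms
    totals = [sum(alg[i] for alg in score_matrix) for i in range(n)]
    # rank observations by total score, most outlying first
    return sorted(range(n), key=lambda i: totals[i], reverse=True)

scores = [[0.1, 0.9, 0.2],    # algorithm 1
          [0.2, 0.8, 0.1]]    # algorithm 2
cumulative_sum_ranking(scores)   # observation 1 is ranked first
```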
Comparing ensemble techniques
Bagging: data partitioning - random samples drawn with replacement; goal - minimize variance; fusion of models - (weighted) average
Boosting: data partitioning - every new subset contains the samples that were misclassified by previous models; goal - increase predictive power; fusion of models - (weighted) majority vote
Stacking: data partitioning - various; goal - both; fusion of models - meta-model to estimate the weights 20
Hypothesis & Solution Better performance can be achieved by joining the outputs of different algorithms and setting specific weights without prior knowledge of the output labels Two unsupervised ensemble approaches based on a weighted combination of outlier detection algorithms: 1. Give weights based on the performance of each algorithm 2. Increase the differentiation between inliers and outliers by building the ensemble from a varied set of algorithms 21
Approach: Algorithm 22
Standardization of scores Normalization method used in ensemble outlier detection Different outlier detection algorithms produce scores at different scales LOF tends to produce values close to 1 Hierarchical clustering produces values on a larger scale Method: Z = (Xi - Mean) / SD, SD = standard deviation Without standardization, large-scale scores would keep their large values after joining the ensemble 23
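The z-score formula above is a one-liner over each algorithm's score vector:

```python
from statistics import mean, stdev

def standardize(scores):
    # Z = (Xi - mean) / SD, applied to one algorithm's score vector
    m, sd = mean(scores), stdev(scores)
    return [(x - m) / sd for x in scores]

standardize([1, 2, 3, 4, 5])   # mean 0, standard deviation 1
```

After this step, LOF scores near 1 and large-scale hierarchical-clustering scores live on the same scale and can be combined fairly.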
Determine Votes 1. Take the standardized scores of each algorithm 2. Apply the modified boxplot in order to find the deviations that are greater than the rest 3. An observation receives a vote IF its score lies beyond the 1.5*IQR fence 4. Output: vector of votes with size m x T (m = number of observations, T = number of algorithms) 24
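The vote step can be sketched by running the modified box-plot fence over each algorithm's standardized scores (a minimal sketch; it assumes only deviations above the upper fence earn a vote, since large scores mean outliers):

```python
def determine_votes(F):
    # F[t] = standardized scores of algorithm t; returns a T x m vote matrix
    V = []
    for col in F:
        s = sorted(col)
        half = len(s) // 2

        def median(x):
            m = len(x) // 2
            return x[m] if len(x) % 2 else (x[m - 1] + x[m]) / 2

        q1, q3 = median(s[:half]), median(s[len(s) - half:])
        upper = q3 + 1.5 * (q3 - q1)          # modified box-plot upper fence
        V.append([1 if x > upper else 0 for x in col])
    return V

determine_votes([[3, 12, 15, 16, 16, 17, 19, 34]])   # only 34 gets a vote
```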
Determine weights W Each outlier algorithm has a score assigned for each observation, but the score alone is not enough The weight vector W: increases the weight of outliers, maintains the weight of inliers The weight vector W is calculated with two approaches: Ensemble of Detectors with Correlated Votes (EDCV) Ensemble of Detectors with Variability Votes (EDVV) 25
Approach : EDCV 26
Approach : EDCV A matrix of correlations C is obtained by calculating the correlation between the standardized scores F Take each row and sum its values W = { w1, w2, ..., wT } (the weights corresponding to each algorithm) For each row sum, apply: (sum - 1) / (T - 1) 27
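A minimal sketch of the EDCV weights, assuming C is the T x T matrix of Pearson correlations between the algorithms' standardized score vectors (each row sum includes the self-correlation of 1, which explains the (sum - 1) / (T - 1) step):

```python
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def edcv_weights(F):
    # F[t] = standardized score vector of algorithm t
    T = len(F)
    C = [[pearson(F[i], F[j]) for j in range(T)] for i in range(T)]
    # subtract the self-correlation (1) and average over the other T-1 algorithms
    return [(sum(row) - 1) / (T - 1) for row in C]

# three perfectly agreeing detectors -> every weight is 1
edcv_weights([[1, 2, 3], [2, 4, 6], [1, 2, 3]])
```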
Approach : EDVV 28
Approach : EDVV A matrix D is obtained by calculating the mean absolute deviation between the standardized scores F, transformed to a compatible form by taking the complement 1 - MAD Take each row and sum its values W = { w1, w2, ..., wT } (the weights corresponding to each algorithm) For each row sum, apply: sum / (T - 1) 29
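A minimal sketch of the EDVV weights under two assumptions: the MAD is computed pairwise between the algorithms' standardized score vectors, and the self-term (complement 1) is excluded from the row sum so that sum / (T - 1) is an average over the other algorithms:

```python
def edvv_weights(F):
    # F[t] = standardized score vector of algorithm t
    T, m = len(F), len(F[0])

    def mad(a, b):
        # mean absolute deviation between two score vectors
        return sum(abs(x - y) for x, y in zip(a, b)) / m

    # the complement 1 - MAD turns disagreement into agreement;
    # excluding the diagonal (an assumption) gives identical detectors weight 1
    return [sum(1 - mad(F[i], F[j]) for j in range(T) if j != i) / (T - 1)
            for i in range(T)]

edvv_weights([[0.1, 0.2], [0.1, 0.2], [0.1, 0.2]])   # identical -> weight 1
```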
Votes vs Weights Step 8: determine votes (V) Step 9: determine weights (W) 1. Votes increase the difference between outliers and inliers 2. Votes are produced individually for each observation 3. Weights maintain the actual weight of an inlier and increase the weight of an outlier 30
Combining scores 1. Calculate the product of each standardized score in F and its corresponding vote in matrix V 2. The resulting values are updated by applying the weights W obtained with one of the two approaches 31
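The two combination steps can be sketched as follows (an assumed aggregation: per-observation sum of score x vote x algorithm weight, with 0/1 votes zeroing out inlier scores):

```python
def combine_scores(F, V, W):
    # F[t][i]: standardized score, V[t][i]: vote (0/1), W[t]: algorithm weight
    T, m = len(F), len(F[0])
    # step 1: F * V silences observations that received no vote;
    # step 2: the surviving scores are scaled by the algorithm weights W
    return [sum(F[t][i] * V[t][i] * W[t] for t in range(T)) for i in range(m)]

F = [[0.5, 2.0], [0.4, 1.8]]
V = [[0, 1], [0, 1]]
W = [1.0, 0.5]
combine_scores(F, V, W)   # observation 1 keeps a weighted score, 0 drops to 0
```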
Recap Algorithms used in the ensemble: LOF, K-means clustering, hierarchical clustering, modified boxplot T is the number of rounds and was set to 4 Apply the weights W obtained from one of the approaches (EDCV, EDVV) and update the values 32
Experiments Compared the results of their approaches (EDCV, EDVV) with simple averaging, FB Cumulative Sum and FB Breadth First Both FB algorithms were set to 50 iterations Simple averaging, EDCV and EDVV were set to 4 iterations LOF: number of neighbours = 20 K-means: K = 11 Hierarchical clustering & modified boxplot: default settings 33
Datasets Info 34
Evaluation (ROC) 35
Evaluation (AUC) 36
Evaluation Conclusions AUC: EDCV & EDVV outperformed the rest on almost all datasets On the Ann_thyroid dataset, FB Breadth First showed a strong dependence on the order of the outputs of the algorithms ROC: EDCV & EDVV show better results on all datasets except Ann_thyroid and Satimage, where only EDCV scores higher than the rest 37
Evaluation Conclusions EDCV & EDVV do not assume exceptionally good performance of the algorithms EDCV & EDVV assign weights to the algorithms based on their performance on each dataset EDCV & EDVV showed consistent improvement on datasets that were originally designed for binary classification 38
Conclusion & Future work Conclusion: two novel unsupervised ensemble approaches for combining the output scores of different outlier algorithms: EDCV & EDVV Both approaches achieved better performance than similar methods on almost all datasets, with only 4 iterations versus the 50 iterations used by FB Future work: use a Feature Bagging variation in order to achieve better results 39
Discussion
Motivation: the authors proposed two novel, completely unsupervised approaches for combining the outputs of different outlier detection algorithms (EDCV & EDVV) that outperformed similar methods with a reasonable number of iterations
Technical depth: the paper reports experimental results on a varied set of datasets; the experiments cover all proposed methods and compare them with similar methods
Novelty: their approach tries to achieve better accuracy on outliers not by comparing the outputs of the algorithms, but by assigning weights based on their performance
Soundness: non-trivial but easy to follow
Presentation: structured presentation, with almost no figures and zero examples 40