Anomaly detection in massive datasets



1 Anomaly detection in massive datasets
Ennio Ottaviani, OnAIR s.r.l.; University of Genova, Dept. of Mathematics, SMID

2 Outline
1. Introduction
2. Classical approaches
3. Emerging trends
4. Computational issues
5. Case studies
6. Conclusions
Mining a needle in a haystack: so much hay and so little time.

3 Introduction
Historically, anomaly detection was the branch of statistics that aimed to find and remove outliers in order to improve data analysis.
Anomaly is an important notion in human understanding of the environment, indicating a deviation from the normal order or from a given rule.
There are now many fields where the anomalies are the objects of greatest interest (e.g. fraud detection).
Rare events may be the ones with the greatest impact, often a negative one.

4 Anomalies classification
Anomalies are due to many causes: data from different objects or underlying mechanisms; natural intrinsic variability of data; data measurement and collection errors.
Anomalies can be classified into: point anomalies; contextual anomalies; collective anomalies.

5 Point anomalies
An individual data instance is anomalous with respect to the rest of the data.
In the figure, O1 and O2 are anomalies; O3 perhaps.

6 Contextual anomalies
An individual data instance is anomalous within a context (this requires a notion of context).

7 Collective anomalies
A (possibly large) collection of related data instances is anomalous.
This requires a relationship among data instances (spatial, sequential, graph-based, etc.).
The individual instances within a collective anomaly are not anomalous by themselves.

8 Applications
Network intrusion; insurance / credit card fraud; healthcare informatics / medical diagnostics; industrial damage detection; predictive maintenance; image processing / video surveillance; novel topic detection in text mining.

9 Use of data labels
Supervised anomaly detection: labels are available for both normal data and anomalies; similar to classification but with high class imbalance.
Semi-supervised anomaly detection: labels are available only for normal data.
Unsupervised anomaly detection: no labels are assumed; based on the assumption that anomalies are very rare.

10 Input / Output data
Observations: numerical (discrete or continuous), categorical, binary, or a general combination of them.
Label: each test instance is given a normal or anomaly label; this is the typical output of classification-based approaches.
Score: each test instance is assigned an anomaly score, which allows outputs to be ranked; an additional threshold parameter is required to make a decision.

11 Some variants
Given a dataset D, find all the data points x ∈ D with anomaly score A(x) greater than some threshold t.
Given a dataset D, find all the data points x ∈ D having the top n largest anomaly scores A(x).
Given a dataset D containing mostly normal data points, and a new test point x, compute the anomaly score A(x) with respect to D.
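The first two variants differ only in how the scores are turned into a decision. A minimal sketch, assuming the anomaly scores A(x) have already been computed by some detector; the function names and the score values are illustrative and not part of the slides:

```python
import heapq

def above_threshold(scores, t):
    """Variant 1: indices of the points whose anomaly score exceeds t."""
    return [i for i, a in enumerate(scores) if a > t]

def top_n(scores, n):
    """Variant 2: indices of the n largest anomaly scores, highest first."""
    return heapq.nlargest(n, range(len(scores)), key=lambda i: scores[i])

scores = [0.1, 0.9, 0.3, 0.7, 0.2]   # illustrative values only
flagged = above_threshold(scores, 0.5)
ranked = top_n(scores, 2)
```

The threshold variant needs a calibrated t, while the top-n variant only needs a budget of points to inspect.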

12 Unsupervised anomaly detection
Training: build a profile of normal behavior (summary statistics for the overall population; model of the multivariate data distribution).
Testing: use the normal profile to detect anomalies; anomalies are observations whose characteristics differ significantly from the normal profile.
Classical techniques are based on: statistical analysis; proximity or density estimation; clustering; subspace methods.

13 Statistical anomaly detection
DEF: outliers are data that are poorly fit by a suitable statistical model.
Training: estimate a parametric model describing the distribution of the data.
Testing: apply a statistical test depending on: properties of the test instance; parameters of the model; confidence interval.

14 The Gaussian case
With multivariate Gaussian distributions, outliers are detected by thresholding the Mahalanobis distance.
Only the sample mean and covariance are needed.
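A minimal sketch of the Mahalanobis score, restricted to two dimensions so that the covariance matrix can be inverted in closed form without external libraries. The function names and the sample points are illustrative assumptions:

```python
import math

def mean_and_cov_2d(points):
    """Sample mean and unbiased sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(x, mean, cov):
    """sqrt((x - mean)^T C^-1 (x - mean)) using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy          # assumed non-zero (non-degenerate data)
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Positively correlated toy data (illustrative values only).
train = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (2.5, 2.5)]
mean, cov = mean_and_cov_2d(train)
d_along = mahalanobis_2d((4.0, 4.0), mean, cov)    # follows the correlation
d_across = mahalanobis_2d((4.0, 1.0), mean, cov)   # violates the correlation
```

Both test points lie at the same Euclidean distance from the mean, yet the one that violates the covariance structure gets twice the Mahalanobis distance.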

15 Maximum likelihood
Assume D contains samples from a mixture of two distributions: M (majority) and A (anomalies), weighted by a factor λ (the expected outlier rate).
M is estimated in any way; A is typically uniform.
Initially, assume all the data points belong to M and compute the log-likelihood L_0(D).
For each x_t in D, move it to A and compute L_{t+1}(D).
If the likelihood gain is higher than a threshold, leave x_t in A; otherwise undo the move.
Repeat until no further move is accepted.
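The log-likelihood used in this procedure can be written out explicitly. The form below is the standard one for a two-component mixture D = (1 − λ)M + λA, assuming λ is the weight of the anomaly component; M_t and A_t denote the sets of points currently assigned to each component, and P_M, P_A their densities (symbols introduced here for illustration):

```latex
L_t(D) = |M_t|\,\log(1-\lambda) + \sum_{x \in M_t} \log P_M(x)
       + |A_t|\,\log\lambda + \sum_{x \in A_t} \log P_A(x)
```

Moving a point x_t from M to A is accepted when the gain L_{t+1}(D) − L_t(D) exceeds the chosen threshold.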

16 Pros / Cons
Pros: statistical tests are well understood and well validated; they give a reliable quantitative measure of the degree to which an object is an outlier.
Cons: data may be very hard to model parametrically, due to multiple modes of irregularly variable density; in high dimensions, data may be insufficient to estimate the probability distribution (curse of dimensionality).

17 Proximity-based outlier detection
DEF: outliers are objects far away from other objects.
Training: define a suitable distance function.
Test: compute the outlier score as the distance to the k-th nearest neighbor.
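A brute-force sketch of this score, assuming Euclidean distance and a small in-memory dataset; the function name and the toy points are illustrative:

```python
import math

def knn_outlier_scores(data, k):
    """Score of each point = Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, x in enumerate(data):
        # Distances to every other point; this is the O(N^2) step.
        dists = sorted(math.dist(x, y) for j, y in enumerate(data) if j != i)
        scores.append(dists[k - 1])
    return scores

# Four clustered points and one isolated point (illustrative values only).
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = knn_outlier_scores(data, k=2)
```

The all-pairs distance computation is what makes this approach expensive on massive datasets.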

18 Local Outlier Factor
The Local Outlier Factor (LOF) is the most common density-based detector.

19 LOF vs kNN
In this example, p1 is an outlier for both methods, while p2 and p3 are judged differently by the two.
[Figure labels: distance from p3 to its nearest neighbor; distance from p2 to its nearest neighbor; points p1, p2, p3.]
Starting from LOF, many other indicators can be built.
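The slides do not spell out the LOF formula, so the sketch below follows the commonly used definition (k-distance, reachability distance, local reachability density). It is a simplified illustration: ties are broken by index, duplicate points are assumed absent, and the grid data are invented for the example.

```python
import math

def k_nearest(i, data, k):
    """Indices of the k nearest neighbours of data[i] (ties broken by index)."""
    others = [j for j in range(len(data)) if j != i]
    return sorted(others, key=lambda j: math.dist(data[i], data[j]))[:k]

def lof_scores(data, k):
    n = len(data)
    neigh = [k_nearest(i, data, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [math.dist(data[i], data[neigh[i][-1]]) for i in range(n)]

    def reach(i, j):
        # Reachability distance of point i with respect to neighbour j.
        return max(k_dist[j], math.dist(data[i], data[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach(i, j) for j in neigh[i]) for i in range(n)]
    # LOF: mean density of the neighbours relative to the point's own density.
    return [sum(lrd[j] for j in neigh[i]) / (k * lrd[i]) for i in range(n)]

# A 3x3 unit grid plus one isolated point (illustrative values only).
data = [(float(x), float(y)) for x in range(3) for y in range(3)] + [(10.0, 10.0)]
lof = lof_scores(data, k=3)
```

Values close to 1 indicate a density similar to that of the neighbours; the isolated point gets a much larger value because its neighbours are far denser than it is.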

20 Density-based outlier detection
DEF: outliers are objects in regions of low density.
Training: define a suitable non-parametric density function.
Test: compute the outlier score as an absolute or relative measure of the local density.
The simplest density measure is the one used by DBSCAN: the number of objects inside a disk of fixed radius d.

21 Pros / Cons
Pros: it is easier to define a proximity or a density measure for a dataset than to evaluate its statistical distribution; good quantitative measure of the degree to which an object is an outlier; deals with any data distribution.
Cons: huge computational complexity of the test with massive datasets (a data structure like a k-d tree may help); the outlier score is highly sensitive to the choice of the parameters.

22 Cluster-based outlier detection
DEF: outliers are objects that do not belong strongly to any cluster.
Training: apply a clustering algorithm to partition the dataset into clusters.
Test: define the membership values of an object with respect to the clusters.

23 Clustering algorithms
The most common approaches (K-means and related ones) operate on a fixed number of clusters K.
Clusters are initialized randomly and updated iteratively by modifying the object-cluster association.
The value of K can be changed by split/merge operations on the final clusters.
Hierarchical approaches either start with K=1 and iterate splitting, or start with K=max_number_of_objects and iterate merging.

24 How many clusters?
There is no clear way to assess the best number.
[Figure panels: Too few? Just right? Too many?]

25 Pros / Cons
Pros: clustering techniques are well known; extends the outlier concept from single objects to groups.
Cons: requires thresholds for minimum cluster size and distance; very sensitive to the chosen number of clusters; convergence issues of the clustering algorithm; computationally intractable for large datasets.

26 Subspace outlier detection
DEF: data can be embedded into a lower-dimensional subspace in which normal instances and anomalies are very different.
Training: apply a projection algorithm to embed the data into a suitable subspace.
Testing: evaluate the reconstruction error (related to the distance between each data point and the subspace).

27 PCA
Principal Component Analysis (PCA) is commonly used to define possible subspaces of increasing dimension.
Being based on the data covariance matrix, it is optimal when this second-order statistic is appropriate (multivariate Gaussian data).
Outliers are data violating the covariance structure.
Other approaches (e.g. ICA) extend the method in order to define subspaces in a broader sense.
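A minimal sketch of reconstruction-error scoring with a one-dimensional PCA subspace. The leading principal axis is found by power iteration, a technique chosen here to keep the example dependency-free; the slides do not prescribe it. All names and data are illustrative.

```python
import math

def first_principal_axis(data, iters=100):
    """Mean and leading principal direction, via power iteration on X^T X."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    v = [1.0] * d   # start vector; assumed not orthogonal to the leading axis
    for _ in range(iters):
        # w = X^T (X v), computed without forming the covariance matrix.
        proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return mean, v

def reconstruction_error(x, mean, v):
    """Distance between x and its projection onto the line mean + t * v."""
    c = [xj - mj for xj, mj in zip(x, mean)]
    t = sum(cj * vj for cj, vj in zip(c, v))
    return math.sqrt(sum((cj - t * vj) ** 2 for cj, vj in zip(c, v)))

# Normal data lying on the line y = x (illustrative values only).
normal = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
mean, axis = first_principal_axis(normal)
err_on = reconstruction_error((3.0, 3.0), mean, axis)    # respects the structure
err_off = reconstruction_error((4.0, 0.0), mean, axis)   # violates the structure
```

A point that respects the covariance structure reconstructs almost perfectly, while one that violates it leaves a large residual, which serves as the anomaly score.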

28 Pros / Cons
Pros: well suited for unsupervised processing of high-dimensional datasets; a good standard data pre-processing step (not only for anomaly detection).
Cons: strong underlying hypothesis on data structure; computationally intensive; prone to numerical instability (ill-conditioning).

29 Evaluation
Some true labels are needed.
The global error is a very bad metric due to the high data imbalance (the trivial detector is very accurate!).
It is important to quantify separately two types of errors: false negatives (FN) and false positives (FP).
Many statistical decision indexes can be computed in order to quantify detection performance (precision, recall, ...).
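The remark about the trivial detector can be checked with a small worked example. The 1% anomaly rate and the helper name below are illustrative assumptions:

```python
def confusion(y_true, y_pred):
    """Counts (TP, FP, FN, TN), with 1 = anomaly and 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# 10 anomalies among 1000 records (illustrative imbalance).
y_true = [1] * 10 + [0] * 990
trivial = [0] * 1000               # detector that never raises an alarm

tp, fp, fn, tn = confusion(y_true, trivial)
accuracy = (tp + tn) / len(y_true)   # high despite detecting nothing
recall = tp / (tp + fn)              # fraction of anomalies actually found
```

Accuracy comes out at 99% while recall is zero, which is why FN and FP have to be reported separately.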

30 ROC curve
In order to compare different detection methods, it is common practice to plot the ROC curve.
X: False Positive Rate (% of false alarms).
Y: True Positive Rate (% of detections).
The area under the curve (AUC) quantifies the goodness of the method.
AUC is computed by trapezoids.
[Figure: ROC curves for different outlier detection techniques (detection rate vs. false alarm rate, with the AUC indicated).]
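A sketch of the trapezoid computation, assuming distinct scores (ties would need the tied points to be grouped into a single step). Function names, scores and labels are illustrative:

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by sweeping the threshold downwards."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under a piecewise-linear curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]   # anomaly scores (illustrative)
labels = [1, 1, 0, 1, 0, 0]               # 1 = true anomaly
auc = auc_trapezoid(roc_points(scores, labels))
```

The result equals the fraction of (anomaly, normal) pairs in which the anomaly receives the higher score, here 8 out of 9.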

31 Support Vector Machines
SVMs are a very good binary supervised classification scheme.
The basic algorithm finds a unique hyperplane separating two classes (with the largest margin).

32 Support vectors
The best hyperplane is defined by scalar products with a few data points (the support vectors).
If the dataset is not linearly separable, a unique plane can still be found (using slack variables).

33 Kernel trick
If a linear solution is not sufficient to achieve good classification performance, data can be transformed by the kernel trick.

34 One-class SVM
Unsupervised anomaly detection can be seen as a classification between one class (the normality) and the rest of the world.
An SVM can find a surface enclosing as much data as possible with the minimum volume.
Regularization ensures that the solution generalizes well, avoiding overfitting.

35 The one-class basic algorithm
Map input data to the feature space by the kernel.
Separate features from the origin by a hyperplane.
Penalize outliers with slack variables.
Look for a good tradeoff between accuracy and penalty.
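These four steps correspond to a single optimization problem. The formulation below is the widely used ν-parameterized one, which the slides do not state explicitly; Φ is the kernel feature map, ξ_i are the slack variables, ρ is the offset of the hyperplane from the origin, and ν ∈ (0, 1] controls the tradeoff (all symbols introduced here for illustration):

```latex
\min_{w,\,\xi,\,\rho}\quad \frac{1}{2}\lVert w\rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\qquad \text{s.t.}\qquad
\langle w, \Phi(x_i)\rangle \ge \rho - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n
```

A test point x is then declared normal when ⟨w, Φ(x)⟩ − ρ ≥ 0 and anomalous otherwise.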

36 Pros / Cons
Pros: suitable for high-dimensional data; flexibility of the kernel trick; fast execution by support vector selection; unique solution; optimal generalization capabilities.
Cons: complex tuning of many parameters in training; large memory requirements; very complex mathematical treatment.

37 Multi-Layer Perceptron
MLP is a neural network able to learn a mapping of the feature space into the output target.

38 Autoencoders
Autoencoders are unsupervised MLPs whose target output equals the input, trained by the backpropagation algorithm.
The bottleneck layer identifies the best subspace (like a sort of non-linear PCA).

39 Reconstruction error
Autoencoders identify anomalies by looking at the reconstruction error after an encoding-decoding process.

40 Pros / Cons
Pros: general and flexible approach; capability to handle complex data behaviour.
Cons: choice of many parameters during network design; very complex mathematical treatment; convergence issues of the training phase (overfitting).


41 Classification trees
The key idea is to split the feature space into a (large) number of smaller regions, each one having a defined label (in the supervised case).
The best set of splitting rules, learned automatically from data, can be represented by a tree.
Trees are useful for direct data interpretation, but their predictive accuracy in complex problems is poor.

42 Bagging
Bootstrap aggregation (bagging for short) is a method to increase the predictive accuracy of any statistical learning procedure.
If we had N independent observations L_1..L_N of the same unknown label L, each with a given confidence, the confidence of their mode would be higher.
We can generate multiple predictions by bootstrapping (sampling with replacement) the original dataset. In each bootstrapped dataset some data may appear more than once and some others not at all.
Each bootstrapped dataset leads to a single classification tree.
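The two ingredients of bagging, resampling with replacement and taking the mode of the predictions, can be sketched as follows. The helper names, the seed and the data are illustrative assumptions; training the individual trees is left out.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) items with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

def majority_vote(labels):
    """Mode of the labels predicted by the individual trees."""
    return Counter(labels).most_common(1)[0][0]

rng = random.Random(42)             # fixed seed, for reproducibility only
data = list(range(1000))            # stand-in for 1000 records
sample = bootstrap_sample(data, rng)

# Fraction of distinct original records that made it into the sample.
unique_fraction = len(set(sample)) / len(data)

vote = majority_vote(["normal", "anomaly", "normal"])
```

For large datasets the fraction of distinct records in a bootstrap sample approaches 1 − 1/e, about 63%, which is why some records appear more than once and others not at all.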

43 Random forests
Bagged trees often tend to be quite similar (bagging does not decorrelate the trees efficiently enough).
Random forests improve on bagging by adding, to the bootstrapping of data, a random selection of the features used in each split.
The forest trees are trained with a subset of data and a subset of features, and so they are much more independent than bagged trees.

44 Isolation trees
With unsupervised data, the RF scheme cannot be applied directly, but it is possible to generate random trees to isolate single data points.

45 Isolation forests
With isolation forests, the anomaly score is related to the average path length needed to reach a given point.
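A compact sketch of the idea. The score normalization follows the commonly cited isolation forest formulation, s = 2^(−E[h(x)] / c(n)), which the slides do not give explicitly; the tree representation, parameter values and data are illustrative assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, rng, depth, max_depth):
    """Split recursively on a random feature at a random value."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    j = rng.randrange(len(points[0]))
    lo = min(p[j] for p in points)
    hi = max(p[j] for p in points)
    if lo == hi:                      # simplification: stop instead of retrying
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[j] < split]
    right = [p for p in points if p[j] >= split]
    return {"feature": j, "split": split,
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}

def path_length(x, node, depth=0):
    """Depth at which x lands, plus a correction for unsplit leaves."""
    if "feature" not in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["feature"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)

def anomaly_score(x, trees, sample_size):
    mean_depth = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c(sample_size))

rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(256)]
sample_size = 64
max_depth = math.ceil(math.log2(sample_size))
trees = [build_tree(rng.sample(data, sample_size), rng, 0, max_depth)
         for _ in range(50)]
score_inlier = anomaly_score((0.0, 0.0), trees, sample_size)
score_outlier = anomaly_score((6.0, 6.0), trees, sample_size)
```

Points far from the bulk of the data are separated after few random splits, so their average path is short and their score is closer to 1.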

46 Pros / Cons
Pros: suitable for mixed-type features; minimal parametrization; simple to understand and to implement; very fast training and testing processes; no risk of overfitting.
Cons: (hard to find).

47 Computational issues
Whenever we cannot define a model, we have to retain the whole dataset to perform any computation (e.g. kNN search or density estimation).
This has seriously restricted the use of massive datasets in machine learning applications.
Many different (and complementary) approaches try to cope with this:
Data summarization (not every data point is needed).
Fast search methods (quick response to a NN query).
Online learning (learn a partial model and refine it).
Algebraic shortcuts (approximate matrix-vector products).

48 Data selection
Since the dataset D is a large matrix with columns (attributes) and rows (observations), we can summarize it by: reducing columns (feature selection, well known); reducing rows (data selection).
Data selection methods: random (fast and preserving the original statistics); quantization (forcing similar items to become identical); clustering (replacing single points with cluster centroids).
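Two of the row-reduction methods can be sketched in a few lines. This is only an illustration under simple assumptions (numerical features, a uniform grid for quantization); the function names, grid step and data are not from the slides.

```python
import random

def random_rows(data, m, seed=0):
    """Random selection: keep m rows chosen uniformly without replacement."""
    return random.Random(seed).sample(data, m)

def quantize_rows(data, step):
    """Quantization: map each row to a grid cell and count rows per cell."""
    counts = {}
    for row in data:
        cell = tuple(round(v / step) for v in row)   # integer cell index
        counts[cell] = counts.get(cell, 0) + 1
    return counts

# Two tight groups of points (illustrative values only).
data = [(0.01, 0.02), (0.03, 0.01), (1.02, 0.98), (0.99, 1.01)]
subset = random_rows(data, 2)
cells = quantize_rows(data, step=0.5)
```

After quantization the four rows collapse into two weighted cells, so later computations can run on the cells and their counts instead of the raw rows.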

49 k-d trees
A k-d tree is a partitioning data structure for organizing points in a k-dimensional space, useful for several applications, such as multidimensional search.
The k-d tree is a binary tree in which every node is a k-d point.
Every non-leaf node can be thought of as implicitly generating a splitting hyperplane that divides the space into two half-spaces: points to the left of this hyperplane are represented by the left subtree of that node; points to the right of the hyperplane are represented by the right subtree.

50 k-d tree construction
There are many ways to construct a k-d tree for a given dataset. The most common one ensures that:
the splitting axis cycles while moving down the tree: if the root has an x-aligned plane, its children both have y-aligned planes, and so on;
points are inserted by selecting the median of the points in the subtree, with respect to the axis in use.

51 Fast NN search
NN search can be done efficiently by using the tree properties to quickly eliminate large portions of the search space.
Complexity is reduced to O(log_2 N) per query with random data.
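The construction rule (cycling axes, median split) and the pruned search can be sketched as follows. The dictionary-based node layout and the sample points are illustrative choices, not taken from the slides:

```python
import math

def build_kdtree(points, depth=0):
    """Median split on an axis that cycles with the depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"point": pts[mid], "axis": axis,
            "left": build_kdtree(pts[:mid], depth + 1),
            "right": build_kdtree(pts[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    """Return (distance, point) of the nearest neighbour of target."""
    if node is None:
        return best
    d = math.dist(node["point"], target)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the
    # best distance found so far; otherwise the whole subtree is pruned.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
query = (9, 2)
dist, found = nearest(tree, query)
```

In this example the search visits only three of the six nodes: the whole left half of the tree is discarded because the root's splitting plane is farther from the query than the best candidate already found.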

52 Case study: BackBlaze
BackBlaze dataset: HDD failure detection in a datacenter based on SMART data (up to ~100 features).
When SMART data indicates a possible imminent drive failure, stored data can be copied to another device, preventing data loss.
Sample: fixed HDD maker/model (~4000 units alive).
Dataset: > 2 M records (daily sampling).
Failures: ~50.
Remaining Useful Life (RUL) must be estimated.


53 Problem-specific issues
S.M.A.R.T. data are very mixed and difficult to normalize (e.g. Reallocated Sectors Count; Read Error Rate; Current Pending Sector Count; Power-On Hours (POH); ...).
Only a small part of the failures appears to be predictable on physical grounds.

54 IF anomaly score
The IF score appears to be strongly related to the RUL.

55 Classification performance
When the IF anomaly score is used in a binary decision test for RUL < t days, results are promising: TPR = 90%, FPR = 0.6% for t = 1 day.
The IF scheme fits well with user needs.
However, even such a small FPR implies that many HDDs are replaced well before failure.
A cost tradeoff analysis is needed to define the best replacement strategy.
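A back-of-envelope calculation shows why a 0.6% FPR is still costly. It assumes, purely for illustration, that the test is applied once per daily record and that the FPR applies uniformly to all records of healthy drives; the figures are the approximate ones quoted in the case study.

```python
drives = 4000          # "~4000 units alive"
records = 2_000_000    # "> 2 M records", taken at its lower bound
failures = 50          # "~50" failures
tpr = 0.90             # TPR for t = 1 day
fpr = 0.006            # FPR for t = 1 day

# Assumption: one test per drive per day.
false_alarms_per_day = drives * fpr

# Assumption: every non-failure record can independently raise a false alarm.
false_alarms_total = (records - failures) * fpr
detected_failures = failures * tpr

# Share of raised alarms that correspond to a real imminent failure.
precision = detected_failures / (detected_failures + false_alarms_total)
```

Under these assumptions there are about 24 false alarms per day against 45 detected failures over the whole period, so fewer than 1% of the alarms precede an actual failure within a day. This is the imbalance that the cost tradeoff analysis has to weigh.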

56 Case study: C-MAPSS
C-MAPSS dataset: turbofan engine failure detection with a feature set computed by a NASA simulator (public domain).
The simulator generates a stress test for each unit, allowing the monitoring of engine behavior.
Sample: >500 experiments, time cycles.
Dataset: ~100K records (20 columns).
Failures: at the end of each experiment.
Data used for the PHM08 contest.


57 Problem-specific issues
The engine operates normally and then it develops a fault at some point during the experiment.
In the training set, the fault grows in magnitude until system failure.
In the test set, data are stopped some time prior to system failure, and the RUL should be estimated.
During each experiment the engine changes operational settings.

58 IF results
Train an IF and use the score as a health status index.

59 Conclusions
Anomaly detection is an important component of many predictive analytics systems.
Some emerging techniques allow robust detection even in very complex scenarios.
Computational issues with massive datasets can be addressed by proper data/feature selection and fast search methods.
A strong connection between domain experts and data scientists is key to finding the best data processing strategy.

60 Thank you!
Visit us at
