10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

1 Dejan Sarka Anomaly Detection Sponsors

2 About me
SQL Server MVP (17 years) and MCT (20 years). 25 years working with SQL Server. Authoring 16th book. Authoring many courses and articles.

Agenda
Introduction. Simple problems: one variable. Previous knowledge: supervised methods. Unsupervised methods: Principal Component Analysis, Support Vector Machines, Expectation Maximization Clustering. Evaluating clustering models.

3 Introduction
Anomaly detection = outlier analysis: finding rare, far-out-of-bounds cases.
Graphical methods.
Statistical methods: single variable distribution with descriptive statistics; regression analysis with analysis of residuals.
Data mining methods: supervised (directed) models, where existing outliers are flagged; undirected models, which sort data from most to least suspicious.

What Is an Outlier? (1)
About 0.95 of the distribution lies between the mean ± two standard deviations.
[Figure: Standard Normal Distribution, density curve]
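As a quick numeric check of the two-standard-deviation rule for a normal distribution, here is a minimal Python sketch. The function names `fraction_within` and `flag_outliers` and the sample values are illustrative assumptions, not part of the original slides.

```python
import math


def fraction_within(k: float) -> float:
    """Share of a normal distribution lying within mean +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))


def flag_outliers(values, k=2.0):
    """Return the values farther than k sample standard deviations from the mean.

    Assumes at least two values that are not all identical.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [v for v in values if abs(v - mean) > k * std]


# Hypothetical sample: nine ordinary readings and one extreme value.
sample = [10, 11, 9, 10, 12, 10, 11, 9, 10, 45]
print(round(fraction_within(2), 4))  # about 0.9545
print(flag_outliers(sample))         # [45]
```

The rule is only as good as the normality assumption behind it, which is why the slides that follow question it for skewed or heavy-tailed data.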

4 Skewness and Kurtosis
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable.
Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable.
Source of the graphs: Wikipedia

What Is an Outlier? (2)
Outliers must be at least two (three, four, ...) standard deviations from the mean, right?
Well, if the distribution is skewed, that is not true on one side. If it has long tails on both sides (high kurtosis), it is not true on both sides.
So, how far from the mean should we go? There is no way to get a definite answer.
Start from the other direction: how many rows can you inspect? One person can inspect ~5,000 rows per day.
Define outliers by the maximum number of rows to inspect.
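The two ideas on this slide can be sketched in a few lines of Python: computing skewness and kurtosis for one variable, and picking outliers by an inspection budget rather than a fixed number of standard deviations. This is a minimal sketch; the names `moments` and `most_suspicious` are hypothetical, and the moment-based (population) estimators are one common choice among several.

```python
import math


def moments(values):
    """Return mean, standard deviation, skewness and excess kurtosis.

    Uses simple population (biased) moment estimators; assumes the
    values are not all identical.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return mean, math.sqrt(m2), skewness, excess_kurtosis


def most_suspicious(values, budget):
    """Return the `budget` values farthest from the mean, most extreme first."""
    mean, std, _, _ = moments(values)
    ranked = sorted(values, key=lambda v: abs(v - mean) / std, reverse=True)
    return ranked[:budget]


# Hypothetical data with one extreme value on each side.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 45, -20]
print(moments(data))
print(most_suspicious(data, budget=2))  # [45, -20]
```

With a budget of about 5,000 rows per person per day, `budget` would simply be set to the number of rows the team can realistically review.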

5 Supervised (Directed) Models
Flagged outliers = previous knowledge. Explain the flag with input variables.
Multiple algorithms: Decision Trees, Naïve Bayes, Neural Networks, Logistic Regression, and more.

Unsupervised (Undirected) Models
No previous knowledge, or partial usage only. No flag exists. Use the flag to detect patterns on regular data only.
Limited set of general algorithms: Principal Component Analysis, Support Vector Machines, Expectation Maximization Clustering.
Some specialized algorithms, e.g., for analyzing time series data.

6 Principal Component Analysis
Principal component analysis (PCA) is a technique used to emphasize the majority of the variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize. It is closely connected with eigenvectors and eigenvalues. Variables form a multidimensional space, or matrix, of dimensionality m.
[Figure: Principal Component Analysis; axes V1 and V2, label Var]
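To make the link between PCA and eigenvectors and eigenvalues concrete, here is a minimal sketch for the two-variable case (V1, V2). It uses the closed-form eigen-decomposition of a 2x2 covariance matrix, so it needs only the standard library. The function name `pca_2d` and the sample points are assumptions for illustration; a real analysis with m variables would use a linear algebra library.

```python
import math


def pca_2d(points):
    """PCA for two variables via the 2x2 sample covariance matrix.

    Returns ((eigenvalue1, direction1), (eigenvalue2, direction2)),
    with the first component carrying the larger share of the variance.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + spread, half_trace - spread

    # Angle of the principal axis (the eigenvector of the larger eigenvalue).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))
    return (lam1, pc1), (lam2, pc2)


# Hypothetical points scattered around the line V2 = V1.
pts = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.8), (4, 4.1)]
(lam1, pc1), (lam2, pc2) = pca_2d(pts)
print("share of variance on PC1:", lam1 / (lam1 + lam2))
print("PC1 direction:", pc1)
```

The eigenvalues are the variances along PC1 and PC2, so their ratio shows how much of the variation the first component captures.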

7 [Figure: Principal Component Analysis; axes V1 and V2, components PC1 and PC2, label Var]

Support Vector Machines
A support vector machine constructs a hyperplane or set of hyperplanes in a high-dimensional space, which can be used for classification, regression, or other tasks. Discrete linear classifier. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin). The larger the margin, the lower the generalization error.

8 Support Vector Machines

Clustering
Grouping cases into clusters. Objects within a cluster have a high similarity based on attribute values. The class label of each object is not known.
Several techniques: partitioning methods, hierarchical methods, density-based methods, model-based methods.

9 K-Means
Divides the dataset into a predetermined number (k) of clusters around the average location (mean). The algorithm comes from geometry: imagine a record space with attributes as dimensions. Each record (case) is uniquely located in multidimensional space by the values of its attributes (variables).

K-Means
K-Means initially randomly selects k means (centroids). Because it does not know the clusters yet, these must be fictitious cases.
It assigns each record to the nearest centroid. These are the initial clusters.
It calculates new centroids of the clusters.
It reassigns each record to the nearest centroid. Some records jump from cluster to cluster.
It iterates the last two steps until cluster boundaries stop changing.
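The K-Means steps described above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `seed` parameter, and the sample points are assumptions, and for simplicity the initial centroids are drawn from the existing records rather than generated as fictitious cases.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster tuples of numbers into k groups; returns (centroids, assignment)."""
    rng = random.Random(seed)
    # Simplification: start from k randomly chosen records.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each record to the nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no record changed cluster, so the boundaries are stable
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Hypothetical two-dimensional records forming two well-separated groups.
records = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(records, k=2)
print(centers)
print(labels)
```

For anomaly detection, a natural follow-up is to rank records by their distance to the assigned centroid and inspect the farthest ones first.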

10 K-Means

Hard vs. Soft Clustering
With the K-Means algorithm, each object is assigned to exactly one cluster. This is hard clustering.
Instead of distance, you can use a probabilistic measure to determine cluster membership. Cover the objects with bell curves for each dimension, each with a specific mean and standard deviation. A case is assigned to every cluster with a certain probability. Because clusters can overlap, this is soft clustering.
The Expectation-Maximization (EM) method changes bell curve parameters to improve the covering in each iteration.
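Soft clustering with EM can be illustrated for a single dimension with the sketch below. It fits a mixture of bell curves (Gaussians) and returns, for every case, the probability of belonging to each cluster. The names `normal_pdf` and `em_1d`, the starting parameters, the fixed iteration count, and the small floor on the standard deviation are all assumptions made to keep the example short.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def em_1d(values, means, stds, weights, iterations=50):
    """EM for a one-dimensional Gaussian mixture.

    Returns updated means, stds, weights, and the membership
    probabilities (one row per value, one column per cluster).
    """
    k = len(means)
    means, stds, weights = list(means), list(stds), list(weights)
    resp = []
    for _ in range(iterations):
        # E-step: probability of each cluster for each case.
        resp = []
        for x in values:
            dens = [weights[j] * normal_pdf(x, means[j], stds[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: refit each bell curve using the memberships as weights.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            stds[j] = max(math.sqrt(var), 1e-3)  # floor keeps a curve from collapsing
            weights[j] = nj / len(values)
    return means, stds, weights, resp


# Hypothetical values forming two groups, around 1 and around 5.
vals = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
m, s, w, memberships = em_1d(vals, means=[0.0, 6.0], stds=[1.0, 1.0], weights=[0.5, 0.5])
print(m, s, w)
print(memberships[0])
```

Unlike the hard assignment in K-Means, each row of `memberships` sums to 1 and can split a case between overlapping clusters.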

11 Expectation-Maximization
[Figure: Expectation-Maximization illustration]

Clustering (2)
How many clusters fit the input data best? The better the model fits the data, the more likely it is that outliers really are exceptions or errors.
Create multiple models and evaluate them. There is no good built-in evaluation method.
Use the average entropy inside clusters to find the model with the best fit (with a potential correction factor). Of course, check the standard deviation of the entropy as well.
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$$
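The entropy formula translates directly into code. The sketch below computes Shannon entropy in bits and, as one possible reading of "average entropy", averages it over the cluster membership probabilities of each case. That reading, the names `entropy` and `average_entropy`, and the sample memberships are assumptions; the slides do not spell out the exact procedure or the correction factor.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero terms."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def average_entropy(memberships):
    """Mean entropy over rows of cluster membership probabilities."""
    return sum(entropy(row) for row in memberships) / len(memberships)


# Hypothetical membership probabilities from two candidate models.
crisp_model = [[0.99, 0.01], [0.02, 0.98], [0.97, 0.03]]
fuzzy_model = [[0.60, 0.40], [0.50, 0.50], [0.55, 0.45]]
print(average_entropy(crisp_model))
print(average_entropy(fuzzy_model))
```

Lower average entropy means cases are assigned to clusters more decisively, which under this reading indicates the better-fitting model.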

12 Q & A
Come to SQL Saturday Ljubljana, December 9th, 2017!
Sponsors
