Authors: Coman Gentiana. Asparuh Hristov. Daniel Corteso. Fernando Nunez


OUTLIER DETECTOR DOCUMENTATION
VERSION 1.0

Authors: Coman Gentiana, Asparuh Hristov, Daniel Corteso, Fernando Nunez

Copyright Team 6, 2011

Contents

1. Introduction
2. Global variables used
3. Scientific explanation of algorithms
3.1 Standard deviation and standard normal distribution
3.2 Median and quartiles
3.3 Score function
3.4 K-th nearest neighbor
3.5 Local Outlier Factor
3.6 Local Correlation Integral
4. Excel built-in functions used
5. Future implementations
6. References
7. Notes

1. Introduction

This document describes the Outlier Detector Excel add-on. It is aimed at users with basic knowledge of Excel and Visual Basic for Applications, and at future developers of the application. The document is structured as follows:

Chapter 2 - Global variables used
Explanation of the VBA variables that were used in the application.

Chapter 3 - Scientific explanation of algorithms
The mathematical reasoning behind the algorithms used to detect outliers, together with the advantages and disadvantages of each approach.

Chapter 4 - Excel built-in functions used
A list of the built-in Microsoft Excel functions used in the application.

The application was developed and tested entirely in Microsoft Excel and Visual Basic. The tool is also available in the Outlier Detector Tool.xls, but might produce other types of results. We recommend using it with Microsoft Office.

2. Global variables used

In order to safely store the input and output related variables, we have used global variables which are shared by all the implemented functions. These are:

outputbook As Workbook
Variable used to store the reference to the output workbook selected by the user.

outputsheet As Worksheet
Variable used to store the reference to the output worksheet, from the output workbook, as selected by the user.

inputbook As Workbook
Variable used to refer to the workbook containing the input data set.

inputsheet As Worksheet
Variable used to refer to the worksheet containing the input data set, from the input workbook.

DTBook As Workbook
Variable used to store the reference to the workbook containing the distance tables, if the user wishes to skip their creation.

inputrange As Range
Variable used to store the reference to the input range selected from the input worksheet.

fdialog As Office.FileDialog
Global file dialog object which enables the user to browse for files.

algorithm As Integer
Variable used to store the value of the algorithm selected in the first tab. The variable can take the following values:
1. Standard deviation algorithm
2. Median and quartiles algorithm
3. Score function algorithm
4. K-th nearest neighbor algorithm
5. Local Outlier Factor algorithm
6. Local Correlation Integral algorithm

3. Scientific explanation of algorithms

This section explains the algorithms used. For each algorithm there is an introductory note, the mathematical reasoning, a section with advantages and a section with disadvantages.

3.1 Standard deviation and standard normal distribution

Introduction

The standard normal distribution is a special case of the normal distribution.

It is a normal distribution with zero mean and unit variance, given by the probability density function and distribution function

$$\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \qquad \Phi(z) = \int_{-\infty}^{z} \varphi(t)\, dt$$

over the domain $z \in (-\infty, \infty)$.

The normal random variable of a standard normal distribution is called a standard score or a z-score. Every normal random variable X can be transformed into a z-score via the following equation:

$$z = \frac{X - \mu}{\sigma}$$

where X is a normal random variable, μ is the mean of X, and σ is the standard deviation of X.

Applying the formula will always produce a transformed distribution with a mean of zero and a standard deviation of one. However, the shape of the distribution will not be affected by the transformation. If X is not normal then the transformed distribution will not be normal either.

One reason the normal distribution is important is that it is easy for mathematical statisticians to work with. This means that many kinds of statistical tests can be derived for normal distributions. The approach that was taken in the Outlier Detector application is explained in the following paragraphs.

Mathematical Reasoning

Using the values of X from the input dataset, the mean and the standard deviation are computed using the following formulas:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

For each X value, its z-score is calculated. If the score is greater than 0, we take into consideration the values that lie between z and +∞. To find the probability that a Z value lies between the z-score z of value X and +∞, we calculate the following factor:

Nz = P(Z > z)

This is the probability that a standard normal random variable (Z) is greater than a given value (z), which is 1 - P(Z < z). P(Z < z) represents the cumulative probability associated with a z-score and is given by the distribution function mentioned in the introduction.

For z-scores greater than 0 we use the normal distribution function implemented in Microsoft Excel, subtracted from 1, while for z-scores lower than or equal to 0 we use only the value returned by that function. Therefore:

If z > 0: Nz = 1 - NormSDist(z)
If z <= 0: Nz = NormSDist(z)

where NormSDist is the standard normal distribution function from Microsoft Excel.

The resulting value Nz indicates what fraction of points is expected to lie farther from the average than the point X under consideration. If this value is lower than 0.05, meaning 5%, X is a potential outlier, because it lies far from the average and fewer than 5% of the other points are situated farther away than it. This also indicates that the point is unlikely to belong to a standard normal distribution, which makes it an outlier according to this algorithm.

Advantages

The advantage of the standard normal distribution is that we can compare different normal distributions with different means and variances after transforming them into standard normal distributions. A precomputed probability table also exists for the standard normal distribution, so it is easy to find the probability for a given z-score.

In the context of detecting outliers, it is useful to see whether the given dataset follows a normal distribution, because such a distribution has the property that about 95% of its values fall between -2 and +2 standard deviations from the average. If the dataset follows such a distribution, then most values are close to each other, which makes it likely that they are correct data. On the other hand, if the dataset does not conform to the distribution, there is a strong possibility that outliers exist among the values and that they are detectable.
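The z-score test described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the add-on's VBA code: the function names are hypothetical, `norm_s_dist` stands in for Excel's NormSDist, and the sample standard deviation is assumed.

```python
import math
from statistics import mean, stdev


def norm_s_dist(z):
    """Standard normal cumulative distribution (stand-in for Excel's NormSDist)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tail_probability(z):
    """Nz as defined above: 1 - NormSDist(z) for z > 0, NormSDist(z) otherwise."""
    return 1.0 - norm_s_dist(z) if z > 0 else norm_s_dist(z)


def z_score_outliers(values, threshold=0.05):
    """Return the values whose tail probability Nz is below the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation
    return [x for x in values if tail_probability((x - mu) / sigma) < threshold]


data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(z_score_outliers(data))  # only the value 50 is flagged
```

The sketch assumes the dataset is not constant; a zero standard deviation would make the z-score undefined.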

Disadvantages

The most important disadvantage of this approach is that it can only be used with one-dimensional datasets, because of the average and standard deviation computations. It cannot detect density-based anomalies and relies only on the standard normal distribution and its probability density function.

3.2 Median and quartiles

Introduction

The algorithm uses the box plot method. A box plot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis. Box plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared. They are helpful for indicating whether a distribution is skewed and whether there are any unusual observations (outliers) in the data set.

The statistical terms used are:

Median: The median of a dataset is the value for which the number of values greater than it equals the number of values less than it. When the dataset is sorted, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

Quartiles: Quartiles, by definition, separate a quarter of the data points from the rest. Roughly, the first quartile is the value under which 25% of the data lie and the third quartile is the value over which 25% of the data are found (so the second quartile is the median itself).

First Quartile, Q1: Following the definitions above, the first quartile is the median of the lower half of the data. If the number of data points is odd, the lower half includes the median.

Third Quartile, Q3: The third quartile is the median of the upper half of the data. If the number of data points is odd, the upper half includes the median.


As the box plot indicates:
The median divides the data into a lower half and an upper half.
The lower quartile is the middle value of the lower half.
The upper quartile is the middle value of the upper half.

Mathematical Reasoning

The Outlier Detector application uses the box plot method of computing lower and upper inner fences, which are derived from the interquartile range. The median and the first and third quartiles are generated using the Microsoft Excel specific functions. From the two quartiles, the interquartile range is determined using the following formula:

IQ = Q3 - Q1

where IQ stands for interquartile range, a useful measure of the amount of variation in a set of data. Using the value of the interquartile range, the lower and upper inner fences are computed:

Lower Inner Fence: LI = Q1 - 1.5*IQ
Upper Inner Fence: UI = Q3 + 1.5*IQ

The values situated between the two fences are considered to be normal, whereas the values outside of the fences are seen as outliers.
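As an illustration of the fence computation, here is a minimal Python sketch. It is not the add-on's VBA code: the function names are hypothetical, and the "inclusive" quantile method of Python's statistics module is assumed to mirror the interpolation of Excel's Quartile function.

```python
from statistics import quantiles


def inner_fences(values):
    """Return (LI, UI) = (Q1 - 1.5*IQ, Q3 + 1.5*IQ)."""
    # quantiles() with n=4 returns the three cut points Q1, median, Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iq = q3 - q1
    return q1 - 1.5 * iq, q3 + 1.5 * iq


def fence_outliers(values):
    """Values lying outside the inner fences are reported as outliers."""
    lower, upper = inner_fences(values)
    return [x for x in values if x < lower or x > upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(inner_fences(data))    # (-3.5, 14.5)
print(fence_outliers(data))  # [100]
```

For this dataset Q1 = 3.25 and Q3 = 7.75, so IQ = 4.5 and the fences lie 6.75 beyond each quartile.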

The method implemented here is a variation of the box plot, because the box plot also takes into consideration outer fences situated at a distance of three times the interquartile range from each quartile, and makes a distinction between strong and weak outliers. Only the inner fences were chosen as thresholds because this distinction is not considered very important, and because the outer fences are very restrictive and might not prove useful for outlier detection in some types of datasets.

Advantages

One advantage of this approach is that, in comparison with the average, the median is not affected by the extreme values in the dataset. The interquartile range ignores the extreme values, whereas the inner fences provide a reasonable range for normal data points.

Disadvantages

A disadvantage of this approach is that it tends to emphasize the tails of a distribution, which are the least certain data points in a set. It also hides many details of the distribution and can only be applied to one-dimensional data sets. At the same time, clustering of data points cannot be taken into consideration.

3.3 Score function

Introduction

Scoring techniques assign an anomaly score to each instance in the test data depending on the degree to which that instance is considered an anomaly. In the case of the Outlier Detector, the score is the number of standard deviations by which the value lies from the average.

The standard deviation is a statistic used as a measure of the dispersion or variation in a distribution, equal to the square root of the arithmetic mean of the squares of the deviations from the arithmetic mean. An important attribute of the standard deviation as a measure of spread is that if the mean and standard deviation of a normal distribution are known, it is possible to compute the percentile rank associated with any given score. In a normal distribution, about 68% of the scores are within one standard deviation of the mean and about 95% of the scores are within two standard deviations of the mean. The standard deviation has proven to be an extremely useful measure of spread in part because it is mathematically tractable. Its formula is:

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

Note that the Excel StDev function listed in Chapter 4 computes the sample standard deviation, which divides by N - 1 instead of N.

Mathematical Reasoning

The score function implemented in the Outlier Detector relies on the standard deviation and on the distance of each value from the average. The first step is calculating the average and standard deviation of the dataset, after which the distance to the average is computed for each value. The average and standard deviation are computed using the Microsoft Excel specific built-in functions.

The score for each value is computed by dividing its distance to the average by the standard deviation and taking the absolute value:

$$\text{score}(x) = \frac{|x - \mu|}{\sigma}$$

Therefore the scores are non-negative numbers. The score indicates how far each value lies from the average, in number of standard deviations. If a value lies very far away from the average, its score will be relatively high, possibly indicating an outlier.

Determining the outliers can be performed in two manners:
Taking the maximum score in the dataset and setting the threshold at half of this maximum score.
Taking a minimum score, given as input, as the threshold.
Any score above the threshold is considered an outlier.

Advantages

An advantage of the standard deviation, in comparison with other statistical indicators such as the range, is that it makes use of all the data to calculate the spread from the average. The range, for example, only uses two data items, the largest and the smallest value, so the standard deviation is a more accurate measure.

The standard deviation gives weight to the deviation of the data from the mean by squaring it: the greater the deviation, the greater the weight after the squaring. This is a better indicator for values which lie far away from the average, because squaring the difference makes the value more noticeable.

Disadvantages

The standard deviation, as well as the average, is strongly influenced by the extreme values of a dataset. At the same time it can only be applied to one-dimensional data sets.
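The two thresholding manners of the score function can be sketched as follows. This is a hedged Python illustration rather than the add-on's VBA code; the function names and the optional `min_score` parameter are hypothetical, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev


def scores(values):
    """Score of each value: its distance to the average in standard deviations."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation
    return [abs(x - mu) / sigma for x in values]


def score_outliers(values, min_score=None):
    """Flag values whose score exceeds the threshold.

    With min_score=None the threshold is half of the maximum score;
    otherwise the given minimum score is used as the threshold.
    """
    s = scores(values)
    threshold = max(s) / 2 if min_score is None else min_score
    return [x for x, score in zip(values, s) if score > threshold]


data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(score_outliers(data))                 # [50]
print(score_outliers(data, min_score=0.4))  # [9, 9, 50]
```

The second call shows how sensitive the result is to the chosen minimum score: the extreme value 50 inflates the average, so even the ordinary values 9 end up slightly above a threshold of 0.4.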

3.4 K-th nearest neighbor

Introduction

The nearest neighbor method can be used to detect outliers and can achieve very good results. As the name suggests, it determines the anomalies by comparing the distance of each data instance to its k-th nearest neighbor. Therefore one needs to define how the distance is computed. The metric used has to be positive-definite and symmetric, but it is not required to satisfy the triangle inequality. There are many variations of the algorithm, aiming at better and/or faster detection, but all of them fall broadly into one of two categories:

1) Techniques that use the distance of a data instance to its k-th nearest neighbor as the anomaly score.
2) Techniques that compute the relative density of each data instance to compute its anomaly score.

We have implemented both variations. The simple k-th nearest neighbor algorithm discussed in this section is of the former type.

Mathematical Reasoning

First of all we need to define a metric before computing the actual distances. The algorithm normally works under the assumption that normal data instances occur in dense neighborhoods, while anomalies occur far from their closest neighbors. However, driven by the concrete topic of our project, we adapted the algorithm to the specific theme as much as possible, so that it produces better and more meaningful results. Therefore, we have made more assumptions.

Before discussing them, note that the project is "Outlier detection in health insurance". We were given a data set with patients and their spending on different kinds of medicines. In our opinion the metric should take the squared differences between two patients for each medicine. In this way larger differences become more important. We assume that the health insurance company is more interested in a patient who has spent 20 euro more on a given medicine than in one who has spent 1 euro more on each of 20 different medicines (20^2 > 1^2*20).

The next slight modification of the algorithm is that if a given patient hasn't spent any money on a particular medicine, then no distance is added for this category. This assumption was made by taking into account that a patient either needs a given medicine or not. He either spends some normal amount of money on it, or he doesn't spend any money at all. If he hasn't purchased the medicine, he is of no interest to the insurance company for that category and no additional score should be added.

In other words, the distance between two patients x and y with values x_i and y_i in attribute i is:

$$d(x, y) = \sum_{i \,:\, x_i \neq 0} (x_i - y_i)^2$$

Now we can compute and store the distances. Note that the method stores information only for the first 150 neighbors, for storage reasons. The algorithm initializes two tables: one with the sorted distances and one with the corresponding number (id) of each neighbor. The rows represent the patients and the columns stand for the neighbors, because the other way round we would have reached the Excel column limit with some of the input files. For example, the value in the i-th row and j-th column of one table is the distance between the i-th patient and its j-th nearest neighbor, and the same cell of the other table holds the number of this j-th nearest neighbor.

Once the tables are computed, the algorithm sorts the patients by the distance to their k-th neighbor, and the ones with the biggest distances are the outliers. One additional problem is that the user has to choose the number of outliers to detect. Therefore we have implemented some graphical output options for this issue, helping the user decide what the best number should be.

Advantages

1) It performs very well in terms of detected anomalies, since the likelihood that an anomaly forms a close neighborhood in the training data set is very low.
2) Adapting nearest neighbor based techniques to a different data type is straightforward and can lead to very good results.

Disadvantages

1) It doesn't take into account that different clusters may have different densities, and therefore normal instances can sometimes be detected as outliers.
2) Initializing the tables is very time-consuming.
3) If the normal instances in the data do not have enough neighbors with similar normal attributes, the false positive rate of such techniques is high.
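The distance and the k-th neighbor ranking described in this section can be sketched in Python as below. This is a simplified illustration, not the add-on's VBA code: the function names are hypothetical, distances are recomputed instead of being stored in worksheet tables, and the 150-neighbor limit is not modeled.

```python
def distance(x, y):
    """Sum of squared differences over the attributes where x is non-zero."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y) if xi != 0)


def kth_neighbor_distances(patients, k):
    """For each patient, the distance to its k-th nearest neighbor."""
    result = []
    for i, x in enumerate(patients):
        dists = sorted(distance(x, y) for j, y in enumerate(patients) if j != i)
        result.append(dists[k - 1])
    return result


def top_outliers(patients, k, n_outliers):
    """Indices of the n_outliers patients with the largest k-th neighbor distance."""
    kth = kth_neighbor_distances(patients, k)
    ranked = sorted(range(len(patients)), key=lambda i: kth[i], reverse=True)
    return ranked[:n_outliers]


# Hypothetical spending of five patients on three medicines.
patients = [[10, 0, 5], [11, 0, 5], [10, 1, 6], [12, 0, 4], [60, 0, 5]]
print(kth_neighbor_distances(patients, k=2))   # [1, 2, 3, 5, 2401]
print(top_outliers(patients, k=2, n_outliers=1))  # [4]
```

The squared differences reproduce the weighting argument from the text: `distance([20, 0], [0, 0])` is 400, whereas a difference of 1 euro on each of 20 medicines only adds up to 20.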


3.5. Local Outlier Factor

Introduction

As indicated by the title, the local outlier factor is based on a concept of local density, where locality is given by the k nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be outliers.

The local density is estimated by the typical distance at which a point can be "reached" from its neighbors. The definition of "reachability distance" used in LOF is an additional measure to produce more stable results within clusters.

Basic idea of LOF: comparing the local density of a point with the densities of its neighbors.

Mathematical Reasoning

Let k-distance(A) be the distance of the object A to its k-th nearest neighbor. Note that the set of the k nearest neighbors includes all objects at this distance, which in the case of a "tie" can be more than k objects. We denote the set of k nearest neighbors as Nk(A).

[Figure: Illustration of the reachability distance. Objects B and C have the same reachability distance (k=3), while D is not a k nearest neighbor.]

This distance is used to define what is called the reachability distance:

$$\text{reachability-distance}_k(A, B) = \max\{\text{k-distance}(B),\ d(A, B)\}$$

In words, the reachability distance of an object A from B is the true distance between the two objects, but at least the k-distance of B. Objects that belong to the k nearest neighbors of B are considered to be equally distant. The reason for this distance is to get more stable results. Note that this is not a distance in the mathematical sense, since it is not symmetric.

The local reachability density of an object A is defined by

$$\mathrm{lrd}_k(A) = 1 \Big/ \left( \frac{\sum_{B \in N_k(A)} \text{reachability-distance}_k(A, B)}{|N_k(A)|} \right)$$

which is the inverse of the average reachability distance of the object A from its neighbors. Note that it is not the average reachability of the neighbors from A (which by definition would be k-distance(A)), but the distance at which A can be "reached" from its neighbors.

The local reachability densities are then compared with those of the neighbors using

$$\mathrm{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \mathrm{lrd}_k(B)}{|N_k(A)| \cdot \mathrm{lrd}_k(A)}$$

which is the average local reachability density of the neighbors divided by the object's own local reachability density. A value of approximately 1 indicates that the object is comparable to its neighbors (and thus not an outlier). A value below 1 indicates a denser region (which would be an inlier), while values significantly larger than 1 indicate outliers.

Advantages

Due to the local approach, LOF is able to identify outliers in a data set that would not be outliers in another area of the data set. For example, a point at a "small" distance to a very dense cluster is an outlier, while a point within a sparse cluster might exhibit similar distances to its neighbors.

While the geometric intuition of LOF is only applicable to low-dimensional vector spaces, the algorithm can be applied in any context in which a dissimilarity function can be defined. It has experimentally been shown to work very well in numerous setups, often outperforming the competitors, for example in network intrusion detection.

Disadvantages

The resulting values are quotient values and hard to interpret. A value of 1 or even less indicates a clear inlier, but there is no clear rule for when a point is an outlier. In one data set, a value of 1.1 may already be an outlier; in another dataset and parameterization (with strong local fluctuations) a value of 2 could still be an inlier. These differences can also occur within a dataset due to the locality of the method.
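The LOF definitions above translate fairly directly into code. The following Python sketch is an illustration under stated assumptions, not the add-on's VBA implementation: it uses the Euclidean distance (`math.dist`) instead of the patient metric from section 3.4, the function name `lof_scores` is hypothetical, and it assumes there are no groups of more than k identical points (which would make a reachability sum zero).

```python
import math


def lof_scores(points, k, dist=math.dist):
    """Local Outlier Factor of every point, following the definitions above."""
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]

    k_distance = []
    neighbors = []
    for i in range(n):
        others = sorted((d[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_distance.append(kd)
        # N_k includes every object at or below the k-distance (ties included).
        neighbors.append([j for dij, j in others if dij <= kd])

    def reach_dist(a, b):
        # reachability-distance_k(A, B) = max{k-distance(B), d(A, B)}
        return max(k_distance[b], d[a][b])

    # Local reachability density: inverse of the average reachability distance.
    lrd = [
        len(neighbors[a]) / sum(reach_dist(a, b) for b in neighbors[a])
        for a in range(n)
    ]
    # LOF: average lrd of the neighbors divided by the object's own lrd.
    return [
        sum(lrd[b] for b in neighbors[a]) / (len(neighbors[a]) * lrd[a])
        for a in range(n)
    ]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
for point, score in zip(points, lof_scores(points, k=2)):
    print(point, round(score, 2))
```

In this example the four corners of the unit square get a LOF of 1, while the isolated point (10, 10) gets a value far above 1.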

3.6. Local Correlation Integral

Introduction

In this section, we describe the LOCI (LOcal Correlation Integral) algorithm for detecting outliers. This algorithm computes exact MDEF and σMDEF values for all objects, and then reports an outlier whenever MDEF is more than three times larger than σMDEF for the same radius. Thus the key to a fast algorithm is an efficient computation of MDEF and σMDEF values.

Mathematical Reasoning

MDEF

The multi-granularity deviation factor (MDEF) can cope with local density variations in the feature space and detect both isolated outliers and outlying clusters. The LOCI algorithm uses an exact computation of MDEF values. The MDEF is associated with the correlation integral; it is an aggregate measure.

Let the r-neighborhood of an object pi be the set of objects within distance r of pi. Intuitively, the MDEF at radius r for a point pi is the relative deviation of its local neighborhood density from the average local neighborhood density in its r-neighborhood. Thus, an object whose neighborhood density matches the average local neighborhood density will have an MDEF of 0. In contrast, outliers will have MDEFs far from 0.

Let n(pi, αr) be the number of objects in the αr-neighborhood of pi. Let n^(pi, r, α) be the average, over all objects p in the r-neighborhood of pi, of n(p, αr) (see Figure below). The use of two radii serves to decouple the neighbor size radius αr from the radius r over which we are averaging. The local correlation integral is the function n^(pi, r, α) over all r.

Definition MDEF: For any pi, r and α, we define the multi-granularity deviation factor (MDEF) at radius (or scale) r as:

$$\mathrm{MDEF}(p_i, r, \alpha) = 1 - \frac{n(p_i, \alpha r)}{\hat{n}(p_i, r, \alpha)}$$

For faster computation of MDEF, we will sometimes estimate both n(pi, αr) and n^(pi, r, α). This leads to the following definitions:

Definition Counting and sampling neighborhood: The counting neighborhood (or αr-neighborhood) is the neighborhood of radius αr, over which each n(p, αr) is estimated. The sampling neighborhood (or r-neighborhood) is the neighborhood of radius r, over which we collect samples of n(p, αr) in order to estimate n^(pi, r, α).

σMDEF(pi, r, α) is the normalized standard deviation of n(p, αr) over the sampling neighborhood:

$$\sigma_{\mathrm{MDEF}}(p_i, r, \alpha) = \frac{\sigma_{\hat{n}}(p_i, r, \alpha)}{\hat{n}(p_i, r, \alpha)}$$

where σn^(pi, r, α) is the standard deviation of n(p, αr) over all objects p in the r-neighborhood of pi.

Given the above definition of MDEF, we still have to make a number of decisions. In particular, we need to answer the following questions:

1. Sampling neighborhood: Which points constitute the sampling neighborhood of pi, or, in other words, which points do we average over to compute n^ (and, in turn, MDEF) for a pi in question?

For each point and counting radius, the sampling neighborhood is selected to be large enough to contain enough samples. We choose α = 1/2 in all exact computations.

2. Scale: Regardless of the choice of neighborhood, over what range of distances do we compare n and n^?

This leads to the following definition, where N is the number of objects and NN(pi, m) is the m-th nearest neighbor of pi.

Definition Critical Distance: For 1 ≤ m ≤ N, we call d(NN(pi, m), pi) a critical distance of pi and d(NN(pi, m), pi)/α an α-critical distance of pi.

We need only consider radii that are critical or α-critical. In a pre-processing pass, we determine the critical and α-critical distances Di for each object pi.

In our algorithm the scale [rmin, rmax] depends on nmin (the minimal number of neighbors of a point) and nmax (the maximal number of neighbors of a point). This leads to rmin = d(NN(pi, nmin), pi) and rmax = d(NN(pi, nmax), pi).

3. Flagging: After computing the MDEF values (over a certain range of distances), how do we use them to choose the outliers?

Considering each object pi in turn, and considering increasing radius r from Di, we maintain n(pi, αr), n^(pi, r, α), MDEF(pi, r, α), and σMDEF(pi, r, α). A point is flagged as an outlier if, for any r,

$$\mathrm{MDEF}(p_i, r, \alpha) > k_\sigma \, \sigma_{\mathrm{MDEF}}(p_i, r, \alpha)$$

In the algorithm, we use kσ = 3. The method selects a point as an outlier if its MDEF value deviates significantly (more than three standard deviations) from the local averages.

Advantages

Like the state of the art, it can detect outliers and groups of outliers (or micro-clusters). It also includes several of the previous methods (or slight variants thereof) as a special case.

Going beyond any previous method, it proposes an automatic, data-dictated cut-off to determine whether a point is an outlier. In contrast, previous methods let the users decide, providing them with no hints as to what cut-off is suitable for each dataset.

This method successfully deals with both local density and multiple granularities.

The exact LOCI method can be computed as quickly as previous methods (but not always).

Extensive experiments on synthetic and real data show that LOCI can automatically detect outliers and micro-clusters, without user-required cut-offs, and that it quickly spots outliers, expected and unexpected.

Disadvantages

The only disadvantage is that for large datasets the computations are complex and take a lot of time to process, even with the distance tables already computed.
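The MDEF flagging rule can be illustrated with a small, unoptimized Python sketch. It is a simplification under stated assumptions and not the add-on's VBA implementation: the function name `loci_outliers` and its parameters are hypothetical, the Euclidean distance is used, only the critical distances between the n_min-th and n_max-th neighbor are tested as radii (α-critical distances are left out), and every object counts itself in its own neighborhood.

```python
import math


def loci_outliers(points, alpha=0.5, k_sigma=3.0, n_min=2, n_max=None):
    """Indices of points with MDEF > k_sigma * sigma_MDEF at some tested radius."""
    n = len(points)
    n_max = n - 1 if n_max is None else n_max
    d = [[math.dist(p, q) for q in points] for p in points]

    def count(i, radius):
        # n(p_i, radius): objects within the given radius of p_i (including p_i).
        return sum(1 for j in range(n) if d[i][j] <= radius)

    flagged = []
    for i in range(n):
        critical = sorted(d[i][j] for j in range(n) if j != i)
        for r in critical[n_min - 1:n_max]:
            sampling = [j for j in range(n) if d[i][j] <= r]  # r-neighborhood
            counts = [count(j, alpha * r) for j in sampling]
            n_hat = sum(counts) / len(counts)
            sigma_n_hat = math.sqrt(
                sum((c - n_hat) ** 2 for c in counts) / len(counts)
            )
            mdef = 1.0 - count(i, alpha * r) / n_hat
            sigma_mdef = sigma_n_hat / n_hat
            if mdef > k_sigma * sigma_mdef:
                flagged.append(i)
                break
    return flagged


# A 5 x 5 grid of points plus one isolated point at index 25.
points = [(x, y) for x in range(5) for y in range(5)] + [(20, 20)]
print(25 in loci_outliers(points))  # True
```

For the isolated point, the counting neighborhood contains only the point itself, while every sampled grid point sees the whole grid, so MDEF exceeds three times σMDEF as soon as the sampling neighborhood holds more than nine grid points.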

4. Excel built-in functions used

This section contains information regarding the Excel built-in functions that were used in the VBA code. These functions have simplified the workload and also perform in a timely manner. Each function is explained in the corresponding table, which also mentions its call in Visual Basic.

Mathematical and statistical functions

| Name | Call | Description |
|---|---|---|
| Average | WorksheetFunction.average() | Computes the average of the input data set, which is referenced in the inputrange variable. |
| Standard deviation | WorksheetFunction.stDev() | Computes the standard deviation of a range of cells, taking as parameter the inputrange variable. |
| Standard normal distribution | WorksheetFunction.NormSDist() | Computes the standard normal distribution factor for the z-score of each data point. |
| Quartile | WorksheetFunction.Quartile() | Computes the specified quartile from a range of cells. The quartile is given as an integer number, 1 or 3 in the case of the second algorithm. |
| Median | WorksheetFunction.median() | Computes the median for the range of cells specified as input, in our case the inputrange variable. |
| Round Down | WorksheetFunction.RoundDown() | Rounds down a given real number to the specified number of decimal places. |
| Round Up | WorksheetFunction.RoundUp() | Rounds up a given real number to the specified number of decimal places. |
| Sort fields | Worksheet.Sort.SortFields.Add() | Sorts a range of cells based on the selected column and conditions. |
| Maximum | WorksheetFunction.Max() | Computes the maximum value from the range parameter, in our case the inputrange. |

Graphical representation functions

| Name | Call | Description |
|---|---|---|
| Add chart | ActiveSheet.Shapes.AddChart | Creates a new chart in the output sheet, which will be populated with the data points from the inputrange. |
| Color outliers | ActiveChart.SeriesCollection(1).Points(point).markerbackgroundcolor | Used to color in red the outlier points in the scatter plot. |
| Cut to clipboard | Selection.Cut | Cuts the selected chart to the clipboard for future placement inside the user form. |

Special module

In order to show the graphs in the user form, we have used the module compastepicture, which copies the graph from the clipboard and places it inside forms. The module was created by Stephen Bullen and was made available as open source code on the Internet.

5. Future implementations

The program is already fairly complex, but it can always be improved in future implementations. There are some ideas to enhance its usefulness.

One of these ideas is the use of text files as input or output. At present the input has to be a worksheet, but the program would be more useful if this were not required. Often the input information is available as plain text, and it is very tedious to enter all of it in a worksheet by hand. Therefore, this could be a good new feature for future implementations.

On the other hand, the output could also be written as plain text. This would help to store the data in some cases, and it could be used again as input for the program. For example, the distance tables computed by the k-th nearest neighbor algorithm could be stored in plain text and used later as input for the LOF algorithm.

6. References

[1] V. Chandola, A. Banerjee, V. Kumar (2009). Anomaly Detection: A Survey. ACM Computing Surveys.
[2] M.M. Breunig, H.-P. Kriegel, R.T. Ng, J. Sander (2000). LOF: Identifying density-based local outliers. Proceedings of ACM SIGMOD, Texas.
[3] S. Papadimitriou, H. Kitagawa, P.B. Gibbons, C. Faloutsos (2003). LOCI: Fast outlier detection using the local correlation integral. Proceedings of the 19th ICDE.
[4] Russell Langley, Practical Statistics Simply Explained, Dover Publications; Revised edition (June 1, 1971).

7. Notes

The application has been created as the project for the Optimization of Business Processes course held at Vrije University in January.

The creators of the application are Gentiana Coman, Asparuh Hristov, Daniel Corteso and Fernando Nunez.

All the copyrights go to the aforementioned persons. The application is free to use.
