OUTLIER DETECTION. Short Course Session 1. Nedret BILLOR Auburn University Department of Mathematics & Statistics, USA


1 OUTLIER DETECTION Short Course Session 1. Nedret BILLOR, Auburn University, Department of Mathematics & Statistics, USA. Statistics Conference, Colombia, Aug 8-12, 2016

2 OUTLINE Motivation and Introduction; Approaches to Outlier Detection; Sensitivity of Statistical Methods to Outliers; Statistical Methods for Outlier Detection; Outliers in Univariate Data; Outliers in Multivariate Data; Classical and Robust Statistical Distance-based Methods; PCA-based Outlier Detection; Outliers in Functional Data

3 MOTIVATION & INTRODUCTION Hadlum vs. Hadlum (1949) [Barnett 1978] Ozone Hole

4

5 Case I: Hadlum vs. Hadlum (1949) [Barnett 1978] The birth of a child to Mrs. Hadlum happened 349 days after Mr. Hadlum left for military service. Average human gestation period is 280 days (40 weeks). Statistically, 349 days is an outlier.

6 Case I: Hadlum vs. Hadlum (1949) [Barnett 1978] blue: statistical basis (13,634 observations of gestation periods); green: assumed underlying Gaussian process. The probability that the birth of Mrs. Hadlum's child was generated by this process is very low. red: Mr. Hadlum's assumption (another Gaussian process is responsible for the observed birth). Under this assumption the observed gestation period has average duration and the highest possible probability.

7 Case II: The Antarctic Ozone Hole The History behind the Ozone Hole The Earth's ozone layer protects all life from the sun's harmful radiation.

8 Case II: The Antarctic Ozone Hole (cont.) Human activities (e.g. CFCs in aerosols) have damaged this shield. Less protection from ultraviolet light will, over time, lead to higher rates of skin cancer and cataracts, and to crop damage.

9 Case II: The Antarctic Ozone Hole (cont.) Molina and Rowland in 1974 (lab study), and many studies after it, demonstrated the ability of CFCs (chlorofluorocarbons) to break down ozone in the presence of high-frequency UV light. Further studies estimated the ozone layer would be depleted by CFCs by about 7% within 60 years.

10 Case II: The Antarctic Ozone Hole (cont.) The shock came in a 1985 field study by Farman, Gardiner and Shanklin (Nature, May 1985): British Antarctic Survey measurements showed that ozone levels had dropped to 10% below normal January levels for Antarctica.

11 Case II: The Antarctic Ozone Hole (cont.) The authors had been somewhat hesitant about publishing this result because Nimbus 7 satellite data had shown NO such DROP during the Antarctic spring! More comprehensive observations from satellite instruments looking down had shown nothing unusual!

12 Case II: The Antarctic Ozone Hole (cont.) But NASA soon discovered that the springtime "ozone hole" had been covered up by a computer program designed to discard sudden, large drops in ozone concentrations as "errors". The Nimbus 7 data were rerun without the filter program: evidence of the ozone hole was seen as far back as 1976.

13 One person's noise could be another person's signal!

14 What is an OUTLIER? No universally accepted definition! Hawkins (1980): An observation (or a few) that deviates (differs) so much from other observations as to arouse suspicion that it was generated by a different mechanism. Barnett and Lewis (1994): An observation (or a few) which appears to be inconsistent (different) with the remainder of that set of data.

15 What is an OUTLIER? Statistics-based intuition: normal data objects follow a generating mechanism, e.g. some given statistical process; abnormal objects deviate from this generating mechanism.

16 Applications of outlier detection Fraud detection Purchasing behavior of a credit card owner usually changes when the card is stolen Abnormal buying patterns can characterize credit card abuse

17 Applications of outlier detection (cont.) Medicine: unusual symptoms or test results may indicate potential health problems of a patient. Whether a particular test result is abnormal may depend on other characteristics of the patient (e.g. gender, age, ...).

18 Applications of outlier detection (cont.) Intrusion Detection Attacks on computer systems and computer networks

19 Applications of outlier detection (cont.) Sports statistics: in many sports, various parameters are recorded for players in order to evaluate the players' performances. Outstanding players (in a positive as well as a negative sense) may be identified as having abnormal parameter values. Sometimes, players show abnormal values only on a subset or a special combination of the recorded parameters.

20 Applications of outlier detection (cont.) Ecosystem Disturbance Hurricanes, floods, heatwaves, earthquakes

21 Applications of outlier detection (cont.) Detecting measurement errors Data derived from sensors (e.g. in a given scientific experiment) may contain measurement errors Abnormal values could provide an indication of a measurement error Removing such errors can be important in other data mining and data analysis tasks

22 What causes OUTLIERS? Data from different sources: such outliers are often of interest and are the focus of outlier detection in the field of data mining. Natural variation: outliers that represent extreme or unlikely variations are often interesting (correct but extreme responses; rare-event syndrome). Data measurement and collection error: the goal is to eliminate such anomalies, since they provide no interesting information and only reduce the quality of the data and of subsequent data analysis.

23 Difference between Noise and Outlier [Figure contrasting noise with an outlier.]

24 Approaches to OUTLIER detection

25 Approaches to OUTLIER detection: Statistical (or model-based) approaches; Proximity-based; Clustering-based; Classification-based: one-class and semi-supervised (i.e. combining classification-based and clustering-based methods). Reference: Data Mining: Concepts and Techniques, Han et al. 2012

26 Statistical (or model-based) approaches. Assume that the regular data follow some statistical model; outliers are the data not following the model. Example: first use a Gaussian distribution to model the regular data. For each object y in region R, estimate g_D(y), the probability that y fits the Gaussian distribution. If g_D(y) is very low, y is unlikely to have been generated by the Gaussian model and is thus an outlier. The effectiveness of statistical methods highly depends on whether the assumed statistical model holds in the real data. There are rich alternatives among statistical models, e.g. parametric vs. non-parametric.

27 Statistical Approaches. Parametric method: assumes that the normal data are generated by a parametric distribution with parameter θ. The probability density function f(x, θ) of the parametric distribution gives the probability that object x is generated by the distribution; the smaller this value, the more likely x is an outlier. Non-parametric method: does not assume an a priori statistical model, but determines the model from the input data. Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance. Examples: histogram and kernel density estimation.

28 Proximity based. Outlier: an observation whose proximity to the other observations in the same data set deviates significantly from the proximity of most of the other observations to each other. The effectiveness highly relies on the proximity measure; in some applications, proximity or distance measures cannot be obtained easily. Two major types of proximity-based outlier detection: distance-based (outlier if its neighborhood does not contain enough other points) and density-based (outlier if its density is relatively much lower than that of its neighbors).

29 Clustering Based Methods. Normal data belong to large and dense clusters; outliers belong to small or sparse clusters, or to no cluster at all. There are many clustering methods, and therefore many clustering-based outlier detection methods! Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale well to large data sets.

30 Classification Based Methods. Idea: train a classification model that can distinguish normal data from outliers. A brute-force approach: use a training set that contains samples labeled normal and others labeled outlier. But the training set is typically heavily biased (the number of normal samples likely far exceeds the number of outlier samples), and such a model cannot detect unseen anomalies. Two types: one-class and semi-supervised.

31 Classification Based Methods (cont.) One-class model: a classifier is built to describe only the normal class. Learn the decision boundary of the normal class using classification methods such as SVM; any sample that does not belong to the normal class (is not within the decision boundary) is declared an outlier. Advantage: can detect new outliers that may not appear close to any outlier objects in the training set.

32 Classification Based Methods (cont.) Semi-supervised learning: combining classification-based and clustering-based methods. Method: using a clustering-based approach, find a large cluster C and a small cluster C_1. Since some observations in C carry the label normal, treat all observations in C as normal, and use the one-class model of this cluster to identify normal observations in outlier detection. Since some observations in cluster C_1 carry the label outlier, declare all observations in C_1 as outliers. Any observation that does not fall into the model for C (such as point a in the figure) is considered an outlier as well.

33 Sensitivity of Statistical Methods to Outliers

34 Sensitivity of Statistical Methods to Outliers Data often (always) contain outliers. Statistical methods are severely affected by outliers!

35 Sensitivity of Statistical Methods to Outliers on the sample mean (p=2)

36 Sensitivity of Statistical Methods to Outliers on Regression Analysis. [Scatterplots of Y vs X: (a) with the point (6,20); (b) without (6,20).]

37 Sensitivity of Statistical Methods to Outliers on Classification. Forest Soil data (n_1=11, n_2=23, n_3=24). [3D scatterplot of sod vs mag vs pot.] Misclassification error rates: G1: 91% (10 obs.), G2: 4% (9 obs.), G3: 75% (18 obs.); overall: 50%.

38 Statistical Methods for OUTLIER Detection for Univariate & Multivariate data

39 Statistical (or model-based) approaches. Assume that the regular data follow some statistical model; outliers are the data not following the model. Effectiveness highly depends on whether the assumed statistical model holds in the real data. There are rich alternatives among statistical models, e.g. parametric vs. non-parametric.

40 NOTATION X: data (gene expression levels) matrix of size n×p; n: number of patients; p: number of genes. p=1: univariate data; p>1: multivariate data. n>p: low dimensional; n<p: high dimensional. X = (x_ij), i = 1,...,n, j = 1,...,p, with the genes in columns and the n patients in rows: row i is (x_i1, ..., x_ip).

41 OUTLIERS in Univariate data. Standard Deviation (SD) METHOD. The simplest classical approach to screening outliers is the SD method. 2 SD method: mean ± 2 SD; 3 SD method: mean ± 3 SD, where the mean is the sample mean and SD is the sample standard deviation. Observations outside these intervals may be considered outliers!

42 OUTLIERS in Univariate data (cont.) Z-scores method (Grubbs' test, 1969): not robust! Z_i = (X_i - X̄)/s, i = 1, 2, ..., n. Robust Z-scores (Iglewicz and Hoaglin, 1993): Z_i = 0.675 (X_i - median(X)) / MAD(X), i = 1, 2, ..., n, where MAD = median{|X_i - median(X)|} and E(MAD) = 0.675 σ for large normal data. Rule: if |Z_i| > 3, the ith observation is an outlier!
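A minimal sketch of the robust Z-score rule above, in Python with NumPy (the course's own examples use R); the small data set and its planted outlier are hypothetical:

```python
import numpy as np

def robust_z_scores(x):
    # Robust Z-scores (Iglewicz and Hoaglin, 1993):
    # Z_i = 0.675 * (x_i - median) / MAD; flag |Z_i| > 3 as an outlier.
    # (The classical version would use the mean and s instead.)
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))       # median absolute deviation
    return 0.675 * (x - med) / mad

x = [10.0, 11.0, 9.0, 10.5, 9.5, 50.0]     # 50 is a planted outlier
z = robust_z_scores(x)
flags = np.abs(z) > 3                      # only the planted value is flagged
```

Because the median and MAD ignore the extreme value, the planted point gets a huge score while the clean points stay well below 3.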

43 Why Robust Statistics? Data often (always) contain outliers. Classical estimates, such as the sample mean, the sample variance, sample covariances and correlations, or the LS (least-squares) fit, are severely affected by outliers. Robust estimates provide a good fit to the bulk of the data, and this good fit helps to identify outlier(s) accurately! This lecture gives an overview of these concepts.

44 Issues: When is an observation outlying enough to be detected? Deleting good observations results in inaccurate estimation. So, are there any alternatives to the mean and standard deviation? One suggestion: the MEDIAN and the Median Absolute Deviation (MAD).

45 Mean and Standard Deviation. Let x = (x_1, x_2, ..., x_n) be a set of observed values. The sample mean and variance are x̄ = (1/n) Σ_{i=1}^n x_i and s² = (1/(n-1)) Σ_{i=1}^n (x_i - x̄)². Example: a dotplot of the data containing 24 determinations of the copper content in wholemeal flour (in parts per million), with a table of the mean and sd computed with and without the outlier.

46 Median: med(x), the middle order statistic. MAD: MAD(x) = median{|x_i - med(x)|}. To make the MAD comparable to the SD, use the normalized MAD, MADN(x) = MAD(x)/0.675. (If X ~ N(μ, σ²), then MADN(X) = σ.) [Table of mean/sd and median/MAD, with and without the outlier.]

47 Replace observation 24 with a range of values x and recalculate the location measures. [Plot of the mean and the median as functions of x.]

48 Replace observation 24 with a range of values x and recalculate the scale measures. [Plot of the sd and the MAD as functions of x.] Thus we can say that a single outlier has an unbounded influence on these two classical statistics (the mean and the sd), while the median and the MAD remain stable.
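The unbounded-influence claim is easy to check numerically. A small sketch (Python/NumPy, hypothetical data) appends one value to a clean sample and drags it further and further out:

```python
import numpy as np

clean = np.array([1.8, 2.1, 2.0, 1.9, 2.2, 2.0, 2.1, 1.9])  # hypothetical sample

def location_scale(y):
    # (mean, sd) are the classical statistics; (median, MAD) the robust ones.
    mad = np.median(np.abs(y - np.median(y)))
    return y.mean(), y.std(ddof=1), np.median(y), mad

# append one contaminating value and let it grow
stats = {bad: location_scale(np.append(clean, bad)) for bad in (2.0, 20.0, 200.0)}
```

As the contaminating value grows, the mean and sd grow without bound, while the median and MAD are essentially untouched.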

49 Example: The following is the Q-Q plot for a dataset containing 20 determinations of the time (in microseconds) needed for light to travel a distance of 7442 m. The SD rule (i.e. the three-sigma edit rule) fails to identify one of the two observations flagged by the Q-Q plot: observation -2 has z = -1.35 and observation -44 has z = -3.73. Replacing the sample mean and sd in the rule by the median and MAD gives robust scores of -4.64 and -11.73 respectively (the values in parentheses in the plot), so both outliers are identified. The reason -2 has such a small z_i is that both observations pull the sample mean x̄ to the left and inflate s; it is said that the value -44 masks the value -2.

50 OUTLIERS in Univariate data (cont.) BOXPLOT (Tukey's method, 1977). A value between the inner fences [Q1 - 1.5·IQR, Q3 + 1.5·IQR] and the outer fences [Q1 - 3·IQR, Q3 + 3·IQR] is a possible outlier. An extreme value beyond the outer fences is a probable outlier.
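Tukey's fences follow directly from the quartiles; a short Python/NumPy sketch (the small integer sample is hypothetical):

```python
import numpy as np

def tukey_fences(x):
    # Inner fences flag possible outliers, outer fences probable outliers.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

inner, outer = tukey_fences([1, 2, 3, 4, 5, 6, 7, 8, 9])
```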

51 OUTLIERS in Univariate data (cont.) Adjusted BOXPLOT (Vanderviere and Huber, 2004). Tukey's method is based on robust measures such as the lower and upper quartiles and the IQR, without considering the skewness of the data. Vanderviere and Huber (2004) introduced an adjusted boxplot taking into account the medcouple (MC), a robust measure of skewness for a skewed distribution: MC(x_1, ..., x_n) = med_{x_i ≤ med_k x_k ≤ x_j} [ ((x_j - med_k x_k) - (med_k x_k - x_i)) / (x_j - x_i) ], where i and j have to satisfy x_i ≤ med_k x_k ≤ x_j and x_i ≠ x_j. The interval of the adjusted boxplot is as follows (G. Bray et al. (2005)): [L, U] = [Q1 - 1.5·exp(-3.5·MC)·IQR, Q3 + 1.5·exp(4·MC)·IQR] if MC ≥ 0; [L, U] = [Q1 - 1.5·exp(-4·MC)·IQR, Q3 + 1.5·exp(3.5·MC)·IQR] if MC < 0.
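The medcouple and the adjusted fences can be sketched as below (Python/NumPy). This is a naive O(n²) implementation assuming no ties at the median; the fence constants follow the formula quoted above from G. Bray et al. (2005):

```python
import numpy as np

def medcouple(x):
    # Naive O(n^2) medcouple: median of the kernel
    # h(x_i, x_j) = ((x_j - med) - (med - x_i)) / (x_j - x_i)
    # over pairs x_i <= med <= x_j, x_i != x_j (no ties at the median assumed).
    x = np.sort(np.asarray(x, dtype=float))
    med = np.median(x)
    lo, hi = x[x <= med], x[x >= med]
    h = [((xj - med) - (med - xi)) / (xj - xi)
         for xi in lo for xj in hi if xj != xi]
    return np.median(h)

def adjusted_fences(x):
    # Skewness-adjusted boxplot fences, with the constants quoted in the slide.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mc = medcouple(x)
    if mc >= 0:
        return q1 - 1.5 * np.exp(-3.5 * mc) * iqr, q3 + 1.5 * np.exp(4.0 * mc) * iqr
    return q1 - 1.5 * np.exp(-4.0 * mc) * iqr, q3 + 1.5 * np.exp(3.5 * mc) * iqr
```

For symmetric data MC = 0 and the fences reduce to Tukey's; for right-skewed data the upper fence is pushed out, so fewer large values are wrongly flagged.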

52 R Package robustbase ( 2011) Adjusted BOXPLOT (Vanderviere and Huber, 2004)

53 OUTLIERS in Univariate data (cont.) The MADe method uses the median and the Median Absolute Deviation (MADe). 2 MADe method: median ± 2 MADe; 3 MADe method: median ± 3 MADe, where MADe = 1.483·MAD for large normal data. The MAD is an estimator of the spread in the data, similar to the standard deviation, but with an approximately 50% breakdown point, like the median, where MAD = median{|x_i - median(x)|, i = 1, 2, ..., n}.
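The MADe interval is a drop-in robust replacement for the mean ± k·SD rule; a short sketch (Python/NumPy, hypothetical data):

```python
import numpy as np

def made_interval(x, k=3):
    # k-MADe interval: median +/- k * MADe, with MADe = 1.483 * MAD
    # (the 1.483 factor makes the MAD consistent with the SD under normality).
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    made = 1.483 * np.median(np.abs(x - med))
    return med - k * made, med + k * made

lo, hi = made_interval([10.0, 11.0, 9.0, 10.5, 9.5, 50.0])  # 50 is planted
```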

54 [Dot plot and histogram of the variable C, with the outlier marked.]

55 Outliers in Multivariate Data

56 Outliers in Multivariate Data. Multivariate data: a data set involving two or more variables (the n×p matrix X of the notation slide, with genes in columns and patients in rows). Idea: transform the multivariate outlier detection task into a univariate outlier detection problem.

57 OUTLIERS in Multivariate Data Visual tools Scatter plots and 3D scatter plots Higher dimensions???

58 OUTLIERS in Multivariate Data. Chernoff faces (Chernoff, 1973; Flury and Riedwyl, 1988).

59 OUTLIERS in Multivariate Data. Andrews curves: coding and representing multivariate data by curves (Andrews, 1972).

60 Classical and Robust Statistical Distance based Methods

61 Statistical distance-based methods (n>p). Method: detect outliers by computing a measure of how far a particular point is from the center of the data. The usual measure of outlyingness for a data point is the Mahalanobis (1936) distance: D_i = sqrt((x_i - x̄)' S⁻¹ (x_i - x̄)), i = 1, 2, ..., n. Use Grubbs' test (the maximum normed residual test, another statistical method under the normal distribution) on this measure to detect outliers.

62 Statistical distance-based methods. The usual measure of outlyingness for a data point is the Mahalanobis (1936) distance: D_i = sqrt((x_i - x̄)' S⁻¹ (x_i - x̄)), i = 1, 2, ..., n. However, x̄ and S are themselves sensitive to outliers: a robust version of this method is needed!
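The classical distance is straightforward to compute; a sketch (Python/NumPy, hypothetical 2-D data with one planted outlier). A single outlier is still caught here; it is clusters of outliers, which distort x̄ and S together, that motivate the robust versions that follow:

```python
import numpy as np

def mahalanobis_distances(X):
    # D_i = sqrt((x_i - xbar)' S^{-1} (x_i - xbar)), classical mean/covariance.
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    Sinv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, Sinv, diff))

X = np.array([[0, 0], [1, 1], [1, 0], [0, 1], [2, 1],
              [1, 2], [2, 2], [0, 2], [2, 0], [10, 10]], dtype=float)
d = mahalanobis_distances(X)               # the last point stands out
```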

63 Robust Statistical Distance-based Methods. Two phases for outlier detection methods (Rocke and Woodruff, 1996): obtain robust estimates of location T and scatter C, then calculate the robust Mahalanobis-type distance RD_i = sqrt((x_i - T)' C⁻¹ (x_i - T)), i = 1, ..., n. Outlier boundary: determine a separation boundary Q; if RD_i > Q, the ith observation is declared an outlier.

64 Robust Statistical Distance-based Methods. MVE (minimum volume ellipsoid); MCD (minimum covariance determinant); FAST-MCD. Rousseeuw & van Zomeren (1990), Rousseeuw & Van Driessen (1999). R package robustbase (covMcd()). project.org/web/packages/robustbase/vignettes/fastmcdkmini.pdf project.org/web/packages/robustbase/robustbase.pdf

65 MCD Algorithm. Determine the ellipsoid containing h ≈ [(n+p+1)/2] points with minimum covariance determinant. RD_i = sqrt((x_i - x̄_MCD)' S_MCD⁻¹ (x_i - x̄_MCD)), i = 1, ..., n. The exact solution requires a combinatorial search; approximations are used in practice. Limitation: very slow for large p!
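The concentration ("C") step at the heart of the FAST-MCD approximation can be sketched as below (Python/NumPy). This is an illustrative toy, not the published FAST-MCD of Rousseeuw & Van Driessen (no consistency correction, no nested subsampling); in practice one would use robustbase's covMcd() in R. The planted-outlier data are hypothetical:

```python
import numpy as np

def mcd_cstep(X, n_starts=20, n_csteps=10, seed=0):
    # From random elemental starts, repeatedly keep the h points closest in
    # Mahalanobis distance to the current subset's fit (the C-step), and keep
    # the subset whose covariance has the smallest determinant.
    n, p = X.shape
    h = (n + p + 1) // 2
    rng = np.random.default_rng(seed)
    best_det, best_idx = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=p + 1, replace=False)   # elemental start
        for _ in range(n_csteps):
            mu = X[idx].mean(axis=0)
            Sinv = np.linalg.pinv(np.cov(X[idx], rowvar=False))
            d2 = np.einsum('ij,jk,ik->i', X - mu, Sinv, X - mu)
            idx = np.argsort(d2)[:h]                     # C-step: keep h closest
        det = np.linalg.det(np.cov(X[idx], rowvar=False))
        if det < best_det:
            best_det, best_idx = det, idx
    return X[best_idx].mean(axis=0), np.cov(X[best_idx], rowvar=False)

rng = np.random.default_rng(1)
X = rng.normal(size=(45, 2))
X[:5] += 10.0                                  # plant 5 clustered outliers
mu, S = mcd_cstep(X)
rd = np.sqrt(np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(S), X - mu))
```

Because the fit is computed on the h most concentrated points, the clustered outliers get very large robust distances instead of masking each other.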

66 An Application for MCD: Average brain and body weights for 28 species of land animals.

67 An application for MCD : Average brain and body weights for 28 species of land animals.

68 Whoever knows the ways of Nature will more easily notice her deviations; and, on the other hand, whoever knows her deviations will more accurately describe her ways. Francis Bacon (1620), Novum Organum II 29.

69 Robust Statistical Distance-based Methods. BACON (Blocked Adaptive Computationally Efficient Outlier Nominators) (Billor, Hadi and Velleman, 2000). R package robustX (depends on robustbase) ( project.org/web/packages/robustX/robustX.pdf)

70 Algorithm 1: General BACON Algorithm Step 1: Identify an initial basic subset of m>p observations that can safely be assumed free of outliers, where p is the dimension of the data and m is an integer chosen by the data analyst. Step 2: Fit an appropriate model to the basic subset, and from that model compute discrepancies for each of the observations. Step 3: Find a larger basic subset consisting of observations known (by their discrepancies) to be homogeneous with the basic subset. Generally, these are the observations with smallest discrepancies. This new basic subset may omit some of the previous basic subset observations, but it must be as large as the previous basic subset. Step 4: Iterate Steps 2 and 3 to refine the basic subset, using a stopping rule that determines when the basic subset can no longer grow safely. Step 5: Nominate the observations excluded by the final basic subset as outliers.

71 Algorithm 2: Initial Basic Subset in Multivariate Data. Input: an n×p data matrix X and a number, m, of observations to include in the initial basic subset. Output: an initial basic subset of at least m observations. Two versions: Version 1 (V1), initial subset selected based on Mahalanobis distances; Version 2 (V2), initial subset selected based on distances from the medians.

72 BACON Algorithm (Billor, Hadi and Velleman, 2000). Therefore two versions of BACON. One version is nearly affine equivariant, has a high breakdown point (upwards of 40%), and is computationally efficient even for very large datasets. The other version is affine equivariant, at the expense of a somewhat lower breakdown point (about 20%), but with the advantage of even lower computational cost for very large datasets.

73 BACON Algorithm (cont.) Step 1: Divide the observations according to a suitably chosen initial distance d_i into two subsets: a basic subset and a non-basic subset. Basic subset: a small subset, initially of size m = cp, containing the observations with the smallest values of d_i. Non-basic subset: the rest of the data.

74 BACON Algorithm (cont.) Step 2. Compute d_i(x̄_b, S_b) = sqrt((x_i - x̄_b)' S_b⁻¹ (x_i - x̄_b)), i = 1, ..., n, where x̄_b and S_b are the mean and covariance of the basic subset. Step 3. Form a new basic subset containing all observations with d_i(x̄_b, S_b) < c, where c is a critical value (a chi-squared table value). Step 4. Repeat Steps 2-3 until the size of the basic subset is n, or all observations in the non-basic subset have d_i(x̄_b, S_b) ≥ c; declare the observations in the non-basic subset (if any) as outliers.
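The steps above can be sketched as below (Python/NumPy/SciPy, version V1 start). The published algorithm also applies a small-sample correction factor to the chi-squared cutoff, omitted here for clarity; the planted-outlier data are hypothetical:

```python
import numpy as np
from scipy.stats import chi2

def _dist(X, mu, S):
    diff = X - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff))

def bacon(X, alpha=0.05, c=4):
    # Sketch of multivariate BACON, version V1 (Mahalanobis-distance start).
    n, p = X.shape
    d = _dist(X, X.mean(axis=0), np.cov(X, rowvar=False))
    basic = np.argsort(d)[:c * p]              # initial basic subset, m = c*p
    cutoff = np.sqrt(chi2.ppf(1 - alpha / n, p))
    for _ in range(100):                       # iterate until the subset is stable
        d = _dist(X, X[basic].mean(axis=0), np.cov(X[basic], rowvar=False))
        new = np.flatnonzero(d < cutoff)
        if np.array_equal(new, basic):
            break
        basic = new
    return np.setdiff1d(np.arange(n), basic)   # nominated outliers

rng = np.random.default_rng(2)
X = rng.normal(size=(45, 2))
X[:5] += 10.0                                  # plant 5 outliers
out = bacon(X)
```

Only a handful of fitting steps are needed in practice, which is what makes BACON cheap on very large data sets.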

75 An Application for BACON: Average brain and body weights for 28 species of land animals.

76 Advantages of BACON algorithm Outlier detection methods have suffered in the past from a lack of generality and a computational cost that escalated rapidly with the sample size. Samples of a size sufficient to support sophisticated methods rapidly grow too large for previously published outlier detection methods to be practical. The BACON algorithms given here reliably detect multiple outliers at a cost that can be as low as four repetitions of the underlying fitting method. They are thus practical for data sets of even millions of cases. The BACON algorithms balance between affine equivariance and robustness.

77 Outlier Detection Methods based on Principal Component Analysis (PCA)

78 What is PCA? Generally speaking, two objectives. Data (dimension) reduction: moving from many original variables down to a few composite variables (linear combinations of the original variables). Interpretation: which variables play a larger role in the explanation of total variance.

79 PCA: Principal Component Analysis. PCs are constructed by maximizing variance: u_i = argmax_{u'u=1} Var(u'X), subject to u_i' X'X u_j = 0 for 1 ≤ j < i. Z_i = u_i1 X_1 + ... + u_ip X_p = u_i' X, i = 1, 2, ..., k, is the ith PC score, or the ith principal component.
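The classical construction can be sketched as an eigen-decomposition of the sample covariance matrix (Python/NumPy, hypothetical data with three clearly different variances):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])  # hypothetical data
Xc = X - X.mean(axis=0)                       # center the data
evals, U = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]               # eigenvalues in decreasing order
evals, U = evals[order], U[:, order]          # columns of U are the u_i
Z = Xc @ U                                    # PC scores Z_i = u_i' x
```

By construction the scores are uncorrelated and their variances are exactly the eigenvalues, which is the maximal-variance property stated above.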

80 Geometric understanding of PCA for 3D point cloud 1st PC direction (maximizing variance of projections) explains the cloud best. 1st and 2nd directions form a plane.

81 [Image-compression example: a 257×235 image X reconstructed from its first k principal components, for k = 2, 10, 50, 100, 200.]

82 PCA & Outlier Detection Methods that use classical PCs to identify potential outliers & Methods for robustly estimating PCs that may also be used to detect potential outliers.

83 Types of Outliers in PC space: regular observations, orthogonal outliers, good leverage points, and bad leverage points.

84 PCA based Outlier Detection. Diagnostic plot: to detect the type of observations in high dimensional data, plot the orthogonal distance OD_i = ||x_i - μ̂ - P z_i|| against the score distance SD_i = sqrt(Σ_{j=1}^k z_ij² / l_j), i = 1, 2, ..., n, where z_i is the ith row of the score matrix T_{n×k} = (X_{n×p} - 1_n μ̂') P_{p×k}, the l_j (j = 1, 2, ..., k) are the eigenvalues of the variance-covariance matrix, and P is the matrix of eigenvectors corresponding to these eigenvalues.
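The two distances can be sketched as below (Python/NumPy, classical PCA for simplicity; a robust variant would replace the mean and covariance by robust estimates). The 3-D data set and the two planted special points are hypothetical:

```python
import numpy as np

def pca_diagnostics(X, k):
    # Score distance SD_i = sqrt(sum_j z_ij^2 / l_j) and orthogonal distance
    # OD_i = ||x_i - mu - P z_i|| for a k-dimensional PC subspace.
    mu = X.mean(axis=0)
    Xc = X - mu
    evals, P = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(evals)[::-1][:k]
    l, P = evals[order], P[:, order]
    Z = Xc @ P                                  # scores
    sd = np.sqrt((Z ** 2 / l).sum(axis=1))
    od = np.linalg.norm(Xc - Z @ P.T, axis=1)
    return sd, od

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([3.0, 1.0, 0.1])  # lies near a plane
X[0] = [0.0, 0.0, 5.0]      # orthogonal outlier: off the PC plane
X[1] = [20.0, 0.0, 0.0]     # good leverage point: far out within the plane
sd, od = pca_diagnostics(X, k=2)
```

The orthogonal outlier dominates OD and the leverage point dominates SD, which is exactly how the diagnostic plot separates the outlier types.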

85 Diagnostic plot. [Schematic: OD_i on the vertical axis against the score distance on the horizontal axis, with regions for regular observations, good leverage points, orthogonal outliers, and bad leverage points.]

86 Sensitivity of PCA to Outliers. Classical PCA (CPCA) would mislead the analyst in the presence of outliers, since the covariance (or correlation) matrix is sensitive to outliers! [Two-dimensional example showing how outliers affect PCA.]

87 Robust PCA based Outlier Detection. A) Replace the classical covariance matrix by a robust covariance estimator. M-estimator: Maronna (1976), Campbell (1980). MCD method: Rousseeuw & Van Driessen (1999). S-estimator: Davies (1987), Rousseeuw and Leroy (1987). PROBLEM: valid ONLY for p<n (and not very large p); NOT suitable for high dimensional data sets (p>n).

88 Robust PCA. B) Approaches based on projection pursuit (PP). Li and Chen (1985) (high computational cost!); Croux and Ruiz-Gazen (2000) (numerical inaccuracy!); Hubert et al. (2002) (searches for the direction on which the projected observations have the largest robust scale, then removes this dimension and repeats). Suitable for data sets with large p and/or large n! Problem: numerical inaccuracy and high computational cost!

89 Robust PCA. C) Based on a combination of the ideas of PP and robust covariance estimation: i) based on FAST-MCD (Hubert et al., 2005) (ROBPCA); ii) based on BACON (blocked adaptive computationally efficient outlier nominators; Billor et al., 2000) (RBPCA).

90 RBPCA: Steps in this algorithm. Case 1: n>p. Step 1. Use the singular value decomposition (SVD) on the centered data matrix, X_c = X - 1μ̂' = U D P' (an affine transformation of the data); score matrix T = X_c P. Step 2. Determine the mean (μ̂_B) and the variance-covariance matrix (Σ̂_B) from the clean observations obtained by BACON. Step 3. Find the robust PCs of the BACON-based covariance matrix, and determine the number of PCs, k. Step 4. The new robust score matrix is T* = (X - 1μ̂_B') P*.

91 RBPCA: Steps in this algorithm. Case 2: n<p. Step 1. Use the SVD on the centered data matrix, X_c = X - 1μ̂' = U D P' (an affine transformation of the data); score matrix T = X_c P.

92 RBPCA: Steps in this algorithm. Step 2. Find a clean set of observations (say h = (n+p+1)/2) from T (n<p) by using the outlyingness measure: project the high dimensional data points onto many univariate directions v.

93 Outlyingness measure. For every direction v, find the robust center μ̂_R and the robust standard deviation σ̂_R of the projected data x_j'v (j = 1, 2, ..., n). Find out which points are outlying on the projection vector by the outlyingness measure: outl_i = max_v |x_i'v - μ̂_R| / σ̂_R, i = 1, ..., n. Determine h clean observations (n/2 < h < n; e.g. h = 0.75n, or h = (n+p+1)/2).
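The measure above maximizes over all directions; a practical sketch approximates it with many random unit directions, using the median and MAD as the robust center and scale (Python/NumPy, hypothetical data; a Stahel-Donoho-style approximation, not the exact search used by ROBPCA/RBPCA):

```python
import numpy as np

def projection_outlyingness(X, n_dirs=500, seed=0):
    # outl_i ~= max over random unit directions v of
    #           |x_i'v - median| / (1.4826 * MAD) of the projected sample.
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(X.shape[1], n_dirs))
    V /= np.linalg.norm(V, axis=0)            # unit directions
    Y = X @ V                                 # n x n_dirs projections
    med = np.median(Y, axis=0)
    mad = np.median(np.abs(Y - med), axis=0)
    return np.max(np.abs(Y - med) / (1.4826 * mad), axis=1)

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
X[0] += 8.0                                   # planted multivariate outlier
outl = projection_outlyingness(X)
```

The h observations with the smallest outlyingness form the clean set used in the next RBPCA step.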

94 Projecting points onto different projection vectors (dashed lines): on one projection B is outlying but A and C are not; on another projection B is not outlying.

95 RBPCA: Steps in this algorithm (cont.) Apply PCA to the data matrix of the h clean observations (T_{h×p}) and obtain the score matrix T_1 = (T - 1μ̂') P_1. Find the variance-covariance matrix of T_1. Determine the number of PCs, k ≤ p, such that k < n. Step 3. Estimate the mean vector and scatter matrix of the PC scores matrix T_1 using the BACON algorithm. Step 4. Obtain the robust PCs from the robust variance-covariance matrix in Step 3.

96 Issues for High Dimensional Data. Even for fast algorithms the computation times increase linearly with n but cubically with p! None of these methods work quite as well when the dimensionality is high!

97 PCOUT ALGORITHM (Filzmoser et al., 2008). A recent outlier identification algorithm, effective in high dimensions. Based on the robust distances obtained from semi-robust principal components of robustly sphered data. Separate weights for location and scatter outliers are computed from these distances; the combined weights are used for outlier identification. (See R: pcout)

98 PCOUT Algorithm (Filzmoser et al., 2008) Step 1. Robustly sphere the data: x*_ij = (x_ij - med(x_1j, ..., x_nj)) / mad(x_1j, ..., x_nj), j = 1, 2, ..., p. Step 2. Compute PCs based on the sample covariance matrix of the transformed data, retaining only those p* PCs that contribute at least 99% of the total variance. The rescaled scores z*_ij = (z_ij - med(z_1j, ..., z_nj)) / mad(z_1j, ..., z_nj) are used in the two further phases: finding location outliers and finding scatter outliers.

99 PCOUT Algorithm (Filzmoser et al., 2008) (cont.) Calculate Mahalanobis distances d^L (weighted based on robust kurtosis) and d^S (unweighted). Calculate location and scatter weights from the translated biweight function: w(d_i; c, M) = 1 if d_i < M; (1 - ((d_i - M)/(c - M))²)² if M ≤ d_i ≤ c; 0 if d_i > c. For location: M is the 1/3 quantile of d^L and c = median(d^L) + 2.5 mad(d^L). For scale: M is ..., c = ....

100 Final weights: w_i = (w_i^L + c)(w_i^S + c) / (1 + c)², with c = 0.25. If w_i < 0.25, the ith observation is an outlier.

101 PCOUT: Leukemia Data (72×7129). We will try to identify multivariate outliers among the 7129 genes, without using the information on the two leukemia types ALL and AML. The outlying genes will then be used for differentiating between the cases.

102 Outliers in Functional Data

103 Introduction. FDA: a collection of statistical methods for analyzing curves or functional data. In standard statistical analysis, the focus is on sets of data vectors (univariate, multivariate); in FDA, the focus is on data structures such as curves, shapes, images, or sets of functional observations.

104 What are Functional Data about? Figure: The change in temperature over the course of a year, taken from thirty five weather stations across Canada (Ramsay and Silverman, 2001). Atlantic stations in red, Continental in blue, Pacific in green, and Arctic in black.

105 What Questions can we ask of the functional data? Statisticians: How can I represent the temperature pattern of a Canadian city over the entire year instead of just looking at the twelve discrete points? Should I just "connect the dots," or is there a better way to do this? Do the summary statistics "mean" and "covariance" have any meaning when I'm dealing with curves? How can I determine the primary modes of variation in the data? How many typical modes can summarize these thirty five curves? Do these curves exhibit strictly sinusoidal behavior? Can I create an analysis of variance (ANOVA) or linear model with the curves as the response and the climate as the main effect?

106 Outliers in Functional Data Problem: If we have some curves that behave differently from the rest (i.e. outliers in functional form), what happens to the FDA techniques???

107 Outliers in Functional Data. The study of outlier detection in this setting started only recently, and was mostly limited to univariate curves, i.e. p = 1. Febrero-Bande et al. (2008) identified two reasons why outliers can be present in functional data: 1. gross errors caused by errors in measurement and recording, or typing mistakes, which should be identified and corrected if possible; 2. correctly observed data curves that are suspicious or surprising in the sense that they do not follow the same pattern as the majority of the curves.

108 Methods Functional Depth based Functional PCA based Functional Boxplot

109 Functional Depth based

110 What is Data Depth? A notion of data depth for non-parametric multivariate data analysis. It provides center-outward orderings of points in Euclidean space of any dimension and leads to a new non-parametric multivariate statistical analysis in which no distributional assumptions are needed. A data depth measures how deep (or central) a given point x in R^p is relative to F, a probability distribution in R^p (assuming {X_1, ..., X_n} is a random sample from F), or relative to a given data cloud. [Figure: iso-depth curves and the deepest point.]

111 Consider x 1 (t), x 2 (t),..., x n (t) : n functions defined on an interval I. Question: Which one is the deepest function?

112 Depth for functional data: used to measure the centrality of a curve with respect to a set of curves, e.g. to define the deepest function. The functional depth provides a center-outward ordering of a sample of curves, so order statistics can be defined. The idea of the deepest point of a set of data allows classifying a new observation by its distance to each class's deepest point.

113 Many Data Depth Functions in MV: 1. Mahalanobis depth (Mahalanobis, 1936); 2. half-space depth (Hodges, 1955; Tukey, 1975); 3. Oja depth (Oja, 1983); 4. simplicial depth (Liu, 1990); 5. majority depth (Singh, 1991); 6. projection depth (Zuo, 2003). Example: MD(y, F) = [1 + (y - μ_F)' Σ_F⁻¹ (y - μ_F)]⁻¹.

114 The Fraiman and Muniz Depth (FMD). A functional data depth (Fraiman and Muniz (2001)). Let F_{n,t}(x_i(t)) be the empirical cdf of the values of the curves x_1(t), ..., x_n(t) at a given time point t ∈ [a, b]: F_{n,t}(x_i(t)) = (1/n) Σ_{k=1}^n I(x_k(t) ≤ x_i(t)). The Fraiman and Muniz functional depth, hereafter FMD, of a curve x_i with respect to the set x_1, ..., x_n is FMD_n(x_i) = ∫_a^b D_n(x_i(t)) dt, where D_n(x_i(t)) = 1 - |1/2 - F_{n,t}(x_i(t))| is the univariate depth of the point x_i(t).
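For curves sampled on a common grid, the FMD can be sketched in a few lines (Python/NumPy; the integral is approximated by the mean over the grid, and the simulated curves are hypothetical):

```python
import numpy as np

def fm_depth(curves):
    # Fraiman-Muniz depth for an n x T array of curves on a common grid:
    # at each grid point, F_{n,t}(x_i(t)) = rank/n (no ties assumed),
    # D = 1 - |1/2 - F|, and the depth is the average of D over the grid.
    n, T = curves.shape
    ranks = np.argsort(np.argsort(curves, axis=0), axis=0) + 1
    F = ranks / n
    D = 1.0 - np.abs(0.5 - F)
    return D.mean(axis=1)

rng = np.random.default_rng(4)
curves = rng.normal(size=(15, 25))
curves[0] += 10.0                 # an outlying curve, far above the rest
depth = fm_depth(curves)          # the outlying curve gets the minimal depth
```

A curve that sits above (or below) all the others at every t attains the minimal possible depth 0.5, which is exactly the center-outward ordering the depth-based detection procedure exploits.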

115 Functional Depth based Method. Febrero-Bande et al. (2008) proposed the following outlier detection procedure for univariate functional data (i.e. p = 1). Step 1. For each curve, calculate its functional depth (several versions exist). Step 2. Delete observations with depth below a cutoff C. Step 3. Go back to Step 1 with the reduced sample, and repeat until no outliers are found. Step 3 was added in the hope of avoiding masking effects. The cutoff value C is obtained by a bootstrap procedure.

116 Example: Outlier Curves

117

118 Pros & Cons Pros: fast computation; nonparametric method. Cons: results depend on the chosen depth; the choice of the cutoff C is complicated.

119 Functional PCA based

120 Functional Principal Components Analysis Similar to multivariate Principal Components Analysis (PCA), but instead of data vectors we use curves x_i(t); summations become integrals, e.g., the PC score of curve x_i corresponding to the eigenfunction phi_k is z_ik = integral of x_i(t) phi_k(t) dt. Finding the PCs is equivalent to finding the eigenfunctions/eigenvalues of the covariance operator G(s, t), that is, solving integral of G(s, t) phi(t) dt = lambda phi(s), where G(s, t) = Cov(x(s), x(t)).

121 Functional PCA using basis expansion Select an appropriate orthogonal basis of L^2, Phi_k(t), k = 1, 2,... Assume y_ij = x_i(t_j) + eps_ij, where eps_ij is a measurement error. Each curve can then be represented as x_i(t) = sum_k c_ik Phi_k(t), where the number of basis functions K is selected via, for example, Generalized Cross-Validation, and the coefficients can be estimated using least squares. Obtain the matrix C = (c_ik); outliers in C will be the functional outliers, so apply robust multivariate PCA to C.
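Under the model y_ij = x_i(t_j) + eps_ij, the coefficients c_ik can be obtained by ordinary least squares against the basis evaluated on the grid. A minimal sketch with a truncated Fourier basis on [0, 1] (the basis choice and function names are illustrative):

```python
import numpy as np

def fourier_design(t, K):
    """Design matrix for a truncated Fourier basis on [0, 1]:
    a constant column plus K sine/cosine pairs evaluated at the points t."""
    cols = [np.ones_like(t)]
    for k in range(1, K + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)            # shape (T, 2K + 1)

def basis_coefficients(Y, t, K):
    """Least-squares coefficients c_ik for each observed curve (row of Y),
    under the model y_ij = x_i(t_j) + eps_ij with x_i in the basis span."""
    Phi = fourier_design(t, K)
    C, *_ = np.linalg.lstsq(Phi, Y.T, rcond=None)   # solve Phi C = Y^T
    return C.T                                      # (n_curves, 2K + 1)
```

The resulting coefficient matrix C is exactly the object the slide feeds into robust multivariate PCA: each row summarizes one curve, so multivariate outliers among the rows correspond to functional outliers among the curves.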

122 Robust PCA Examples Spherical PCA projects each centered observation onto the unit sphere: x_i -> (x_i - mu_hat)/||x_i - mu_hat||, where mu_hat is a robust estimator of the location parameter. ROBPCA uses the Minimum Covariance Determinant (MCD) for low-dimensional data (n > p), or a combination of Projection Pursuit and MCD for high-dimensional data (n < p). BACON PCA is based on the Blocked Adaptive Computationally Efficient Outlier Nominators (BACON) algorithm.
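Spherical PCA is simple to sketch: center robustly, project each observation onto the unit sphere, then run classical PCA on the projected points. In this sketch the coordinatewise median stands in for the robust location estimator (the original proposal uses the spatial median), and all names are illustrative:

```python
import numpy as np

def spherical_pca(X, center=None):
    """Spherical PCA sketch: PCA on observations projected to the unit sphere.

    Sphering caps each point's influence at unit norm, so a gross outlier
    can no longer dominate the estimated principal directions.
    """
    X = np.asarray(X, dtype=float)
    if center is None:
        center = np.median(X, axis=0)       # stand-in robust location
    D = X - center
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # leave points at the center alone
    U = D / norms                           # projections onto the unit sphere
    vals, vecs = np.linalg.eigh(np.cov(U, rowvar=False))
    order = np.argsort(vals)[::-1]          # eigenvectors, largest variance first
    return vecs[:, order], center
```

Even with one gross outlier far off-axis, the leading direction stays aligned with the bulk of the data, which is the point of the sphering step.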

123 Diagnostic Plots for Robust PCA Score distance: SD_i = sqrt( sum_{j=1}^{k} z_ij^2 / lambda_j ), i = 1,..., n, where z_ij are the PC scores and lambda_j are the eigenvalues. Orthogonal distance: OD_i = ||x_i - mu_hat - P z_i||, where mu_hat is the robust center estimator and P is the matrix of eigenvectors (loadings).
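Both diagnostic distances can be computed directly from a fitted k-component PCA (robust or classical); a sketch with the loadings P, eigenvalues lambda_j and center passed in (names are illustrative):

```python
import numpy as np

def score_and_orthogonal_distances(X, center, loadings, eigvals):
    """Score distance SD_i = sqrt(sum_j z_ij^2 / lambda_j) and orthogonal
    distance OD_i = ||x_i - center - P z_i|| for a k-component PCA,
    where P (p x k) holds the loadings and z_i = P^T (x_i - center)."""
    Xc = np.asarray(X, dtype=float) - center
    Z = Xc @ loadings                        # PC scores z_ij, shape (n, k)
    sd = np.sqrt(((Z ** 2) / eigvals).sum(axis=1))
    resid = Xc - Z @ loadings.T              # part lying outside the PC subspace
    od = np.linalg.norm(resid, axis=1)
    return sd, od
```

SD measures how far a point sits within the PC subspace (in units of each component's spread), while OD measures how far it sits from that subspace; plotting OD against SD gives the diagnostic (outlier) map of the next slide.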

124 Diagnostic Plots for Robust PCA In the resulting outlier map: SD_i large and OD_i small (points 1, 4): good leverage points; SD_i small and OD_i large (point 5): orthogonal outliers; SD_i large and OD_i large (points 2, 3): bad leverage points.

125 Poblenou NOx Data (Whole Dataset) NOx emissions in Poblenou, Barcelona (Spain) over 115 days; hourly measurements of NOx were made from 23 February 2005 to 26 June 2005. Sawant et al. (2012) identified 5 outlying curves using a robust functional PCA approach: 03/09, 03/11, 03/18, 04/29 and 05/02, all working days. The outliers fell on days leading into a long weekend or vacation period, and hence days with increased traffic flow.

126 (Figures: outliers flagged by CFPCA, RFPCA-MCD and RFPCA-BACON.)

127

128

129 Conclusion After detecting the outliers, we checked for the sources of the abnormal values of these curves. The days detected as outliers were weekends or related to short vacation periods around weekends, so the abnormal observations on those specific days can be attributed to an increase in traffic due to small vacation periods. We also detected an outlier on Wednesday, 9 March. The observation on 10 March has missing data and thus was not included in the analysis, so we could not pinpoint the reason behind the abnormal observation on 9 March.

130 Functional Boxplot

131 Example (Functional Boxplot) Data from monthly sea surface temperatures over the East Central tropical Pacific Ocean.

132 Functional Box Plot Extends the univariate box plot using data depth. The functional boxplot of sea surface temperature (SST) shows: 1. blue curves denoting the envelopes, 2. a black curve representing the median curve, and 3. red dashed curves marking the outlier candidates detected by the 1.5 times the 50% central region rule.
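The 50% central region underlying the functional boxplot can be sketched from the modified band depth (MBD) with bands formed by pairs of curves; function names are illustrative:

```python
import numpy as np
from itertools import combinations

def modified_band_depth(curves):
    """Modified band depth (MBD) with two-curve bands: for each curve, the
    average (over all pairs j < k) proportion of time it lies inside the
    pointwise band [min, max] of curves j and k."""
    X = np.asarray(curves, dtype=float)
    n = X.shape[0]
    depth = np.zeros(n)
    for j, k in combinations(range(n), 2):
        lo = np.minimum(X[j], X[k])
        hi = np.maximum(X[j], X[k])
        depth += ((X >= lo) & (X <= hi)).mean(axis=1)
    return depth / (n * (n - 1) / 2)

def central_region(curves, depth, p=0.5):
    """Pointwise envelope of the p*100% deepest curves, e.g. the 50% central
    region whose 1.5x inflation gives the functional boxplot fences."""
    X = np.asarray(curves, dtype=float)
    m = max(1, int(np.ceil(p * X.shape[0])))
    idx = np.argsort(depth)[::-1][:m]       # indices of the m deepest curves
    return X[idx].min(axis=0), X[idx].max(axis=0)
```

Ranking curves by MBD gives the median curve (the deepest), the central regions, and, via the 1.5 times the 50% central region rule, the outlier candidates drawn in the boxplot.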

133 Functional Box Plot The enhanced functional boxplot of SST with 1. dark magenta denoting the 25% central region, 2. magenta representing the 50% central region and 3. pink indicating the 75% central region.

134 Functional & Pointwise Boxplots The pointwise boxplots of SST with medians connected by a black line.

135 Pros & Cons Pros: fast computation; clear visualization. Cons: the choice of BD or MBD may not be optimal; poor performance for shape outliers. The command fbplot for functional boxplots is in the fda R package, and MATLAB code is also available.

136 Selected References 1. Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data, 3rd ed. New York: Wiley. 2. Billor, N., Hadi, A. and Velleman, P. (2000). BACON: Blocked adaptive computationally efficient outlier nominators. Computational Statistics and Data Analysis, 34. 3. Filzmoser, P., Maronna, R. and Werner, M. (2008). Outlier identification in high dimensions. Computational Statistics and Data Analysis, 52. 4. Hyndman, R. J. and Shang, H. L. (2010). Rainbow plots, bagplots, and boxplots for functional data. Journal of Computational and Graphical Statistics, 19(1). 5. López-Pintado, S. and Romo, J. (2009). On the concept of depth for functional data. Journal of the American Statistical Association, 104(486). 6. Sun, Y. and Genton, M. G. (2011). Functional boxplots. Journal of Computational and Graphical Statistics, 20. 7. Sawant, P., Billor, N. and Shin, H. (2012). Functional outlier detection with robust functional principal component analysis. Computational Statistics, 27.


Chuck Cartledge, PhD. 20 January 2018 Big Data: Data Analysis Boot Camp Visualizing the Iris Dataset Chuck Cartledge, PhD 20 January 2018 1/31 Table of contents (1 of 1) 1 Intro. 2 Histograms Background 3 Scatter plots 4 Box plots 5 Outliers

More information

SELECTION OF A MULTIVARIATE CALIBRATION METHOD

SELECTION OF A MULTIVARIATE CALIBRATION METHOD SELECTION OF A MULTIVARIATE CALIBRATION METHOD 0. Aim of this document Different types of multivariate calibration methods are available. The aim of this document is to help the user select the proper

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Final Report for cs229: Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Abstract. The goal of this work is to use machine learning to understand

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

Regression III: Advanced Methods

Regression III: Advanced Methods Lecture 3: Distributions Regression III: Advanced Methods William G. Jacoby Michigan State University Goals of the lecture Examine data in graphical form Graphs for looking at univariate distributions

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years.

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years. 3: Summary Statistics Notation Consider these 10 ages (in years): 1 4 5 11 30 50 8 7 4 5 The symbol n represents the sample size (n = 10). The capital letter X denotes the variable. x i represents the

More information

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable. 5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Hierarchical Clustering and Outlier Detection Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. Admin Assignment 2 is due

More information

Dimension Reduction CS534

Dimension Reduction CS534 Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of

More information

On the Impact of Outliers on High-Dimensional Data Analysis Methods for Face Recognition

On the Impact of Outliers on High-Dimensional Data Analysis Methods for Face Recognition On the Impact of Outliers on High-Dimensional Data Analysis Methods for Face Recognition ABSTRACT Sid-Ahmed Berrani France Telecom R&D TECH/IRIS 4, rue du Clos Courtel BP 91226 35512 Cesson Sévigné Cedex,

More information

Variable Selection 6.783, Biomedical Decision Support

Variable Selection 6.783, Biomedical Decision Support 6.783, Biomedical Decision Support (lrosasco@mit.edu) Department of Brain and Cognitive Science- MIT November 2, 2009 About this class Why selecting variables Approaches to variable selection Sparsity-based

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. 1 CHAPTER 1 Introduction Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. Variable: Any characteristic of a person or thing that can be expressed

More information

Detection of Anomalies using Online Oversampling PCA

Detection of Anomalies using Online Oversampling PCA Detection of Anomalies using Online Oversampling PCA Miss Supriya A. Bagane, Prof. Sonali Patil Abstract Anomaly detection is the process of identifying unexpected behavior and it is an important research

More information

Visual Representations for Machine Learning

Visual Representations for Machine Learning Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information