OUTLIER DETECTION. Short Course Session 1. Nedret BILLOR Auburn University Department of Mathematics & Statistics, USA


1 OUTLIER DETECTION Short Course Session 1. Nedret BILLOR, Auburn University, Department of Mathematics & Statistics, USA. Statistics Conference, Colombia, Aug 8-12, 2016

2 OUTLINE Motivation and Introduction; Approaches to Outlier Detection; Sensitivity of Statistical Methods to Outliers; Statistical Methods for Outlier Detection; Outliers in Univariate Data; Outliers in Multivariate Data; Classical and Robust Statistical Distance-based Methods; PCA-based Outlier Detection; Outliers in Functional Data

3 MOTIVATION & INTRODUCTION Hadlum vs. Hadlum (1949) [Barnett 1978] Ozone Hole

4

5 Case I: Hadlum vs. Hadlum (1949) [Barnett 1978] The birth of a child to Mrs. Hadlum happened 349 days after Mr. Hadlum left for military service. Average human gestation period is 280 days (40 weeks). Statistically, 349 days is an outlier.

6 Case I: Hadlum vs. Hadlum (1949) [Barnett 1978] blue: statistical basis (13,634 observations of gestation periods); green: assumed underlying Gaussian process. The probability that the birth of Mrs. Hadlum's child was generated by this process is very low. red: Mr. Hadlum's assumption (another Gaussian process is responsible for the observed birth). Under this assumption the observed gestation period has average duration and the highest possible probability.

7 Case II: The Antarctic Ozone Hole The History behind the Ozone Hole The Earth's ozone layer protects all life from the sun's harmful radiation.

8 Case II: The Antarctic Ozone Hole (cont.) Human activities (e.g. CFCs in aerosols) have damaged this shield. Less protection from ultraviolet light will, over time, lead to higher rates of skin cancer and cataracts, and to crop damage.

9 Case II: The Antarctic Ozone Hole (cont.) Molina and Rowland in 1974 (lab study), and many studies after it, demonstrated the ability of CFCs (chlorofluorocarbons) to break down ozone in the presence of high-frequency UV light. Further studies estimated the ozone layer would be depleted by CFCs by about 7% within 60 years.

10 Case II: The Antarctic Ozone Hole (cont.) The shock came in a 1985 field study by Farman, Gardiner and Shanklin (Nature, May 1985): British Antarctic Survey measurements showed that ozone levels had dropped to 10% below normal January levels for Antarctica.

11 Case II: The Antarctic Ozone Hole (cont.) The authors had been somewhat hesitant about publishing this result because Nimbus 7 satellite data had shown NO such DROP during the Antarctic spring! More comprehensive observations from satellite instruments looking down had shown nothing unusual!

12 Case II: The Antarctic Ozone Hole (cont.) But NASA soon discovered that the springtime "ozone hole" had been covered up by a computer program designed to discard sudden, large drops in ozone concentrations as "errors". The Nimbus 7 data were rerun without the filter program: evidence of the ozone hole was seen as far back as 1976.

13 One person's noise could be another person's signal!

14 What is an OUTLIER? No universally accepted definition! Hawkins (1980): An observation (or a few) that deviates (differs) so much from other observations as to arouse suspicion that it was generated by a different mechanism. Barnett and Lewis (1994): An observation (or a few) which appears to be inconsistent (different) with the remainder of that set of data.

15 What is an OUTLIER? Statistics-based intuition: normal data objects follow a generating mechanism, e.g. some given statistical process; abnormal objects deviate from this generating mechanism.

16 Applications of outlier detection Fraud detection Purchasing behavior of a credit card owner usually changes when the card is stolen Abnormal buying patterns can characterize credit card abuse

17 Applications of outlier detection (cont.) Medicine: unusual symptoms or test results may indicate potential health problems of a patient. Whether a particular test result is abnormal may depend on other characteristics of the patient (e.g. gender, age, ...).

18 Applications of outlier detection (cont.) Intrusion Detection Attacks on computer systems and computer networks

19 Applications of outlier detection (cont.) Sports statistics: in many sports, various parameters are recorded for players in order to evaluate the players' performances. Outstanding players (in a positive as well as a negative sense) may be identified as having abnormal parameter values. Sometimes, players show abnormal values only on a subset or a special combination of the recorded parameters.

20 Applications of outlier detection (cont.) Ecosystem Disturbance Hurricanes, floods, heatwaves, earthquakes

21 Applications of outlier detection (cont.) Detecting measurement errors Data derived from sensors (e.g. in a given scientific experiment) may contain measurement errors Abnormal values could provide an indication of a measurement error Removing such errors can be important in other data mining and data analysis tasks

22 What causes OUTLIERS? Data from different sources: such outliers are often of interest and are the focus of outlier detection in the field of data mining. Natural variation: outliers that represent extreme or unlikely variations are often interesting (correct but extreme responses; rare-event syndrome). Data measurement and collection error: the goal is to eliminate such anomalies, since they provide no interesting information and only reduce the quality of the data and of subsequent data analysis.

23 Difference between Noise and Outlier [Figure contrasting noise with an outlier.]

24 Approaches to OUTLIER detection

25 Approaches to OUTLIER detection: Statistical (or model-based) approaches; Proximity-based; Clustering-based; Classification-based: one-class and semi-supervised (i.e. combining classification-based and clustering-based methods). Reference: Data Mining: Concepts and Techniques, Han et al. 2012

26 Statistical (or model-based) approaches. Assume that the regular data follow some statistical model; outliers are the data not following the model. Example: first use a Gaussian distribution to model the regular data. For each object y in region R, estimate g_D(y), the probability that y fits the Gaussian distribution. If g_D(y) is very low, y is unlikely to have been generated by the Gaussian model and is thus an outlier. The effectiveness of statistical methods highly depends on whether the assumed statistical model holds in the real data. There are rich alternatives among statistical models, e.g. parametric vs. non-parametric.

27 Statistical Approaches. Parametric method: assumes that the normal data are generated by a parametric distribution with parameter θ. The probability density function f(x, θ) of the parametric distribution gives the probability that object x is generated by the distribution; the smaller this value, the more likely x is an outlier. Non-parametric method: does not assume an a priori statistical model, but determines the model from the input data. Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance. Examples: histogram and kernel density estimation.

28 Proximity based. Outlier: an observation whose proximity to the other observations in the same data set deviates significantly from the proximity of most of the other observations to each other. The effectiveness highly relies on the proximity measure; in some applications, proximity or distance measures cannot be obtained easily. Two major types of proximity-based outlier detection: distance-based (outlier if its neighborhood does not contain enough other points) and density-based (outlier if its density is relatively much lower than that of its neighbors).

29 Clustering Based Methods. Normal data belong to large and dense clusters; outliers belong to small or sparse clusters, or to no cluster at all. There are many clustering methods, and therefore many clustering-based outlier detection methods! Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale well to large data sets.

30 Classification Based Methods. Idea: train a classification model that can distinguish normal data from outliers. A brute-force approach: use a training set that contains samples labeled normal and others labeled outlier. But the training set is typically heavily biased (the number of normal samples likely far exceeds the number of outlier samples), and such a model cannot detect unseen anomalies. Two types: one-class and semi-supervised.

31 Classification Based Methods (cont.) One-class model: a classifier is built to describe only the normal class. Learn the decision boundary of the normal class using classification methods such as SVM; any sample that does not belong to the normal class (is not within the decision boundary) is declared an outlier. Advantage: can detect new outliers that may not appear close to any outlier objects in the training set.

32 Classification Based Methods (cont.) Semi-supervised learning: combining classification-based and clustering-based methods. Method: using a clustering-based approach, find a large cluster C and a small cluster C_1. Since some observations in C carry the label normal, treat all observations in C as normal, and use the one-class model of this cluster to identify normal observations in outlier detection. Since some observations in cluster C_1 carry the label outlier, declare all observations in C_1 as outliers. Any observation that does not fall into the model for C (such as point a in the figure) is considered an outlier as well.

33 Sensitivity of Statistical Methods to Outliers

34 Sensitivity of Statistical Methods to Outliers Data often (always) contain outliers. Statistical methods are severely affected by outliers!

35 Sensitivity of Statistical Methods to Outliers on the sample mean (p=2)

36 Sensitivity of Statistical Methods to Outliers on Regression Analysis. [Scatterplots of Y vs X: (a) with the point (6,20); (b) without (6,20).]

37 Sensitivity of Statistical Methods to Outliers on Classification. Forest Soil data (n_1=11, n_2=23, n_3=24). [3D scatterplot of sod vs mag vs pot.] Misclassification error rates: G1: 91% (10 obs.), G2: 4% (9 obs.), G3: 75% (18 obs.); overall: 50%.

38 Statistical Methods for OUTLIER Detection for Univariate & Multivariate data

39 Statistical (or model-based) approaches. Assume that the regular data follow some statistical model; outliers are the data not following the model. Effectiveness highly depends on whether the assumed statistical model holds in the real data. There are rich alternatives among statistical models, e.g. parametric vs. non-parametric.

40 NOTATION X: data (gene expression levels) matrix of size n×p; n: number of patients; p: number of genes. p=1: univariate data; p>1: multivariate data. n>p: low dimensional; n<p: high dimensional. X = (x_ij), i = 1,...,n, j = 1,...,p, with the genes in columns and the n patients in rows: row i is (x_i1, ..., x_ip).

41 OUTLIERS in Univariate data. Standard Deviation (SD) METHOD. The simplest classical approach to screening outliers is the SD method. 2 SD method: mean ± 2 SD; 3 SD method: mean ± 3 SD, where the mean is the sample mean and SD is the sample standard deviation. Observations outside these intervals may be considered outliers!

42 OUTLIERS in Univariate data (cont.) Z-scores method (Grubbs' test, 1969): not robust! Z_i = (X_i - X̄)/s, i = 1, 2, ..., n. Robust Z-scores (Iglewicz and Hoaglin, 1993): Z_i = 0.675 (X_i - median(X)) / MAD(X), i = 1, 2, ..., n, where MAD = median{|X_i - median(X)|} and E(MAD) = 0.675 σ for large normal data. Rule: if |Z_i| > 3, the ith observation is an outlier!
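A minimal sketch of the robust Z-score rule above, in Python with NumPy (the course's own examples use R); the small data set and its planted outlier are hypothetical:

```python
import numpy as np

def robust_z_scores(x):
    # Robust Z-scores (Iglewicz and Hoaglin, 1993):
    # Z_i = 0.675 * (x_i - median) / MAD; flag |Z_i| > 3 as an outlier.
    # (The classical version would use the mean and s instead.)
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))       # median absolute deviation
    return 0.675 * (x - med) / mad

x = [10.0, 11.0, 9.0, 10.5, 9.5, 50.0]     # 50 is a planted outlier
z = robust_z_scores(x)
flags = np.abs(z) > 3                      # only the planted value is flagged
```

Because the median and MAD ignore the extreme value, the planted point gets a huge score while the clean points stay well below 3.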

43 Why Robust Statistics? Data often (always) contain outliers. Classical estimates, such as the sample mean, the sample variance, sample covariances and correlations, or the LS (least-squares) fit, are severely affected by outliers. Robust estimates provide a good fit to the bulk of the data, and this good fit helps to identify outlier(s) accurately! This lecture gives an overview of these concepts.

44 Issues: When is an observation outlying enough to be detected? Deleting good observations results in inaccurate estimation. So, are there any alternatives to the mean and standard deviation? One suggestion: the MEDIAN and the Median Absolute Deviation (MAD).

45 Mean and Standard Deviation. Let x = (x_1, x_2, ..., x_n) be a set of observed values. The sample mean and variance are x̄ = (1/n) Σ_{i=1}^n x_i and s² = (1/(n-1)) Σ_{i=1}^n (x_i - x̄)². Example: a dotplot of the data containing 24 determinations of the copper content in wholemeal flour (in parts per million), with a table of the mean and sd computed with and without the outlier.

46 Median: med(x), the middle order statistic. MAD: MAD(x) = median{|x_i - med(x)|}. To make the MAD comparable to the SD, use the normalized MAD, MADN(x) = MAD(x)/0.675. (If X ~ N(μ, σ²), then MADN(X) = σ.) [Table of mean/sd and median/MAD, with and without the outlier.]

47 Replace observation 24 with a range of values x and recalculate the location measures. [Plot of the mean and the median as functions of x.]

48 Replace observation 24 with a range of values x and recalculate the scale measures. [Plot of the sd and the MAD as functions of x.] Thus we can say that a single outlier has an unbounded influence on these two classical statistics (the mean and the sd), while the median and the MAD remain stable.
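The unbounded-influence claim is easy to check numerically. A small sketch (Python/NumPy, hypothetical data) appends one value to a clean sample and drags it further and further out:

```python
import numpy as np

clean = np.array([1.8, 2.1, 2.0, 1.9, 2.2, 2.0, 2.1, 1.9])  # hypothetical sample

def location_scale(y):
    # (mean, sd) are the classical statistics; (median, MAD) the robust ones.
    mad = np.median(np.abs(y - np.median(y)))
    return y.mean(), y.std(ddof=1), np.median(y), mad

# append one contaminating value and let it grow
stats = {bad: location_scale(np.append(clean, bad)) for bad in (2.0, 20.0, 200.0)}
```

As the contaminating value grows, the mean and sd grow without bound, while the median and MAD are essentially untouched.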

49 Example: The following is the Q-Q plot for a dataset containing 20 determinations of the time (in microseconds) needed for light to travel a distance of 7442 m. The SD rule (i.e. the three-sigma edit rule) fails to identify one of the two observations flagged by the Q-Q plot: observation -2 has z = -1.35 and observation -44 has z = -3.73. Replacing the sample mean and sd in the rule by the median and MAD gives robust scores of -4.64 and -11.73 respectively (the values in parentheses in the plot), so both outliers are identified. The reason -2 has such a small z_i is that both observations pull the sample mean x̄ to the left and inflate s; it is said that the value -44 masks the value -2.

50 OUTLIERS in Univariate data (cont.) BOXPLOT (Tukey's method, 1977). A value between the inner fences [Q1 - 1.5·IQR, Q3 + 1.5·IQR] and the outer fences [Q1 - 3·IQR, Q3 + 3·IQR] is a possible outlier. An extreme value beyond the outer fences is a probable outlier.
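Tukey's fences follow directly from the quartiles; a short Python/NumPy sketch (the small integer sample is hypothetical):

```python
import numpy as np

def tukey_fences(x):
    # Inner fences flag possible outliers, outer fences probable outliers.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

inner, outer = tukey_fences([1, 2, 3, 4, 5, 6, 7, 8, 9])
```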

51 OUTLIERS in Univariate data (cont.) Adjusted BOXPLOT (Vanderviere and Huber, 2004). Tukey's method is based on robust measures such as the lower and upper quartiles and the IQR, without considering the skewness of the data. Vanderviere and Huber (2004) introduced an adjusted boxplot taking into account the medcouple (MC), a robust measure of skewness for a skewed distribution: MC(x_1, ..., x_n) = med_{x_i ≤ med_k x_k ≤ x_j} [ ((x_j - med_k x_k) - (med_k x_k - x_i)) / (x_j - x_i) ], where i and j have to satisfy x_i ≤ med_k x_k ≤ x_j and x_i ≠ x_j. The interval of the adjusted boxplot is as follows (G. Bray et al. (2005)): [L, U] = [Q1 - 1.5·exp(-3.5·MC)·IQR, Q3 + 1.5·exp(4·MC)·IQR] if MC ≥ 0; [L, U] = [Q1 - 1.5·exp(-4·MC)·IQR, Q3 + 1.5·exp(3.5·MC)·IQR] if MC < 0.
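The medcouple and the adjusted fences can be sketched as below (Python/NumPy). This is a naive O(n²) implementation assuming no ties at the median; the fence constants follow the formula quoted above from G. Bray et al. (2005):

```python
import numpy as np

def medcouple(x):
    # Naive O(n^2) medcouple: median of the kernel
    # h(x_i, x_j) = ((x_j - med) - (med - x_i)) / (x_j - x_i)
    # over pairs x_i <= med <= x_j, x_i != x_j (no ties at the median assumed).
    x = np.sort(np.asarray(x, dtype=float))
    med = np.median(x)
    lo, hi = x[x <= med], x[x >= med]
    h = [((xj - med) - (med - xi)) / (xj - xi)
         for xi in lo for xj in hi if xj != xi]
    return np.median(h)

def adjusted_fences(x):
    # Skewness-adjusted boxplot fences, with the constants quoted in the slide.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mc = medcouple(x)
    if mc >= 0:
        return q1 - 1.5 * np.exp(-3.5 * mc) * iqr, q3 + 1.5 * np.exp(4.0 * mc) * iqr
    return q1 - 1.5 * np.exp(-4.0 * mc) * iqr, q3 + 1.5 * np.exp(3.5 * mc) * iqr
```

For symmetric data MC = 0 and the fences reduce to Tukey's; for right-skewed data the upper fence is pushed out, so fewer large values are wrongly flagged.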

52 R Package robustbase ( 2011) Adjusted BOXPLOT (Vanderviere and Huber, 2004)

53 OUTLIERS in Univariate data (cont.) The MADe method uses the median and the Median Absolute Deviation (MADe). 2 MADe method: median ± 2 MADe; 3 MADe method: median ± 3 MADe, where MADe = 1.483·MAD for large normal data. The MAD is an estimator of the spread in the data, similar to the standard deviation, but with an approximately 50% breakdown point, like the median, where MAD = median{|x_i - median(x)|, i = 1, 2, ..., n}.
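The MADe interval is a drop-in robust replacement for the mean ± k·SD rule; a short sketch (Python/NumPy, hypothetical data):

```python
import numpy as np

def made_interval(x, k=3):
    # k-MADe interval: median +/- k * MADe, with MADe = 1.483 * MAD
    # (the 1.483 factor makes the MAD consistent with the SD under normality).
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    made = 1.483 * np.median(np.abs(x - med))
    return med - k * made, med + k * made

lo, hi = made_interval([10.0, 11.0, 9.0, 10.5, 9.5, 50.0])  # 50 is planted
```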

54 [Dot plot and histogram of the variable C, with the outlier marked.]

55 Outliers in Multivariate Data

56 Outliers in Multivariate Data. Multivariate data: a data set involving two or more variables (the n×p matrix X of the notation slide, with genes in columns and patients in rows). Idea: transform the multivariate outlier detection task into a univariate outlier detection problem.

57 OUTLIERS in Multivariate Data Visual tools Scatter plots and 3D scatter plots Higher dimensions???

58 OUTLIERS in Multivariate Data. Chernoff faces (Chernoff, 1973; Flury and Riedwyl, 1988).

59 OUTLIERS in Multivariate Data. Andrews curves: coding and representing multivariate data by curves (Andrews, 1972).

60 Classical and Robust Statistical Distance based Methods

61 Statistical distance-based methods (n>p). Method: detect outliers by computing a measure of how far a particular point is from the center of the data. The usual measure of outlyingness for a data point is the Mahalanobis (1936) distance: D_i = sqrt((x_i - x̄)' S⁻¹ (x_i - x̄)), i = 1, 2, ..., n. Use Grubbs' test (the maximum normed residual test, another statistical method under the normal distribution) on this measure to detect outliers.

62 Statistical distance-based methods. The usual measure of outlyingness for a data point is the Mahalanobis (1936) distance: D_i = sqrt((x_i - x̄)' S⁻¹ (x_i - x̄)), i = 1, 2, ..., n. However, x̄ and S are themselves sensitive to outliers: a robust version of this method is needed!
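The classical distance is straightforward to compute; a sketch (Python/NumPy, hypothetical 2-D data with one planted outlier). A single outlier is still caught here; it is clusters of outliers, which distort x̄ and S together, that motivate the robust versions that follow:

```python
import numpy as np

def mahalanobis_distances(X):
    # D_i = sqrt((x_i - xbar)' S^{-1} (x_i - xbar)), classical mean/covariance.
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    Sinv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, Sinv, diff))

X = np.array([[0, 0], [1, 1], [1, 0], [0, 1], [2, 1],
              [1, 2], [2, 2], [0, 2], [2, 0], [10, 10]], dtype=float)
d = mahalanobis_distances(X)               # the last point stands out
```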

63 Robust Statistical Distance-based Methods. Two phases for outlier detection methods (Rocke and Woodruff, 1996): obtain robust estimates of location T and scatter C, then calculate the robust Mahalanobis-type distance RD_i = sqrt((x_i - T)' C⁻¹ (x_i - T)), i = 1, ..., n. Outlier boundary: determine a separation boundary Q; if RD_i > Q, the ith observation is declared an outlier.

64 Robust Statistical Distance-based Methods. MVE (minimum volume ellipsoid); MCD (minimum covariance determinant); FAST-MCD. Rousseeuw & van Zomeren (1990), Rousseeuw & Van Driessen (1999). R package robustbase (covMcd()). project.org/web/packages/robustbase/vignettes/fastmcdkmini.pdf project.org/web/packages/robustbase/robustbase.pdf

65 MCD Algorithm. Determine the ellipsoid containing h ≈ [(n+p+1)/2] points with minimum covariance determinant. RD_i = sqrt((x_i - x̄_MCD)' S_MCD⁻¹ (x_i - x̄_MCD)), i = 1, ..., n. The exact solution requires a combinatorial search; approximations are used in practice. Limitation: very slow for large p!
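The concentration ("C") step at the heart of the FAST-MCD approximation can be sketched as below (Python/NumPy). This is an illustrative toy, not the published FAST-MCD of Rousseeuw & Van Driessen (no consistency correction, no nested subsampling); in practice one would use robustbase's covMcd() in R. The planted-outlier data are hypothetical:

```python
import numpy as np

def mcd_cstep(X, n_starts=20, n_csteps=10, seed=0):
    # From random elemental starts, repeatedly keep the h points closest in
    # Mahalanobis distance to the current subset's fit (the C-step), and keep
    # the subset whose covariance has the smallest determinant.
    n, p = X.shape
    h = (n + p + 1) // 2
    rng = np.random.default_rng(seed)
    best_det, best_idx = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=p + 1, replace=False)   # elemental start
        for _ in range(n_csteps):
            mu = X[idx].mean(axis=0)
            Sinv = np.linalg.pinv(np.cov(X[idx], rowvar=False))
            d2 = np.einsum('ij,jk,ik->i', X - mu, Sinv, X - mu)
            idx = np.argsort(d2)[:h]                     # C-step: keep h closest
        det = np.linalg.det(np.cov(X[idx], rowvar=False))
        if det < best_det:
            best_det, best_idx = det, idx
    return X[best_idx].mean(axis=0), np.cov(X[best_idx], rowvar=False)

rng = np.random.default_rng(1)
X = rng.normal(size=(45, 2))
X[:5] += 10.0                                  # plant 5 clustered outliers
mu, S = mcd_cstep(X)
rd = np.sqrt(np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(S), X - mu))
```

Because the fit is computed on the h most concentrated points, the clustered outliers get very large robust distances instead of masking each other.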

66 An Application for MCD: Average brain and body weights for 28 species of land animals.

67 An application for MCD : Average brain and body weights for 28 species of land animals.

68 Whoever knows the ways of Nature will more easily notice her deviations; and, on the other hand, whoever knows her deviations will more accurately describe her ways. Francis Bacon (1620), Novum Organum II 29.

69 Robust Statistical Distance-based Methods. BACON (Blocked Adaptive Computationally Efficient Outlier Nominators) (Billor, Hadi and Velleman, 2000). R package robustX (depends on robustbase) ( project.org/web/packages/robustX/robustX.pdf)

70 Algorithm 1: General BACON Algorithm Step 1: Identify an initial basic subset of m>p observations that can safely be assumed free of outliers, where p is the dimension of the data and m is an integer chosen by the data analyst. Step 2: Fit an appropriate model to the basic subset, and from that model compute discrepancies for each of the observations. Step 3: Find a larger basic subset consisting of observations known (by their discrepancies) to be homogeneous with the basic subset. Generally, these are the observations with smallest discrepancies. This new basic subset may omit some of the previous basic subset observations, but it must be as large as the previous basic subset. Step 4: Iterate Steps 2 and 3 to refine the basic subset, using a stopping rule that determines when the basic subset can no longer grow safely. Step 5: Nominate the observations excluded by the final basic subset as outliers.

71 Algorithm 2: Initial Basic Subset in Multivariate Data. Input: an n×p data matrix X and a number, m, of observations to include in the initial basic subset. Output: an initial basic subset of at least m observations. Two versions: Version 1 (V1), initial subset selected based on Mahalanobis distances; Version 2 (V2), initial subset selected based on distances from the medians.

72 BACON Algorithm (Billor, Hadi and Velleman, 2000). Therefore two versions of BACON. One version is nearly affine equivariant, has a high breakdown point (upwards of 40%), and is computationally efficient even for very large datasets. The other version is affine equivariant, at the expense of a somewhat lower breakdown point (about 20%), but with the advantage of even lower computational cost for very large datasets.

73 BACON Algorithm (cont.) Step 1: Divide the observations according to a suitably chosen initial distance d_i into two subsets: a basic subset and a non-basic subset. Basic subset: a small subset, initially of size m = cp, containing the observations with the smallest values of d_i. Non-basic subset: the rest of the data.

74 BACON Algorithm (cont.) Step 2. Compute d_i(x̄_b, S_b) = sqrt((x_i - x̄_b)' S_b⁻¹ (x_i - x̄_b)), i = 1, ..., n, where x̄_b and S_b are the mean and covariance of the basic subset. Step 3. Form a new basic subset containing all observations with d_i(x̄_b, S_b) < c, where c is a critical value (a chi-squared table value). Step 4. Repeat Steps 2-3 until the size of the basic subset is n, or all observations in the non-basic subset have d_i(x̄_b, S_b) ≥ c; declare the observations in the non-basic subset (if any) as outliers.
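The steps above can be sketched as below (Python/NumPy/SciPy, version V1 start). The published algorithm also applies a small-sample correction factor to the chi-squared cutoff, omitted here for clarity; the planted-outlier data are hypothetical:

```python
import numpy as np
from scipy.stats import chi2

def _dist(X, mu, S):
    diff = X - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff))

def bacon(X, alpha=0.05, c=4):
    # Sketch of multivariate BACON, version V1 (Mahalanobis-distance start).
    n, p = X.shape
    d = _dist(X, X.mean(axis=0), np.cov(X, rowvar=False))
    basic = np.argsort(d)[:c * p]              # initial basic subset, m = c*p
    cutoff = np.sqrt(chi2.ppf(1 - alpha / n, p))
    for _ in range(100):                       # iterate until the subset is stable
        d = _dist(X, X[basic].mean(axis=0), np.cov(X[basic], rowvar=False))
        new = np.flatnonzero(d < cutoff)
        if np.array_equal(new, basic):
            break
        basic = new
    return np.setdiff1d(np.arange(n), basic)   # nominated outliers

rng = np.random.default_rng(2)
X = rng.normal(size=(45, 2))
X[:5] += 10.0                                  # plant 5 outliers
out = bacon(X)
```

Only a handful of fitting steps are needed in practice, which is what makes BACON cheap on very large data sets.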

75 An Application for BACON: Average brain and body weights for 28 species of land animals.

76 Advantages of BACON algorithm Outlier detection methods have suffered in the past from a lack of generality and a computational cost that escalated rapidly with the sample size. Samples of a size sufficient to support sophisticated methods rapidly grow too large for previously published outlier detection methods to be practical. The BACON algorithms given here reliably detect multiple outliers at a cost that can be as low as four repetitions of the underlying fitting method. They are thus practical for data sets of even millions of cases. The BACON algorithms balance between affine equivariance and robustness.

77 Outlier Detection Methods based on Principal Component Analysis (PCA)

78 What is PCA? Generally speaking, two objectives. Data (dimension) reduction: moving from many original variables down to a few composite variables (linear combinations of the original variables). Interpretation: which variables play a larger role in the explanation of total variance.

79 PCA: Principal Component Analysis. PCs are constructed by maximizing variance: u_i = argmax_{u'u=1} Var(u'X), subject to u_i' X'X u_j = 0 for 1 ≤ j < i. Z_i = u_i1 X_1 + ... + u_ip X_p = u_i' X, i = 1, 2, ..., k, is the ith PC score, or the ith principal component.
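The classical construction can be sketched as an eigen-decomposition of the sample covariance matrix (Python/NumPy, hypothetical data with three clearly different variances):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])  # hypothetical data
Xc = X - X.mean(axis=0)                       # center the data
evals, U = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]               # eigenvalues in decreasing order
evals, U = evals[order], U[:, order]          # columns of U are the u_i
Z = Xc @ U                                    # PC scores Z_i = u_i' x
```

By construction the scores are uncorrelated and their variances are exactly the eigenvalues, which is the maximal-variance property stated above.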

80 Geometric understanding of PCA for 3D point cloud 1st PC direction (maximizing variance of projections) explains the cloud best. 1st and 2nd directions form a plane.

81 [Image-compression example: a 257×235 image X reconstructed from its first k principal components, for k = 2, 10, 50, 100, 200.]

82 PCA & Outlier Detection Methods that use classical PCs to identify potential outliers & Methods for robustly estimating PCs that may also be used to detect potential outliers.

83 Types of Outliers in PC space: regular observations, orthogonal outliers, good leverage points, and bad leverage points.

84 PCA based Outlier Detection. Diagnostic plot: to detect the type of observations in high dimensional data, plot the orthogonal distance OD_i = ||x_i - μ̂ - P z_i|| against the score distance SD_i = sqrt(Σ_{j=1}^k z_ij² / l_j), i = 1, 2, ..., n, where z_i is the ith row of the score matrix T_{n×k} = (X_{n×p} - 1_n μ̂') P_{p×k}, the l_j (j = 1, 2, ..., k) are the eigenvalues of the variance-covariance matrix, and P is the matrix of eigenvectors corresponding to these eigenvalues.
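The two distances can be sketched as below (Python/NumPy, classical PCA for simplicity; a robust variant would replace the mean and covariance by robust estimates). The 3-D data set and the two planted special points are hypothetical:

```python
import numpy as np

def pca_diagnostics(X, k):
    # Score distance SD_i = sqrt(sum_j z_ij^2 / l_j) and orthogonal distance
    # OD_i = ||x_i - mu - P z_i|| for a k-dimensional PC subspace.
    mu = X.mean(axis=0)
    Xc = X - mu
    evals, P = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(evals)[::-1][:k]
    l, P = evals[order], P[:, order]
    Z = Xc @ P                                  # scores
    sd = np.sqrt((Z ** 2 / l).sum(axis=1))
    od = np.linalg.norm(Xc - Z @ P.T, axis=1)
    return sd, od

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([3.0, 1.0, 0.1])  # lies near a plane
X[0] = [0.0, 0.0, 5.0]      # orthogonal outlier: off the PC plane
X[1] = [20.0, 0.0, 0.0]     # good leverage point: far out within the plane
sd, od = pca_diagnostics(X, k=2)
```

The orthogonal outlier dominates OD and the leverage point dominates SD, which is exactly how the diagnostic plot separates the outlier types.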

85 Diagnostic plot. [Schematic: OD_i on the vertical axis against the score distance on the horizontal axis, with regions for regular observations, good leverage points, orthogonal outliers, and bad leverage points.]

86 Sensitivity of PCA to Outliers. Classical PCA (CPCA) would mislead the analyst in the presence of outliers, since the covariance (or correlation) matrix is sensitive to outliers! [Two-dimensional example showing how outliers affect PCA.]

87 Robust PCA based Outlier Detection. A) Replace the classical covariance matrix by a robust covariance estimator. M-estimator: Maronna (1976), Campbell (1980). MCD method: Rousseeuw & Van Driessen (1999). S-estimator: Davies (1987), Rousseeuw and Leroy (1987). PROBLEM: valid ONLY for p<n (and not very large p); NOT suitable for high dimensional data sets (p>n).

88 Robust PCA. B) Approaches based on projection pursuit (PP). Li and Chen (1985) (high computational cost!); Croux and Ruiz-Gazen (2000) (numerical inaccuracy!); Hubert et al. (2002) (searches for the direction on which the projected observations have the largest robust scale, then removes this dimension and repeats). Suitable for data sets with large p and/or large n! Problem: numerical inaccuracy and high computational cost!

89 Robust PCA. C) Based on a combination of the ideas of PP and robust covariance estimation: i) based on FAST-MCD (Hubert et al., 2005) (ROBPCA); ii) based on BACON (blocked adaptive computationally efficient outlier nominators; Billor et al., 2000) (RBPCA).

90 RBPCA: Steps in this algorithm. Case 1: n>p. Step 1. Use the singular value decomposition (SVD) on the centered data matrix, X_c = X - 1μ̂' = U D P' (an affine transformation of the data); score matrix T = X_c P. Step 2. Determine the mean (μ̂_B) and the variance-covariance matrix (Σ̂_B) from the clean observations obtained by BACON. Step 3. Find the robust PCs of the BACON-based covariance matrix, and determine the number of PCs, k. Step 4. The new robust score matrix is T* = (X - 1μ̂_B') P*.

91 RBPCA: Steps in this algorithm. Case 2: n<p. Step 1. Use the SVD on the centered data matrix, X_c = X - 1μ̂' = U D P' (an affine transformation of the data); score matrix T = X_c P.

92 RBPCA: Steps in this algorithm. Step 2. Find a clean set of observations (say h = (n+p+1)/2) from T (n<p) by using the outlyingness measure: project the high dimensional data points onto many univariate directions v.

93 Outlyingness measure. For every direction v, find the robust center μ̂_R and the robust standard deviation σ̂_R of the projected data x_j'v (j = 1, 2, ..., n). Find out which points are outlying on the projection vector by the outlyingness measure: outl_i = max_v |x_i'v - μ̂_R| / σ̂_R, i = 1, ..., n. Determine h clean observations (n/2 < h < n; e.g. h = 0.75n, or h = (n+p+1)/2).
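The measure above maximizes over all directions; a practical sketch approximates it with many random unit directions, using the median and MAD as the robust center and scale (Python/NumPy, hypothetical data; a Stahel-Donoho-style approximation, not the exact search used by ROBPCA/RBPCA):

```python
import numpy as np

def projection_outlyingness(X, n_dirs=500, seed=0):
    # outl_i ~= max over random unit directions v of
    #           |x_i'v - median| / (1.4826 * MAD) of the projected sample.
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(X.shape[1], n_dirs))
    V /= np.linalg.norm(V, axis=0)            # unit directions
    Y = X @ V                                 # n x n_dirs projections
    med = np.median(Y, axis=0)
    mad = np.median(np.abs(Y - med), axis=0)
    return np.max(np.abs(Y - med) / (1.4826 * mad), axis=1)

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
X[0] += 8.0                                   # planted multivariate outlier
outl = projection_outlyingness(X)
```

The h observations with the smallest outlyingness form the clean set used in the next RBPCA step.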

94 Projecting points onto different projection vectors (dashed lines): on one projection B is outlying but A and C are not; on another projection B is not outlying.

95 RBPCA: Steps in this algorithm (cont.) Apply PCA to the data matrix of the h clean observations (T_{h×p}) and obtain the score matrix T_1 = (T - 1μ̂') P_1. Find the variance-covariance matrix of T_1. Determine the number of PCs, k ≤ p, such that k < n. Step 3. Estimate the mean vector and scatter matrix of the PC scores matrix T_1 using the BACON algorithm. Step 4. Obtain the robust PCs from the robust variance-covariance matrix in Step 3.

96 Issues for High Dimensional Data. Even for fast algorithms the computation times increase linearly with n but cubically with p! None of these methods work quite as well when the dimensionality is high!

97 PCOUT ALGORITHM (Filzmoser et al., 2008). A recent outlier identification algorithm, effective in high dimensions. Based on the robust distances obtained from semi-robust principal components of robustly sphered data. Separate weights for location and scatter outliers are computed from these distances; the combined weights are used for outlier identification. (See R: pcout)

98 PCOUT Algorithm (Filzmoser et al., 2008) Step 1. Robustly sphere the data: x*_ij = (x_ij - med(x_1j, ..., x_nj)) / mad(x_1j, ..., x_nj), j = 1, 2, ..., p. Step 2. Compute PCs based on the sample covariance matrix of the transformed data, retaining only those p* PCs that contribute at least 99% of the total variance. The rescaled scores z*_ij = (z_ij - med(z_1j, ..., z_nj)) / mad(z_1j, ..., z_nj) are used in the two further phases: finding location outliers and finding scatter outliers.

99 PCOUT Algorithm (Filzmoser et al., 2008) (cont.) Calculate Mahalanobis distances d^L (weighted based on robust kurtosis) and d^S (unweighted). Calculate location and scatter weights from the translated biweight function: w(d_i; c, M) = 1 if d_i < M; (1 - ((d_i - M)/(c - M))²)² if M ≤ d_i ≤ c; 0 if d_i > c. For location: M is the 1/3 quantile of d^L and c = median(d^L) + 2.5 mad(d^L). For scale: M is ..., c = ....

100 Final weights: w_i = (w_i^L + c)(w_i^S + c) / (1 + c)², with c = 0.25. If w_i < 0.25, the ith observation is an outlier.

101 PCOUT: Leukemia Data (72×7129). We will try to identify multivariate outliers among the 7129 genes, without using the information on the two leukemia types ALL and AML. The outlying genes will then be used for differentiating between the cases.

102 Outliers in Functional Data

103 Introduction. FDA: a collection of statistical methods for analyzing curves or functional data. In standard statistical analysis, the focus is on sets of data vectors (univariate, multivariate); in FDA, the focus is on data structures such as curves, shapes, images, or sets of functional observations.

104 What are Functional Data about? Figure: The change in temperature over the course of a year, taken from thirty five weather stations across Canada (Ramsay and Silverman, 2001). Atlantic stations in red, Continental in blue, Pacific in green, and Arctic in black.

105 What Questions can we ask of the functional data? Statisticians: How can I represent the temperature pattern of a Canadian city over the entire year instead of just looking at the twelve discrete points? Should I just "connect the dots," or is there a better way to do this? Do the summary statistics "mean" and "covariance" have any meaning when I'm dealing with curves? How can I determine the primary modes of variation in the data? How many typical modes can summarize these thirty five curves? Do these curves exhibit strictly sinusoidal behavior? Can I create an analysis of variance (ANOVA) or linear model with the curves as the response and the climate as the main effect?

106 Outliers in Functional Data Problem: If we have some curves that behave differently from the rest (i.e. outliers in functional form), what happens to the FDA techniques???

107 Outliers in Functional Data. The study of outlier detection in this setting started only recently, and was mostly limited to univariate curves, i.e. p = 1. Febrero-Bande et al. (2008) identified two reasons why outliers can be present in functional data: 1. gross errors caused by errors in measurement and recording, or typing mistakes, which should be identified and corrected if possible; 2. correctly observed data curves that are suspicious or surprising in the sense that they do not follow the same pattern as the majority of the curves.

108 Methods Functional Depth based Functional PCA based Functional Boxplot

109 Functional Depth based

110 What is Data Depth? A notion of data depth for non-parametric multivariate data analysis. It provides center-outward orderings of points in Euclidean space of any dimension and leads to a new non-parametric multivariate statistical analysis in which no distributional assumptions are needed. A data depth measures how deep (or central) a given point x in R^p is relative to F, a probability distribution in R^p (assuming {X_1, ..., X_n} is a random sample from F), or relative to a given data cloud. [Figure: iso-depth curves and the deepest point.]

111 Consider x 1 (t), x 2 (t),..., x n (t) : n functions defined on an interval I. Question: Which one is the deepest function?

112 Depth for functional data: used to measure the centrality of a curve with respect to a set of curves, e.g. to define the deepest function. The functional depth provides a center-outward ordering of a sample of curves, so order statistics can be defined. The idea of the deepest point of a set of data allows classifying a new observation by its distance to each class's deepest point.

113 Many Data Depth Functions in MV: 1. Mahalanobis depth (Mahalanobis, 1936); 2. half-space depth (Hodges, 1955; Tukey, 1975); 3. Oja depth (Oja, 1983); 4. simplicial depth (Liu, 1990); 5. majority depth (Singh, 1991); 6. projection depth (Zuo, 2003). Example: MD(y, F) = [1 + (y - μ_F)' Σ_F⁻¹ (y - μ_F)]⁻¹.

114 The Fraiman and Muniz Depth (FMD). A functional data depth (Fraiman and Muniz (2001)). Let F_{n,t}(x_i(t)) be the empirical cdf of the values of the curves x_1(t), ..., x_n(t) at a given time point t ∈ [a, b]: F_{n,t}(x_i(t)) = (1/n) Σ_{k=1}^n I(x_k(t) ≤ x_i(t)). The Fraiman and Muniz functional depth, hereafter FMD, of a curve x_i with respect to the set x_1, ..., x_n is FMD_n(x_i) = ∫_a^b D_n(x_i(t)) dt, where D_n(x_i(t)) = 1 - |1/2 - F_{n,t}(x_i(t))| is the univariate depth of the point x_i(t).
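For curves sampled on a common grid, the FMD can be sketched in a few lines (Python/NumPy; the integral is approximated by the mean over the grid, and the simulated curves are hypothetical):

```python
import numpy as np

def fm_depth(curves):
    # Fraiman-Muniz depth for an n x T array of curves on a common grid:
    # at each grid point, F_{n,t}(x_i(t)) = rank/n (no ties assumed),
    # D = 1 - |1/2 - F|, and the depth is the average of D over the grid.
    n, T = curves.shape
    ranks = np.argsort(np.argsort(curves, axis=0), axis=0) + 1
    F = ranks / n
    D = 1.0 - np.abs(0.5 - F)
    return D.mean(axis=1)

rng = np.random.default_rng(4)
curves = rng.normal(size=(15, 25))
curves[0] += 10.0                 # an outlying curve, far above the rest
depth = fm_depth(curves)          # the outlying curve gets the minimal depth
```

A curve that sits above (or below) all the others at every t attains the minimal possible depth 0.5, which is exactly the center-outward ordering the depth-based detection procedure exploits.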

115 Functional Depth based Method. Febrero-Bande et al. (2008) proposed the following outlier detection procedure for univariate functional data (i.e. p = 1). Step 1. For each curve, calculate its functional depth (several versions exist). Step 2. Delete observations with depth below a cutoff C. Step 3. Go back to Step 1 with the reduced sample, and repeat until no outliers are found. Step 3 was added in the hope of avoiding masking effects. The cutoff value C is obtained by a bootstrap procedure.

116 Example: Outlier Curves

117

118 Pros & Cons Pros: fast computation; nonparametric method. Cons: results depend on the chosen depth; the choice of the cutoff C is complicated.

119 Functional PCA based

120 Functional Principal Components Analysis Similar to multivariate Principal Components Analysis (PCA), but instead of data vectors we use curves x_i(t); summations become integrals, e.g., the PC score of curve x_i corresponding to the eigenfunction phi_k is z_ik = integral of x_i(t) phi_k(t) dt. Finding the PCs is equivalent to finding the eigenfunctions/eigenvalues of the covariance operator G(s, t), that is, solving integral of G(s, t) phi(t) dt = lambda phi(s), where G(s, t) = Cov(x(s), x(t)).

121 Functional PCA using basis expansion Select an appropriate orthogonal basis of L^2, Phi_k(t), k = 1, 2,... Assume y_ij = x_i(t_j) + eps_ij, where eps_ij is a measurement error. Each curve can then be represented as x_i(t) = sum_k c_ik Phi_k(t), where the number of basis functions K is selected via, for example, Generalized Cross-Validation, and the coefficients can be estimated using least squares. Obtain the matrix C = (c_ik); outliers in C will be the functional outliers, so apply robust multivariate PCA to C.
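Under the model y_ij = x_i(t_j) + eps_ij, the coefficients c_ik can be obtained by ordinary least squares against the basis evaluated on the grid. A minimal sketch with a truncated Fourier basis on [0, 1] (the basis choice and function names are illustrative):

```python
import numpy as np

def fourier_design(t, K):
    """Design matrix for a truncated Fourier basis on [0, 1]:
    a constant column plus K sine/cosine pairs evaluated at the points t."""
    cols = [np.ones_like(t)]
    for k in range(1, K + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)            # shape (T, 2K + 1)

def basis_coefficients(Y, t, K):
    """Least-squares coefficients c_ik for each observed curve (row of Y),
    under the model y_ij = x_i(t_j) + eps_ij with x_i in the basis span."""
    Phi = fourier_design(t, K)
    C, *_ = np.linalg.lstsq(Phi, Y.T, rcond=None)   # solve Phi C = Y^T
    return C.T                                      # (n_curves, 2K + 1)
```

The resulting coefficient matrix C is exactly the object the slide feeds into robust multivariate PCA: each row summarizes one curve, so multivariate outliers among the rows correspond to functional outliers among the curves.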

122 Robust PCA Examples Spherical PCA projects each centered observation onto the unit sphere: x_i -> (x_i - mu_hat)/||x_i - mu_hat||, where mu_hat is a robust estimator of the location parameter. ROBPCA uses the Minimum Covariance Determinant (MCD) for low-dimensional data (n > p), or a combination of Projection Pursuit and MCD for high-dimensional data (n < p). BACON PCA is based on the Blocked Adaptive Computationally Efficient Outlier Nominators (BACON) algorithm.
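Spherical PCA is simple to sketch: center robustly, project each observation onto the unit sphere, then run classical PCA on the projected points. In this sketch the coordinatewise median stands in for the robust location estimator (the original proposal uses the spatial median), and all names are illustrative:

```python
import numpy as np

def spherical_pca(X, center=None):
    """Spherical PCA sketch: PCA on observations projected to the unit sphere.

    Sphering caps each point's influence at unit norm, so a gross outlier
    can no longer dominate the estimated principal directions.
    """
    X = np.asarray(X, dtype=float)
    if center is None:
        center = np.median(X, axis=0)       # stand-in robust location
    D = X - center
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # leave points at the center alone
    U = D / norms                           # projections onto the unit sphere
    vals, vecs = np.linalg.eigh(np.cov(U, rowvar=False))
    order = np.argsort(vals)[::-1]          # eigenvectors, largest variance first
    return vecs[:, order], center
```

Even with one gross outlier far off-axis, the leading direction stays aligned with the bulk of the data, which is the point of the sphering step.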

123 Diagnostic Plots for Robust PCA Score distance: SD_i = sqrt( sum_{j=1}^{k} z_ij^2 / lambda_j ), i = 1,..., n, where z_ij are the PC scores and lambda_j are the eigenvalues. Orthogonal distance: OD_i = ||x_i - mu_hat - P z_i||, where mu_hat is the robust center estimator and P is the matrix of eigenvectors (loadings).
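Both diagnostic distances can be computed directly from a fitted k-component PCA (robust or classical); a sketch with the loadings P, eigenvalues lambda_j and center passed in (names are illustrative):

```python
import numpy as np

def score_and_orthogonal_distances(X, center, loadings, eigvals):
    """Score distance SD_i = sqrt(sum_j z_ij^2 / lambda_j) and orthogonal
    distance OD_i = ||x_i - center - P z_i|| for a k-component PCA,
    where P (p x k) holds the loadings and z_i = P^T (x_i - center)."""
    Xc = np.asarray(X, dtype=float) - center
    Z = Xc @ loadings                        # PC scores z_ij, shape (n, k)
    sd = np.sqrt(((Z ** 2) / eigvals).sum(axis=1))
    resid = Xc - Z @ loadings.T              # part lying outside the PC subspace
    od = np.linalg.norm(resid, axis=1)
    return sd, od
```

SD measures how far a point sits within the PC subspace (in units of each component's spread), while OD measures how far it sits from that subspace; plotting OD against SD gives the diagnostic (outlier) map of the next slide.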

124 Diagnostic Plots for Robust PCA In the resulting outlier map: SD_i large and OD_i small (points 1, 4): good leverage points; SD_i small and OD_i large (point 5): orthogonal outliers; SD_i large and OD_i large (points 2, 3): bad leverage points.

125 Poblenou NOx Data (Whole Dataset) NOx emissions in Poblenou, Barcelona (Spain) over 115 days; hourly measurements of NOx were made from 23 February 2005 to 26 June 2005. Sawant et al. (2012) identified 5 outlying curves using a robust functional PCA approach: 03/09, 03/11, 03/18, 04/29 and 05/02, all working days. The outliers fell on days leading into a long weekend or vacation period, and hence days with increased traffic flow.

126 (Figures: outliers flagged by CFPCA, RFPCA-MCD and RFPCA-BACON.)

127

128

129 Conclusion After detecting the outliers, we checked for the sources of the abnormal values of these curves. The days detected as outliers were weekends or related to short vacation periods around weekends, so the abnormal observations on those specific days can be attributed to an increase in traffic due to small vacation periods. We also detected an outlier on Wednesday, 9 March. The observation on 10 March has missing data and thus was not included in the analysis, so we could not pinpoint the reason behind the abnormal observation on 9 March.

130 Functional Boxplot

131 Example (Functional Boxplot) Data from monthly sea surface temperatures over the East Central tropical Pacific Ocean.

132 Functional Box Plot Extends the univariate box plot using data depth. The functional boxplot of sea surface temperature (SST) shows: 1. blue curves denoting the envelopes, 2. a black curve representing the median curve, and 3. red dashed curves marking the outlier candidates detected by the 1.5 times the 50% central region rule.
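The 50% central region underlying the functional boxplot can be sketched from the modified band depth (MBD) with bands formed by pairs of curves; function names are illustrative:

```python
import numpy as np
from itertools import combinations

def modified_band_depth(curves):
    """Modified band depth (MBD) with two-curve bands: for each curve, the
    average (over all pairs j < k) proportion of time it lies inside the
    pointwise band [min, max] of curves j and k."""
    X = np.asarray(curves, dtype=float)
    n = X.shape[0]
    depth = np.zeros(n)
    for j, k in combinations(range(n), 2):
        lo = np.minimum(X[j], X[k])
        hi = np.maximum(X[j], X[k])
        depth += ((X >= lo) & (X <= hi)).mean(axis=1)
    return depth / (n * (n - 1) / 2)

def central_region(curves, depth, p=0.5):
    """Pointwise envelope of the p*100% deepest curves, e.g. the 50% central
    region whose 1.5x inflation gives the functional boxplot fences."""
    X = np.asarray(curves, dtype=float)
    m = max(1, int(np.ceil(p * X.shape[0])))
    idx = np.argsort(depth)[::-1][:m]       # indices of the m deepest curves
    return X[idx].min(axis=0), X[idx].max(axis=0)
```

Ranking curves by MBD gives the median curve (the deepest), the central regions, and, via the 1.5 times the 50% central region rule, the outlier candidates drawn in the boxplot.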

133 Functional Box Plot The enhanced functional boxplot of SST with 1. dark magenta denoting the 25% central region, 2. magenta representing the 50% central region and 3. pink indicating the 75% central region.

134 Functional & Pointwise Boxplots The pointwise boxplots of SST with medians connected by a black line.

135 Pros & Cons Pros: fast computation; clear visualization. Cons: the choice of BD or MBD may not be optimal; poor performance for shape outliers. The command fbplot for functional boxplots is in the fda R package, and MATLAB code is also available.

136 Selected References 1. Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data, 3rd ed. New York: Wiley. 2. Billor, N., Hadi, A. and Velleman, P. (2000). BACON: Blocked adaptive computationally efficient outlier nominators. Computational Statistics and Data Analysis, 34. 3. Filzmoser, P., Maronna, R. and Werner, M. (2008). Outlier identification in high dimensions. Computational Statistics and Data Analysis, 52. 4. Hyndman, R. J. and Shang, H. L. (2010). Rainbow plots, bagplots, and boxplots for functional data. Journal of Computational and Graphical Statistics, 19(1). 5. López-Pintado, S. and Romo, J. (2009). On the concept of depth for functional data. Journal of the American Statistical Association, 104(486). 6. Sun, Y. and Genton, M. G. (2011). Functional boxplots. Journal of Computational and Graphical Statistics, 20. 7. Sawant, P., Billor, N. and Shin, H. (2012). Functional outlier detection with robust functional principal component analysis. Computational Statistics, 27.


Chuck Cartledge, PhD. 20 January 2018 Big Data: Data Analysis Boot Camp Visualizing the Iris Dataset Chuck Cartledge, PhD 20 January 2018 1/31 Table of contents (1 of 1) 1 Intro. 2 Histograms Background 3 Scatter plots 4 Box plots 5 Outliers

More information

SELECTION OF A MULTIVARIATE CALIBRATION METHOD

SELECTION OF A MULTIVARIATE CALIBRATION METHOD SELECTION OF A MULTIVARIATE CALIBRATION METHOD 0. Aim of this document Different types of multivariate calibration methods are available. The aim of this document is to help the user select the proper

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Final Report for cs229: Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Abstract. The goal of this work is to use machine learning to understand

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

Regression III: Advanced Methods

Regression III: Advanced Methods Lecture 3: Distributions Regression III: Advanced Methods William G. Jacoby Michigan State University Goals of the lecture Examine data in graphical form Graphs for looking at univariate distributions

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years.

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years. 3: Summary Statistics Notation Consider these 10 ages (in years): 1 4 5 11 30 50 8 7 4 5 The symbol n represents the sample size (n = 10). The capital letter X denotes the variable. x i represents the

More information

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable. 5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Hierarchical Clustering and Outlier Detection Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. Admin Assignment 2 is due

More information

Dimension Reduction CS534

Dimension Reduction CS534 Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of

More information

On the Impact of Outliers on High-Dimensional Data Analysis Methods for Face Recognition

On the Impact of Outliers on High-Dimensional Data Analysis Methods for Face Recognition On the Impact of Outliers on High-Dimensional Data Analysis Methods for Face Recognition ABSTRACT Sid-Ahmed Berrani France Telecom R&D TECH/IRIS 4, rue du Clos Courtel BP 91226 35512 Cesson Sévigné Cedex,

More information

Variable Selection 6.783, Biomedical Decision Support

Variable Selection 6.783, Biomedical Decision Support 6.783, Biomedical Decision Support (lrosasco@mit.edu) Department of Brain and Cognitive Science- MIT November 2, 2009 About this class Why selecting variables Approaches to variable selection Sparsity-based

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. 1 CHAPTER 1 Introduction Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data. Variable: Any characteristic of a person or thing that can be expressed

More information

Detection of Anomalies using Online Oversampling PCA

Detection of Anomalies using Online Oversampling PCA Detection of Anomalies using Online Oversampling PCA Miss Supriya A. Bagane, Prof. Sonali Patil Abstract Anomaly detection is the process of identifying unexpected behavior and it is an important research

More information

Visual Representations for Machine Learning

Visual Representations for Machine Learning Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information