Package ldbod. May 26, 2017

Similar documents
Package rmi. R topics documented: August 2, Title Mutual Information Estimators Version Author Isaac Michaud [cre, aut]

Anomaly Detection on Data Streams with High Dimensional Data Environment

Package bisect. April 16, 2018

Chapter 5: Outlier Detection

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Detection of Outliers

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

Package fusionclust. September 19, 2017

Machine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017

Package BayesCR. September 11, 2017

Package RaPKod. February 5, 2018

Package pwrab. R topics documented: June 6, Type Package Title Power Analysis for AB Testing Version 0.1.0

Package kdetrees. February 20, 2015

Package Kernelheaping

Mean-shift outlier detection

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

Package RandPro. January 10, 2018

Package vinereg. August 10, 2018

OUTLIER MINING IN HIGH DIMENSIONAL DATASETS

Package edci. May 16, 2018

Computer Department, Savitribai Phule Pune University, Nashik, Maharashtra, India

Package Numero. November 24, 2018

PCA Based Anomaly Detection

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Package ECctmc. May 1, 2018

Package SSLASSO. August 28, 2018

Computer Technology Department, Sanjivani K. B. P. Polytechnic, Kopargaon

Package spark. July 21, 2017

Adaptive Sampling and Learning for Unsupervised Outlier Detection

Package fbroc. August 29, 2016

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

Package MatchIt. April 18, 2017

Package apastyle. March 29, 2017

Package robustreg. R topics documented: April 27, Version Date Title Robust Regression Functions

Package TPD. June 14, 2018

A Meta analysis study of outlier detection methods in classification

Package nonlinearicp

A Fast Randomized Method for Local Density-based Outlier Detection in High Dimensional Data

Package flsa. February 19, 2015

Package texteffect. November 27, 2017

AN IMPROVED DENSITY BASED k-means ALGORITHM

Package Modeler. May 18, 2018

I. INTRODUCTION II. RELATED WORK.

Package endogenous. October 29, 2016

Package LVGP. November 14, 2018

Package weco. May 4, 2018

Package cgh. R topics documented: February 19, 2015

Package kdevine. May 19, 2017

Package StVAR. February 11, 2017

Package esabcv. May 29, 2015

Applications of the k-nearest neighbor method for regression and resampling

Package SPIn. R topics documented: February 19, Type Package

Package hsiccca. February 20, 2015

Package mhde. October 23, 2015

Package ramcmc. May 29, 2018

Statistics 202: Data Mining. c Jonathan Taylor. Outliers Based in part on slides from textbook, slides of Susan Holmes.

Package DPBBM. September 29, 2016

Package restlos. June 18, 2013

Package datasets.load

Package densityclust

Package coga. May 8, 2018

Package mvpot. R topics documented: April 9, Type Package

Package nlnet. April 8, 2018

Unsupervised learning on Color Images

Package DCEM. September 30, 2018

Package dissutils. August 29, 2016

Contents. Preface to the Second Edition

Package ScottKnottESD

Package robets. March 6, Type Package

Package gibbs.met. February 19, 2015

Package ClusterBootstrap

Package deductive. June 2, 2017

Package ICSShiny. April 1, 2018

Package MOST. November 9, 2017

Keywords: Clustering, Anomaly Detection, Multivariate Outlier Detection, Mixture Model, EM, Visualization, Explanation, Mineset.

Package MfUSampler. June 13, 2017

Package kirby21.base

Package simr. April 30, 2018

Package dirmcmc. R topics documented: February 27, 2017

Package PDN. November 4, 2017

Package birtr. October 4, 2017

Package Kernelheaping

Package frequencyconnectedness

Package milr. June 8, 2017

Package SCORPIUS. June 29, 2018

An Enhanced Density Clustering Algorithm for Datasets with Complex Structures

Package condir. R topics documented: February 15, 2017

Package BigVAR. R topics documented: April 3, Type Package

Package bdots. March 12, 2018

Package feature. R topics documented: October 26, Version Date

Filtered Clustering Based on Local Outlier Factor in Data Mining

Package intcensroc. May 2, 2018

Package GFD. January 4, 2018

Package MIICD. May 27, 2017

Package PCADSC. April 19, 2017

Package strat. November 23, 2016

Package rpst. June 6, 2017

Package modmarg. R topics documented:

Package samplesizelogisticcasecontrol

Package alphashape3d

Transcription:

Type Package Title Local Density-Based Outlier Detection Version 0.1.2 Author Kristopher Williams Package ldbod May 26, 2017 Maintainer Kristopher Williams <kristopher.williams83@gmail.com> Description Flexible procedures to compute local density-based outlier scores for ranking outliers. Both exact and approximate nearest neighbor search can be implemented, while also accommodating multiple neighborhood sizes and four different local density-based methods. It allows for referencing a random subsample of the input data or a user specified reference data set to compute outlier scores against, so both unsupervised and semi-supervised outlier detection can be implemented. Depends R (>= 3.2.0) Imports stats, RANN, mnormt License GPL-3 URL https://github.com/kwilliams83/ldbod LazyData TRUE RoxygenNote 6.0.1 NeedsCompilation no Repository CRAN Date/Publication 2017-05-26 06:04:25 UTC R topics documented: ldbod............................................ 2 ldbod.ref........................................... 4 Index 8 1

2 ldbod ldbod Local Density-Based Outlier Detection using Subsampling with Approximate Nearest Neighbor Search Description This function computes local density-based outlier scores for input data. Usage ldbod(x, k = c(10, 20), nsub = nrow(x), method = c("lof", "ldf", "rkof", "lpdf"), ldf.param = c(h = 1, c = 0.1), rkof.param = c(alpha = 1, C = 1, sig2 = 1), lpdf.param = c(cov.type = "full", sigma2 = 1e-05, tmax = 1, v = 1), treetype = "kd", searchtype = "standard", eps = 0, scale.data = TRUE) Arguments X k nsub method ldf.param rkof.param lpdf.param treetype searchtype eps scale.data An n x p data matrix to compute outlier scores A vector of neighborhood sizes, k must be less than nsub Subsample size, nsub must be greater than k. Usually nsub = 0.10*n or larger is recommended. Default is nsub = n Character vector specifying the local density-based method(s) to compute. User can specify more than one method. By default all methods are computed Vector of parameters for method LDF. h is the positive bandwidth parameter and c is a positive scaling constant. Default values are h=1 and c=0.1 Vector of parameters for method RKOF. C is the postive bandwidth paramter, alpha is a sensitiveity parameter in the interval [0,1], and sig2 is the variance parameter. Default values are alpha=1, C=1, sig2=1 Vector of paramters for method LPDF. cov.type is the covariance parameterization type, which users can specifiy as either full or diag. sigma2 is the positive regularization parameter, tmax is the maximum number of updates, and v is the degrees of freedom for the multivariate t distribution. Default values are cov.type = full,tmax=1, sigma2=1e-5, and v=1. Character vector specifiying tree method. Either kd or bd tree may be specified. Default is kd. Refer to documentation for RANN package. Character vector specifiying knn search type. Default value is "standard". Refer to documentation for RANN package. Error bound. Default is 0.0 which implies exact nearest neighgour search. Refer to documentation for RANN package. Logical value indicating to scale each feature of X using standard noramlization with mean 0 and standard deviation of 1

ldbod 3 Details Value Computes the local density-based outlier scores for input data, X, referencing a random subsample of X. The subsampled data set is constructed by randomly drawning nsub samples from X without replacement. Four different methods can be implemented LOF, LDF, RKOF, and LPDF. Each method specified returns densities and relative densities. Methods LDF and RKOF uses guassian kernels, and method LDPF uses multivarite t distribution. Outlier scores returned are positive except for lpde and lpdr which are log scaled densities (natural log). Score lpdr has shown to be highly sensitive to k. All knn computations are carried out using the nn2() function from the RANN package. Multivariate t densities are computed using the dmt() function from the mnormt package. Refer to specific packages for more details. Note: all neighborhoods are strickly of size k; therefore, the algorithms for LOF, LDF, and RKOF are not exact implementations, but algorithms are similiar for most situation and are equivalent when distance to k-th nearest neighbor is unique. If there are many duplicate data points, then implementation of algorithms could lead to dramatically different (positive or negative) results than those that allow neighborhood sizes larger than k, especially if k is relatively small. Removing duplicates is recommended before computing outlier scores unless there is good reason to keep them. The algorithm can be used to compute an ensemble of outlier scores by using multiple k values and/or iterating over multiple subsamples. A list of length 9 with the elements: lrd An n x length(k) matrix where each column vector represents the local reachabiility denity (LRD) outlier scores for each specifed lof An n x length(k) matrix where each column vector represents the local outlier factor (LOF) outlier scores for each specifed k value. Larger values indicate a point in more outlying. lde An n x length(k) matrix where each column vector represents the local density estimate (LDE) outlier scores for each specifed ldf An n x length(k) matrix where each column vector represents the local density factor (LDF) outlier scores for each specifed k value. Larger values indicate a point in more outlying. kde An n x length(k) matrix where each column vector represents the kernel density estimate (KDE) outlier scores for each specifed rkof An n x length(k) matrix where each column vector represents the robust kernel density factor (RKOF) outlier scores for each specifed k value. Larger values indicate a point in more outlying. lpde An n x length(k) matrix where each column vector represents the local parametric density estimate (LPDE) outlier scores for each specifed k value on log scale. Smaller values indicate a point in more outlying. lpdf An n x length(k) matrix where each column vector represents the local parametric density factor (LPDF) outlier scores for each specifed k value. Smaller values indicate a point in more outlying. lpdr An n x length(k) matrix where each column vector represents the local parametric density ratio (LPDR) outlier scores for each specifed k value. Smaller values indicate a point in more outlying. LPDR is typically used to detect groups of outliers. If a method is not specified then returns NULL

4 ldbod.ref References M. M. Breunig, H-P. Kriegel, R.T. Ng, and J. Sander (2000). LOF: Identifying density-based local outliers. In Proc. of ACM International Conference on Knowledge Discovery and Data Mining, 93-104. L. J. Latecki, A. Lazarevic, and D. Pokrajac (2007). Outlier Detection with kernel density functions. In Proc. of Machine Learning and Data Mining in Pattern Recognition, 61-75 J. Gao, W. Hu, Z. Zhang, X. Zhang, and O. Wu (2011). RKOF: Robust kernel-based local outlier detection. In Proc. of Advances in Knowledge Discovery and Data Mining, 270-283. K. T. Williams (2016). Local parametric density-based outlier deteciton and ensemble learning with application to malware detection. PhD Dissertation. The University of Texas at San Antonio. Examples # 500 x 2 data matrix X <- matrix(rnorm(1000),500,2) # five outliers outliers <- matrix(c(rnorm(2,20),rnorm(2,-12),rnorm(2,-8),rnorm(2,-5),rnorm(2,9)),5,2) X <- rbind(x,outliers) # compute outlier scores without subsampling for all methods using neighborhood size of 50 scores <- ldbod(x, k=50) head(scores$lrd); head(scores$rkof) # plot data and highlight top 5 outliers retured by lof top5outliers <- X[order(scores$lof,decreasing=TRUE)[1:5],] # plot data and highlight top 5 outliers retured by outlier score lpde top5outliers <- X[order(scores$lpde,decreasing=FALSE)[1:5],] # compute outlier scores for k= 10,20 with 10% subsampling for methods 'lof' and 'lpdf' scores <- ldbod(x, k = c(10,20), nsub = 0.10*nrow(X), method = c('lof','lpdf')) # plot data and highlight top 5 outliers retuned by lof for k=20 top5outliers <- X[order(scores$lof[,2],decreasing=TRUE)[1:5],] ldbod.ref Local Density-Based Outlier Detection using Reference Data with Approximate Nearest Neighbor Search

ldbod.ref 5 Description This function computes local density-based outlier scores for input data and user specified reference set. Usage ldbod.ref(x, Y, k = c(10, 20), method = c("lof", "ldf", "rkof", "lpdf"), ldf.param = c(h = 1, c = 0.1), rkof.param = c(alpha = 1, C = 1, sig2 = 1), lpdf.param = c(cov.type = "full", sigma2 = 1e-05, tmax = 1, v = 1), treetype = "kd", searchtype = "standard", eps = 0, scale.data = TRUE) Arguments X Y An n x p data matrix to compute outlier scores An m x p reference data matrix. k A vector of neighborhood sizes, k must be less than m. method ldf.param rkof.param lpdf.param treetype searchtype eps Details scale.data Character vector specifying the local density-based method(s) to compute. User can specify more than one method. By default all methods are computed Vector of parameters for method LDF. h is the positive bandwidth parameter and c is a positive scaling constant. Default values are h=1 and c=0.1 Vector of parameters for method RKOF. C is the postive bandwidth paramter, alpha is a sensitiveity parameter in the interval [0,1], and sig2 is the variance parameter. Default values are alpha=1, C=1, sig2=1 Vector of paramters for method LPDF. cov.type is the covariance parameterization type, which users can specifiy as either full or diag. sigma2 is the positive regularization parameter, tmax is the maximum number of updates, and v is the degrees of freedom for the multivariate t distribution. Default values are cov.type = full,tmax=1, sigma2=1e-5, and v=1. Character vector specifiying tree method. Either kd or bd tree may be specified. Default is kd. Refer to documentation for RANN package. Character vector specifiying knn search type. Default value is "standard". Refer to documentation for RANN package. Error bound. Default is 0.0 which implies exact nearest neighgour search. Refer to documentation for RANN package. Logical value indicating to scale each feature of X using standard noramlization based on mean and standard deviation for features of Y. Computes local density-based outlier scores for input data, X, referencing data Y. For semi-supervised outlier detection Y would be a set of "normal" reference points; otherwise, Y can be any other set of reference points of interest. This allows users the flexibility to reference other data sets besides X or a subset of X. Four different methods can be implemented LOF, LDF, RKOF, and LPDF. Each method specified returns densities and relative densities. Methods LDF and RKOF uses guassian kernels, and method LDPF uses multivarite t distribution. Outlier scores returned are non-negative

6 ldbod.ref Value except for lpde and lpdr which are log scaled densities (natural log). Note: Outlier score lpdr is strictly designed for unsupervised outlier detection and should not be used in the semi-supervised setting. Refer to references for more details about each method. All knn computations are carried out using the nn2() function from the RANN package. Multivariate t densities are computed using the dmt() function from the mnormt package. Refer to specific packages for more details. Note: all neighborhoods are strickly of size k; therefore, the algorithms for LOF, LDF, and RKOF are not exact implementations, but algorithms are similiar for most situation and are equivalent when distance to k-th nearest neighbor is unique. If there are many duplicate data points in Y, then implementation of algorithms could lead to dramatically different (positive or negative) results than those that allow neighborhood sizes larger than k, especially if k is relatively small. Removing duplicates is recommended before computing outlier scores unless there is good reason to keep them. The algorithm can be used to compute an ensemble of unsupervised outlier scores by using multiple k values and/or multiple iterations of reference data. A list of length 9 with the elements: lrd An n x length(k) matrix where each column vector represents outlier scores for each specifed lof An n x length(k) matrix where each column vector represents outlier scores for each specifed k value. Larger values indicate a point in more outlying. lde An n x length(k) matrix where each column vector represents outlier scores for each specifed ldf An n x length(k) matrix where each column vector represents outlier scores for each specifed k value. Larger values indicate a point in more outlying. kde An n x length(k) matrix where each column vector represents outlier scores for each specifed rkof An n x length(k) matrix where each column vector represents outlier scores for each specifed k value. Larger values indicate a point in more outlying. lpde An n x length(k) matrix where each column vector represents outlier scores for each specifed lpdf An n x length(k) matrix where each column vector represents outlier scores for each specifed lpdr An n x length(k) matrix where each column vector represents outlier scores for each specifed If a method is not specified then returns NULL References M. M. Breunig, H-P. Kriegel, R.T. Ng, and J. Sander (2000). LOF: Identifying density-based local outliers. In Proc. of ACM International Conference on Knowledge Discovery and Data Mining, 93-104. L. J. Latecki, A. Lazarevic, and D. Pokrajac (2007). Outlier Detection with kernel density functions. In Proc. of Machine Learning and Data Mining in Pattern Recognition, 61-75

ldbod.ref 7 J. Gao, W. Hu, Z. Zhang, X. Zhang, and O. Wu (2011). RKOF: Robust kernel-based local outlier detection. In Proc. of Advances in Knowledge Discovery and Data Mining, 270-283. K. T. Williams (2016). Local parametric density-based outlier deteciton and ensemble learning with application to malware detection. PhD Dissertation. The University of Texas at San Antonio. Examples # 500 x 2 data matrix X <- matrix(rnorm(1000),500,2) Y <- X # five outliers outliers <- matrix(c(rnorm(2,20),rnorm(2,-12),rnorm(2,-8),rnorm(2,-5),rnorm(2,9)),5,2) X <- rbind(x,outliers) # compute outlier scores referencing Y for all methods using a neighborhood size of 50 scores <- ldbod.ref(x,y, k=50) head(scores$lrd); head(scores$rkof) # plot data and highlight top 5 outliers retured by lof top5outliers <- X[order(scores$lof,decreasing=TRUE)[1:5],] # plot data and highlight top 5 outliers retured by outlier score lpde top5outliers <- X[order(scores$lpde,decreasing=FALSE)[1:5],] # compute outlier scores for k= 10,20 referencing Y for methods 'lof' and 'lpdf' scores <- ldbod.ref(x,y, k = c(10,20), method = c('lof','lpdf')) # plot data and highlight top 5 outliers retuned by lof for k=20 top5outliers <- X[order(scores$lof[,2],decreasing=TRUE)[1:5],]

Index ldbod, 2 ldbod.ref, 4 8