Package RaPKod. February 5, 2018

Similar documents
Package robustreg. R topics documented: April 27, Version Date Title Robust Regression Functions

Package ECctmc. May 1, 2018

Package inca. February 13, 2018

Package strat. November 23, 2016

Package orthodr. R topics documented: March 26, Type Package

Package densityclust

Package rococo. October 12, 2018

Package TVsMiss. April 5, 2018

Package rococo. August 29, 2013

Package CompGLM. April 29, 2018

Package kdevine. May 19, 2017

Package PUlasso. April 7, 2018

Package acebayes. R topics documented: November 21, Type Package

Package FPDclustering

Package PrivateLR. March 20, 2018

Package diagis. January 25, 2018

Package hsiccca. February 20, 2015

Package dissutils. August 29, 2016

Package GauPro. September 11, 2017

Package paralleldist

Package Rsomoclu. January 3, 2019

Package datasets.load

Package kernelfactory

Package ramcmc. May 29, 2018

Package naivebayes. R topics documented: January 3, Type Package. Title High Performance Implementation of the Naive Bayes Algorithm

Package flam. April 6, 2018

Package ipft. January 4, 2018

Package LVGP. November 14, 2018

Package rmi. R topics documented: August 2, Title Mutual Information Estimators Version Author Isaac Michaud [cre, aut]

Package pnmtrem. February 20, Index 9

Package validara. October 19, 2017

Package LINselect. June 9, 2018

Package apastyle. March 29, 2017

Package bayescl. April 14, 2017

Package misclassglm. September 3, 2016

Package madsim. December 7, 2016

Package gee. June 29, 2015

Package MixSim. April 29, 2017

Package gpr. February 20, 2015

Package distdichor. R topics documented: September 24, Type Package

Package libstabler. June 1, 2017

Package ldbod. May 26, 2017

Package spatialtaildep

Package bife. May 7, 2017

Package nmslibr. April 14, 2018

Package RPEnsemble. October 7, 2017

Package SimilaR. June 21, 2018

Package narray. January 28, 2018

Package intcensroc. May 2, 2018

Package mfbvar. December 28, Type Package Title Mixed-Frequency Bayesian VAR Models Version Date

Package sensory. R topics documented: February 23, Type Package

Package bisect. April 16, 2018

Package EBglmnet. January 30, 2016

Package coga. May 8, 2018

Package RSNPset. December 14, 2017

Package bigreg. July 25, 2016

Package geojsonsf. R topics documented: January 11, Type Package Title GeoJSON to Simple Feature Converter Version 1.3.

Package ordinalclust

Package svmpath. R topics documented: August 30, Title The SVM Path Algorithm Date Version Author Trevor Hastie

Package Numero. November 24, 2018

Package BigVAR. R topics documented: April 3, Type Package

Package Rvoterdistance

Package snplist. December 11, 2017

Package optimization

Package subplex. April 5, 2018

Package kirby21.base

Package Rtsne. April 14, 2017

Package bayesdp. July 10, 2018

Package kdetrees. February 20, 2015

Package areaplot. October 18, 2017

Package SparseFactorAnalysis

Package gwrr. February 20, 2015

Package GFD. January 4, 2018

Package fastrtext. December 10, 2017

Package TPD. June 14, 2018

Package spcadjust. September 29, 2016

Package ibm. August 29, 2016

Package glmnetutils. August 1, 2017

Package GADAG. April 11, 2017

Package readobj. November 2, 2017

Package CoClust. December 17, 2017

Package MatchIt. April 18, 2017

Package RcppEigen. February 7, 2018

Package RcppArmadillo

Package iccbeta. January 28, 2019

Package sbrl. R topics documented: July 29, 2016

Package RMCriteria. April 13, 2018

Package gibbs.met. February 19, 2015

Package sbf. R topics documented: February 20, Type Package Title Smooth Backfitting Version Date Author A. Arcagni, L.

Package FisherEM. February 19, 2015

Package assortnet. January 18, 2016

Package genelistpie. February 19, 2015

Package snn. August 23, 2015

Package rvinecopulib

Package ICSShiny. April 1, 2018

Package simsurv. May 18, 2018

Package fasttextr. May 12, 2017

Package sfc. August 29, 2016

Package dkdna. June 1, Description Compute diffusion kernels on DNA polymorphisms, including SNP and bi-allelic genotypes.

Package sgd. August 29, 2016

Transcription:

Package RaPKod February 5, 2018 Type Package Title Random Projection Kernel Outlier Detector Version 0.9 Date 2018-01-30 Author Jeremie Kellner Maintainer Jeremie Kellner <jeremie.kellner@ed.univ-lille1.fr> Description Kernel method that performs online outlier detection through random lowdimensional projections in a kernel space. Controls the probability of false alarm error. See Kellner J., ``Gaussian models and kernel methods'' (2016), PhD thesis for reference <https://orinuxeo.univ-lille1.fr/nuxeo/site/esupversions/789d119a-6763-4205-972f-b318cd4fdb27>. License GPL (>= 2.0) Imports Rcpp (>= 0.12.15) LinkingTo Rcpp, RcppArmadillo Depends proxy, kernlab, MASS NeedsCompilation yes Repository CRAN Date/Publication 2018-02-05 18:18:21 UTC R topics documented: RaPKod-package...................................... 2 od.opt.param........................................ 2 rapkod............................................ 4 Index 7 1

2 od.opt.param RaPKod-package Random Projection Kernel Outlier Detector Description The RaPKod package implements a kernel method made for outlier detection. Namely, given a data set of reference typical observation (non-outliers or inliers), it tests each new observation in an online way to determine whether it is an outlier or not. This method uses random low-dimensional projections in a kernel space to build a test statistic whose asymptotic null-distribution (ie when the tested observation is not an outlier) is known. The RaPKod method has two parameters: gamma - the hyperparameter of the (Gaussian) kernel used - and p - the dimensionality of the random projection in the kernel space. Details The package consists of two functions: the main function "rapkod" and the auxilary function "od.opt.param" which computes optimal parameters values in RaPKod. Author(s) Jeremie Kellner Maintainer: Jeremie Kellner <jeremie.kellner@ed.univ-lille1.fr> References Kellner J., "Gaussian Models and Kernel Methods", PhD thesis, Universite des Sciences et Technologies de Lille (defended on December 1st, 2016) See Also rapkod, od.opt.param od.opt.param Optimal Parameter Values In RaPKod Description Uses a heuristic formula to set optimal values for gamma and p. Usage od.opt.param(x, K1 = 6, K2 = 50, which.estim = "Gauss", RATIO = 0.1, randomize = TRUE, sub.n = floor(nrow(x)))

od.opt.param 3 Arguments X K1 Details Value a data frame or an n x d matrix. universal constant used in the heuristic formula of the optimal parameter gamma. K2 universal constant used in the heuristic formula of the optimal parameter p. which.estim RATIO randomize sub.n specifies the estimation method of the parameters: either "Gauss"(default) or "general". optional parameter used in estimation method "Gauss" optional parameter used in the estimation method "general". optional parameter used in the estimation method "general" if randomize=true. This function uses a heuristic formula to determine the optimal parameter values gamma and p, in the case when a Gaussian kernel is used. This formula is of the form gamma = K1 f 2/(d+2) 2 n 1/(d+2) and p = ceil(k2 f 2/(d+2) 2 n 2/(d+2) ), where f 2 is the L2-norm of the density function of non-outliers f and ceil(x) denotes the smallest integer larger than x. Two methods are proposed to estimate f 2 and are specified by the argument which.estim: "Gauss" and "general". If which.estim="gauss", the estimation is done as though f was a Gaussian density, which yields f 2/(d+2) 2 ) = (4 pi) 0.5 exp(0.5 mean(log(1/ev))), where ev are the covariance eigenvalues of the non-outlier distribution. Note that the eigenvalues smaller than ev[1] RAT IO (where ev[1] is the largest eigenvalue) are discarded to avoid numerical issues. If which.estim="general", f 2 is estimated without any assumption on f. However this method may fail in very high dimensions because of the dimensionality curse, since it relies on an estimation of the derivative of F at 0 where F is the cdf of the pairwise distance between two non-outliers.. Besides, to shorten the computation time, the optional argument randomize can be set as TRUE, so that only a subset of size sub.n of the data is considered to estimate the cdf F. See Also gamma.opt optimal value for gamma. p.opt optimal value for p. est.f2.pw rapkod Examples data(iris) estimation of f 2/(d+2) 2. ##Define data frame with non-outliers inliers = iris[sample(which(iris$species!="setosa"), 100, replace=false), -which(names(iris)=="species")]

4 rapkod param <- od.opt.param(inliers) #display optimal gamma param$gamma.opt #display optimal p param$p.opt rapkod RaPKod: Random Projections Kernel Outlier Detection Description Usage RaPKod is a kernel method for detecting outliers in a given dataset on the basis of a reference set of non-outliers. To do so, it transforms a tested observation into some kernel space (through a feature map ) and then projects it onto a random low-dimensional subspace of this kernel space. Since the distribution of this projection is known in the case of a non-outlier, it allows RaPKod to control the probability of false alarm error (ie labelling a non-outlier as an outlier). rapkod(x, given.kern = FALSE, ref.n=null, gamma=null, p=null, alpha = 0.05, use.tested.inlier = FALSE, lowrank = "No", r.lowrk = ceiling(sqrt(nrow(x))), K1 = 6, K2 = 50) Arguments X given.kern either a data frame or an n x d matrix (if given.kern=false), otherwise an n x n kernel matrix (if given.kern=true). In the former case, a Gaussian kernel is used by default. If FALSE (default), each row of X is an observation. Otherwise X is a kernel matrix (in this case, gamma and p must be user-specified). ref.n the size of the reference non-outlier dataset. Must be smaller than n. gamma the hyperparameter of the Gaussian kernel k(x, y) = exp( gamma x y 2 ). Set automatically by the program if not specified and given.kern=false. p the number of dimensions of the projection made in the kernel space. Set automatically by the program if not specified and given.kern=false. alpha the prescribed probability of false alarm error. use.tested.inlier If TRUE, each tested observation that is labelled as a non-outlier is appended to the reference dataset of non-outliers (the oldest reference non-outlier is discarded). Set to FALSE by default. lowrank if lowrank="no" (default), the full kernel matrix is used. Otherwise, a low-rank approximation of the kernel matrix is computed: if "Nyst", it is approximated through Nystrom method; if "RKS", it is approximated by random Kitchen Sinks (in this case, X must be a dataset matrix, not a kernel matrix)

rapkod 5 r.lowrk K1 Details Value if lowrank="nyst" or "RKS", specifies the (low) rank of the approximated kernel matrix. universal constant used in the heuristic formula of the optimal parameter gamma. K2 universal constant used in the heuristic formula of the optimal parameter p. If given.kern = FALSE, X is a dataset matrix whose first ref.n rows corresponds to the reference dataset of non-outliers. The (n - ref.n) other observations will be tested one by one by RaPKod to determine whether they are outliers or not. If given.kern = TRUE, X must be a n x n Gram matrix. The kernel used to compute this Gram matrix should be of the form k(x, y) = K(gamma x y 2 ) where K is a positive function. Also note that in this case, the parameters gamma and p must be specified by the user. stats flag pv gamma p a vector of length (n - ref.n) containing the test statistics for each tested observation. a vector of length (n - ref.n) indicating which observations have been labelled as an outlier (TRUE in this case). a vector of length (n - ref.n) containing p-values for each tested observation. the optimal value of gamma determined by the program (or the value provided by the user if it was user-specified). the optimal value of p determined by the program (or the value provided by the user if it was user-specified). See Also od.opt.param Examples data(iris) ##Define data frame with non-outliers inliers = iris[sample(which(iris$species!="setosa"), 100, replace=false), -which(names(iris)=="species")] ##Define data frame with outliers outliers = iris[which(iris$species=="setosa"),-which(names(iris)=="species")] X = rbind(inliers, outliers) ref.n = 50 result <- rapkod(x, ref.n = ref.n, use.tested.inlier = FALSE, alpha = 0.05) ##False alarm error ratio obtained on tested non-outliers (should be close to 0.05) mean(result$pv[1:(nrow(inliers)-ref.n)]<0.05, na.rm = TRUE)

6 rapkod ##Missed detection error ratio obtained on tested outliers (should be close to 0) mean(result$pv[-(1:(nrow(inliers)-ref.n))]>0.05, na.rm = TRUE)

Index od.opt.param, 2, 2, 5 RaPKod (RaPKod-package), 2 rapkod, 2, 3, 4 RaPKod-package, 2 7